
CHAPTER 5. SIGNATURE CONSTRUCTION

Signature construction in the training data set

After the gene filtering and gene selection steps, 10 candidate genes were selected to construct the gene signature: 5 genes X1, X2, ..., X5 selected by the correlation method, 1 hub gene X6, and 4 genes X7, X8, ..., X10 paired with the hub gene by the liquid association (LA) method. Wu et al. (2008) suggested applying the modified sliced inverse regression to the genes selected by the correlation method together with the LA hub genes. However, we found that in their data example the four genes with the greatest weights (in absolute value) were all selected by the correlation method, and the weights of the LA hub genes were relatively small. It might not be suitable to use only the LA hub genes without their paired genes, since under the LA methodology the change in survival time is related to the functionally associated pattern of the pair. Therefore, we considered that the genes paired with the LA hub gene were not negligible for survival prediction.

Since the correlation coefficient measures the linear dependency between two variables, the dimension reduction model assumption that the survival time Y depended on the gene expression profiles (X1, X2, ..., X5) only through their linear combinations was reasonable. Nevertheless, this assumption might not be suitable when both correlation genes and LA pair genes were present, because liquid association is a nonlinear concept. Thus, to incorporate the LA pairs into the dimension reduction model, we added the interaction terms of the LA pairs as regressors. However, we did not suggest adding all the interaction pairs. Although the significance of the LA hub gene was shown in the previous chapter, the genes paired with it might have been selected by chance, owing to the correlated structure of thousands of genes. Here we illustrate how this can happen with a simple simulation. First, we generated variables (X1, X2, ..., X5) from a multivariate normal distribution with mean 0, variance 1, and equal correlation 0.2 between every two variables. We independently generated another cluster of genes (Z1, Z2, ..., Z20) from a multivariate normal distribution with mean 0, variance 1, and equal correlation 0.7 between every two variables. The response variable Y was generated by

$$Y = \exp\left(0.5X_1 + 0.5X_2 + 0.5X_3 + 0.5X_4 + 0.5X_5Z_1 + (0.5)^{2}\epsilon\right),$$

where ϵ was generated from the standard normal distribution, independently of the X's and Z's.

Another 200 variables W1, W2, ..., W200 were generated independently from a multivariate normal distribution with mean 0 and covariance matrix $(\Sigma_W)_{ij} = 0.9^{|i-j|}$. Over 1,000 simulation runs, X5 appeared on average 6.52 times among the first 10 LA pairs, and in 873 runs it appeared more than once among the first 10 LA pairs. This showed that several paired genes might be found by chance even when there was only one true paired gene. Therefore, to be conservative, we did not incorporate all the LA pairs into the dimension reduction model.
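The data-generating step of the simulation described above can be sketched as follows. This is only an illustrative sketch: the function names and the sample size `n` are hypothetical (the text does not state the sample size per run), the noise coefficient is read as (0.5)^2, and the ranking of LA pairs itself is not reproduced here.

```python
import numpy as np


def equicorrelated_cov(p, rho):
    """Unit-variance covariance matrix with the same correlation rho for every pair."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))


def ar1_cov(p, rho):
    """Covariance matrix with entries rho ** |i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def simulate_once(n, rng):
    """Draw one simulated data set (X, Z, W, Y) of n samples.

    n is a hypothetical choice; the text does not give the sample size.
    """
    X = rng.multivariate_normal(np.zeros(5), equicorrelated_cov(5, 0.2), size=n)
    Z = rng.multivariate_normal(np.zeros(20), equicorrelated_cov(20, 0.7), size=n)
    W = rng.multivariate_normal(np.zeros(200), ar1_cov(200, 0.9), size=n)
    eps = rng.standard_normal(n)
    # Linear effects of X1..X4, one interaction X5 * Z1, and noise scaled by
    # (0.5)^2 (assumed reading of the noise term in the text).
    log_y = 0.5 * X[:, :4].sum(axis=1) + 0.5 * X[:, 4] * Z[:, 0] + 0.5 ** 2 * eps
    return X, Z, W, np.exp(log_y)
```

In such a design only the pair (X5, Z1) is truly associated with Y through an interaction, while the other Z's are strongly correlated with Z1, which is why several spurious partners of X5 can rank highly.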

Since our final goal was to derive a gene signature that predicts survival for patients of all stages and, especially, for stage I patients, our strategy was to incorporate the LA pair that most improved the prediction power of the derived signature. For the gene signature derived by the modified sliced inverse regression, we presented its prediction power in two ways. First, we used the gene signature as a continuous risk score r = βx, where β was the significant SIR direction, to fit the Cox proportional hazards model, $\lambda(y \mid r) = \lambda_0(y)\,e^{\gamma r}$, and calculated the p-value for testing the null hypothesis that the hazard ratio $e^{\gamma}$ was equal to 1. To quantify the prediction power, we calculated the concordance probability estimate (CPE) of the probability that the survival outcome agreed with the signature, P(Y1 > Y2 | γr1 ≤ γr2). Gönen and Heller (2005) showed that, under the Cox proportional hazards model, the concordance probability can be expressed as

$$P(Y_1 > Y_2 \mid \gamma r_1 \le \gamma r_2) = \frac{P(Y_1 > Y_2,\ \gamma r_1 \le \gamma r_2)}{P(\gamma r_1 \le \gamma r_2)} = \frac{\iint_{\gamma r_1 \le \gamma r_2} \left[1 + e^{\gamma (r_1 - r_2)}\right]^{-1} dF(r_1)\, dF(r_2)}{\iint_{\gamma r_1 \le \gamma r_2} dF(r_1)\, dF(r_2)},$$

where F denotes the distribution of the risk score and the last equality follows from the proportional hazards assumption. The concordance probability could then be estimated by

$$\mathrm{CPE}(\hat{\gamma}) = \frac{2}{n(n-1)} \sum_{i<j} \left\{ \frac{I\left(\hat{\gamma}(r_j - r_i) < 0\right)}{1 + e^{\hat{\gamma}(r_j - r_i)}} + \frac{I\left(\hat{\gamma}(r_i - r_j) < 0\right)}{1 + e^{\hat{\gamma}(r_i - r_j)}} \right\},$$

where n is the number of patients and I(·) is the indicator function. We noted that the concordance probability estimate ranges from 0.5 to 1. A CPE close to 1 indicated good prediction power of the signature, and a CPE close to 0.5 indicated poor prediction power. Second, we used the signature to separate the patients into two groups, high risk and low risk, by cutting at the median of the signature. We then used this categorical classifier to fit the Cox proportional hazards model and evaluated the hazard ratio, the corresponding p-value, and the concordance probability estimate. For the survival prediction of stage I patients, we used the same genes and the same coefficients β estimated from the samples of all stages to construct the signature for stage I patients. The two risk groups were separated by cutting at the median of the signature among the stage I samples only.
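As an illustration of the concordance probability estimate discussed above, the following is a minimal sketch of the Gönen and Heller (2005) estimator for a single covariate, assuming the Cox coefficient `gamma_hat` has already been estimated elsewhere. The function name and arguments are hypothetical and are not taken from the original analysis.

```python
import numpy as np


def concordance_probability_estimate(gamma_hat, r):
    """Gonen-Heller CPE for a fitted Cox coefficient and a vector of risk scores.

    For each pair i < j with d = gamma_hat * (r_i - r_j) != 0, exactly one of
    the two indicator terms is active and contributes 1 / (1 + exp(-|d|)).
    Tied pairs (d == 0) contribute nothing, following the strict inequality.
    """
    r = np.asarray(r, dtype=float)
    n = r.size
    i, j = np.triu_indices(n, k=1)
    d = gamma_hat * (r[i] - r[j])
    contrib = np.where(d != 0, 1.0 / (1.0 + np.exp(-np.abs(d))), 0.0)
    return 2.0 * contrib.sum() / (n * (n - 1))
```

Because each untied pair contributes a value between 0.5 and 1, the estimate lies in that range when there are no ties, and it does not depend on the observed survival times once the coefficient is fitted.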

In practice, we first started from only the five genes selected by the correlation method. We did not perform the normal quantile transformation on any variable in this part, since it is somewhat unrealistic to apply in the test set.

The expression profiles preprocessed by the MAS 5.0 Statistical algorithm were used as our raw data. We then took the log-2 transformation and centered each gene expression profile at its sample mean. For the five genes selected by the correlation method, we implemented the modified sliced inverse regression method and selected the only significant (p-value < 0.05) SIR direction by the large-sample chi-squared test. We projected the expression profiles onto this SIR direction to obtain the gene signature. The Kaplan-Meier survival functions for the two risk groups separated by the signature, the corresponding p-values, and the concordance probability estimates are given in Figure 5.1.

Figure 5.1: Kaplan-Meier survival curves for all stage and stage I samples in training data set separated by gene signature constructed by only five correlation genes. (Two panels; horizontal axis: time in months, 0 to 60; vertical axis: proportion alive, 0 to 1.)
All stage - 5 correlation genes: Low score (n=128), High score (n=128); Cat. p= 0, CPE= 0.61; Score p= 0, CPE= 0.67
Stage I - 5 correlation genes: Low score (n=79), High score (n=80); Cat. p= 0.0618, CPE= 0.56; Score p= 0.00093, CPE= 0.63
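To make the projection step above more concrete, the following is a minimal sketch of ordinary sliced inverse regression for an uncensored response. It is not the modified version for censored survival data used in this chapter, and it omits the large-sample chi-squared test; the function name and the number of slices are hypothetical choices for illustration.

```python
import numpy as np


def sir_directions(X, y, n_slices=10):
    """Ordinary SIR: eigen-decompose the covariance of slice means of whitened X.

    Returns eigenvalues in descending order and the matching directions
    (columns) expressed on the original scale of X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Whiten X with the inverse square root of its sample covariance.
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - X.mean(axis=0)) @ inv_sqrt

    # Slice the samples by the order of y and accumulate weighted slice means.
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    w, v = np.linalg.eigh(M)
    order = np.argsort(w)[::-1]
    return w[order], inv_sqrt @ v[:, order]
```

Under this sketch, a signature would be obtained by multiplying the centered expression matrix by the first returned direction; the sign of a direction is arbitrary and has to be fixed by convention.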

Figure 5.1 showed that the signature constructed from only the five correlation genes had significant prediction power for patients of all stages but, as a categorical classifier, not for the stage I patients alone (p = 0.0618). However, since survival prediction for early-stage patients is the more important issue, we wanted to incorporate an LA gene pair to improve the prediction power for the stage I samples. Each LA pair was incorporated into the dimension reduction model by adding its interaction term and its main effect terms. Specifically, we used X = (X1, X2, ..., X5, X6, Xi, X6Xi) as the regressors in the dimension reduction model, where i = 7, ..., 10. The interaction term was added to capture the nonlinear association of the LA pair with the survival time, and the main effect terms were added to adjust for possible miscentering of the interaction term. We applied the modified sliced inverse regression to each regressor set X to find the SIR direction. In each case, exactly one significant SIR direction was selected by the large-sample chi-squared test. The results are given in Figures 5.2 and 5.3.

Figure 5.2: Kaplan-Meier survival curves for all stage samples in training data separated by gene signature constructed by five correlation genes and one LA pair. (Four panels; horizontal axis: time in months, 0 to 60; vertical axis: proportion alive, 0 to 1; each panel compares Low score (n=128) with High score (n=128).)
All Stage - (SRP54,SART3) added: Cat. p= 0, CPE= 0.63; Score p= 0, CPE= 0.69
All Stage - (SRP54,NR2C1) added: Cat. p= 0, CPE= 0.63; Score p= 0, CPE= 0.69
All Stage - (SRP54,CROP) added: Cat. p= 0, CPE= 0.65; Score p= 0, CPE= 0.68
All Stage - (SRP54,PAWR) added: Cat. p= 0, CPE= 0.62; Score p= 0, CPE= 0.69
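The construction of the regressor sets described above, five correlation genes plus one LA pair with its interaction term, can be sketched as follows. The function names are hypothetical, and the sketch assumes that all inputs are already log-2 transformed and centered; whether the interaction is formed from centered values is an assumption, since the text does not say.

```python
import numpy as np


def la_pair_regressors(corr_genes, hub, partner):
    """Stack (X1..X5, X6, Xi, X6 * Xi) into one n x 8 regressor matrix."""
    corr_genes = np.asarray(corr_genes, dtype=float)
    hub = np.asarray(hub, dtype=float)
    partner = np.asarray(partner, dtype=float)
    return np.column_stack([corr_genes, hub, partner, hub * partner])


def candidate_designs(corr_genes, hub, partners):
    """Build one regressor matrix per candidate partner gene.

    `partners` maps a gene name to its expression vector; the dimension
    reduction step would then be run separately on each returned matrix.
    """
    return {
        name: la_pair_regressors(corr_genes, hub, values)
        for name, values in partners.items()
    }
```

Keeping the two main effect columns next to the product column means that a shift in the centering of either gene changes only the main effect coefficients, not the interaction coefficient.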

Figure 5.3 showed that incorporating an LA pair did improve the prediction power for stage I patients. We chose the best-performing gene signature, constructed from the five correlation genes and the LA pair (SRP54, PAWR), as our final gene signature. To combine our signature with the clinical covariates (TNM stage, sex, age) for survival prediction, we fitted them in a multivariate Cox proportional hazards model, where the TNM tumor stage was coded as a four-level factor as in Chapter 3. After stepwise selection, our gene signature, age, and TNM stage III were still significant in the multivariate Cox proportional hazards model. The estimated sliced inverse regression direction is given in Table 5.1, and the hazard ratios for the univariate and multivariate Cox proportional hazards models are given in Tables 5.2 and 5.3.

Figure 5.3: Kaplan-Meier survival curves for stage I samples in training data separated by gene signature constructed by five correlation genes and one LA pair. (Four panels; horizontal axis: time in months, 0 to 60; vertical axis: proportion alive, 0 to 1; each panel compares Low score (n=79) with High score (n=80).)
Stage I - (SRP54,SART3) added: Cat. p= 0.01003, CPE= 0.59; Score p= 0.00053, CPE= 0.64
Stage I - (SRP54,NR2C1) added: Cat. p= 0.02409, CPE= 0.58; Score p= 0.00026, CPE= 0.64
Stage I - (SRP54,CROP) added: Cat. p= 0.0032, CPE= 0.6; Score p= 0.00037, CPE= 0.64
Stage I - (SRP54,PAWR) added: Cat. p= 0.00095, CPE= 0.61; Score p= 0.00021, CPE= 0.65

The genes with negative coefficients are called protective genes, because an increase in their expression is associated with a decrease in the hazard ratio. Conversely, the genes with positive coefficients are called risk genes, because an increase in their expression is associated with an increase in the hazard ratio. Table 5.1 showed that these coefficients agreed with our results in the gene selection part: all the protective genes were positively correlated with the imputed survival time, and all the risk genes were negatively correlated with it. The coefficient of the interaction term of the LA pair was negative, which also agreed with its positive LA score. The absolute values of the coefficients, together with the standard deviations, indicate how strongly each gene affects our signature. Table 5.1 showed that the products of the coefficient and the standard deviation were of similar magnitude across the regressors (between 0.19 and 0.41 in absolute value), except for the main effect terms of the LA pair. Thus, all these genes contributed substantially to our gene signature. Table 5.2 showed the significance of our gene signature in the Cox proportional hazards model. Furthermore, the p-values of the multivariate Cox proportional hazards model (Table 5.3) showed that our gene signature remained significant even after we incorporated the TNM tumor stage and age.

Table 5.1: The estimated coefficients of SIR direction

Variable      SIR dir. coefficient   S.D.     SIR dir. coefficient * S.D.
TMEM66        -0.6457 (Protect)      0.3862   -0.2494
CSRP1         -0.4606 (Protect)      0.4738   -0.2182
BECN1         -1.1296 (Protect)      0.3671   -0.4147
FOSL2          0.2288 (Risk)         0.8292    0.1897
ERO1L          0.3333 (Risk)         0.9534    0.3178
(SRP54)        0.0253 ( - )          0.5447    0.0138
(PAWR)         0.1045 ( - )          0.6669    0.0697
SRP54*PAWR    -0.7517 (Protect)      0.4647   -0.3493
p-value        0.032

Table 5.2: Hazard ratio with the corresponding 95% confidence interval, p-value and the CPE of our gene signature

UM+MICH - All stage   Hazard ratio   95% C.I.       p-value    CPE
Risk score            2.22           (1.77, 2.79)   1.50e-13   0.688
Categorical           2.86           (1.96, 4.18)   1.20e-08   0.621

UM+MICH - Stage I     Hazard ratio   95% C.I.       p-value    CPE
Risk score            1.83           (1.31, 2.57)   0.0002     0.648
Categorical           2.55           (1.43, 4.52)   0.0009     0.610


Table 5.3: Hazard ratios with the corresponding 95% confidence intervals and p-values of our gene signature and clinical covariates

Multivariate   Hazard ratio   95% C.I.       p-value
Risk score     1.85           (1.48, 2.32)   6.98e-08
age            1.02           (1.01, 1.04)   9.66e-03
Stage IB       1.28           (0.73, 2.25)   3.84e-01
Stage II       2.62           (1.47, 4.69)   1.14e-03
Stage III      4.72           (2.68, 8.34)   8.41e-08

Multivariate   Hazard ratio   95% C.I.       p-value
Categorical    2.24           (1.51, 3.31)   5.60e-05
age            1.03           (1.01, 1.05)   7.38e-03
Stage IB       1.37           (0.78, 2.41)   2.74e-01
Stage II       2.79           (1.55, 5.02)   5.94e-03
Stage III      5.33           (3.04, 9.36)   5.57e-09

Chapter 6

Signature validation

6.1 Validation procedure

To test the prediction power of the gene signature derived from the training data, we reconstructed the signature in two independent testing data sets, CAN/DF and MSK. First, we applied the MAS 5.0 Statistical algorithm to obtain the raw data in the test sets. We chose the same probe sets selected from the training data and took the log-2 transformation as in the training data. Second, we centered the testing data at the sample means of the training data. Then, we used the same coefficients derived from the training data to combine the expression profiles into one gene signature as a risk score. We also separated the patients into two groups, high risk and low risk, by cutting at the median risk score in the testing set. To present the prediction power, both the continuous risk score and the categorical classifier were used to fit the Cox model, and the hazard ratios with the corresponding p-values and the CPE were evaluated. For the samples of stage I patients only, we applied the same procedure with the same probe sets and the same linear combination coefficients to obtain the risk score. The median risk score of the stage I patients in each testing set was used as the cutoff for the categorical classifier.
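The scoring steps of the validation procedure above can be sketched as follows. This is a hedged illustration rather than the original code: the function name, the column order (five correlation genes, then the hub gene, then its partner), and the way the interaction is formed from centered log-2 values are all assumptions.

```python
import numpy as np


def score_test_set(test_expr, train_means, coef):
    """Apply a training-derived signature to a test set.

    test_expr   : n x 7 raw expression values (assumed column order:
                  five correlation genes, hub gene, partner gene).
    train_means : length-7 training sample means of the log-2 profiles.
    coef        : length-8 training coefficients (7 main effects, then the
                  hub * partner interaction).
    """
    test_expr = np.asarray(test_expr, dtype=float)
    # Log-2 transform, then center at the TRAINING means, not the test means.
    centered = np.log2(test_expr) - np.asarray(train_means, dtype=float)
    # Assumption: the interaction is the product of the centered columns.
    interaction = centered[:, 5] * centered[:, 6]
    design = np.column_stack([centered, interaction])
    score = design @ np.asarray(coef, dtype=float)
    # The cutoff is the median score within this test set; scores equal to
    # the median fall into the low-risk group in this sketch.
    high_risk = score > np.median(score)
    return score, high_risk
```

For a stage I analysis, the same function would be applied to the stage I rows only, so that the median cutoff is computed within that subgroup.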
