Chapter 5 Signature construction

Signature construction in the training data set

After the gene filtering and gene selection steps, 10 candidate genes were selected to construct the gene signature: 5 genes X₁, X₂, ..., X₅ selected by the correlation method, 1 hub gene X₆, and 4 genes X₇, X₈, ..., X₁₀ paired with the hub gene and selected by the liquid association (LA) method. Wu et al. (2008) suggested applying the modified sliced inverse regression to the genes selected by the correlation method and the LA hub genes. However, we found that in their data example the four genes with the greatest weights (in absolute value) were selected by the correlation method, and the weights of the LA hub genes were relatively small. It might not be suitable to use only the LA hub genes without their paired genes, since the change in survival time was related to the functionally associated pattern on which the LA methodology is based. Therefore, we thought that the genes paired with the LA hub gene were not negligible for survival prediction.

Since the correlation coefficient measures the linear dependency of two variables, the dimension reduction model assumption that the survival time Y^◦ depended on the gene expression profiles (X₁, X₂, ..., X₅)^′ only through their linear combinations was reasonable. Nevertheless, this assumption might not be suitable when there were both correlation genes and LA pair genes, because liquid association is a nonlinear concept. Thus, to incorporate the LA pairs into the dimension reduction model, we added the interaction terms of the LA pairs as regressors. However, we did not suggest adding all the interaction pairs. Although the significance of the LA hub gene was shown in the previous chapter, the genes paired with it might have been selected by chance, owing to the correlated structure of thousands of genes.

Here we present how this could happen with a simple simulation. First, we generated variables (X₁, X₂, ..., X₅)^′ from a multivariate normal distribution with mean 0, variance 1 and equal correlation 0.2 between every two variables. We independently generated another cluster of genes (Z₁, Z₂, ..., Z₂₀)^′ from a multivariate normal distribution with mean 0, variance 1 and equal correlation 0.7 between every two variables. The response variable Y^◦ was generated by Y^◦ = exp(0.5X₁ + 0.5X₂ + 0.5X₃ + 0.5X₄ + 0.5X₅Z₁ + (0.5)²ϵ), where ϵ was generated from the standard normal distribution independently of the X's and Z's.

In addition, 200 independent variables W₁, W₂, ..., W₂₀₀ were generated from a multivariate normal distribution with mean 0 and covariance matrix (Σ_W)_ij = 0.9^|i−j|. Over 1,000 simulation runs, X₅ appeared on average 6.52 times among the first 10 LA pairs, and in 873 runs X₅ appeared more than once among the first 10 LA pairs. This showed that several paired genes might be found by chance even when there was only one true paired gene. Therefore, to be conservative, we did not incorporate all the LA pairs into the dimension reduction model.
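The data-generating step of this simulation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample size per run (200 here) is not given in the text and is hypothetical, the function names are invented for the example, and the LA ranking step itself is not reproduced. The equicorrelated and 0.9^|i−j| structures are produced with a shared-factor and an autoregressive construction, which give the stated covariance matrices.

```python
import math
import random


def equicorrelated(p, rho, rng):
    """One draw of p standard normals with common pairwise correlation rho (0 <= rho < 1)."""
    shared = rng.gauss(0.0, 1.0)  # factor shared by all p variables
    return [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(p)]


def ar1(p, rho, rng):
    """One draw of p standard normals with Corr(W_i, W_j) = rho ** |i - j|."""
    w = [rng.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        w.append(rho * w[-1] + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
    return w


def simulate_sample(rng):
    """One simulated subject: (x, z, w, y) following the model in the text."""
    x = equicorrelated(5, 0.2, rng)    # X1..X5
    z = equicorrelated(20, 0.7, rng)   # Z1..Z20, independent of the X's
    w = ar1(200, 0.9, rng)             # W1..W200
    eps = rng.gauss(0.0, 1.0)
    log_y = 0.5 * sum(x[:4]) + 0.5 * x[4] * z[0] + 0.5 ** 2 * eps
    return x, z, w, math.exp(log_y)


rng = random.Random(0)
n_samples = 200  # hypothetical: the per-run sample size is not stated in the text
data = [simulate_sample(rng) for _ in range(n_samples)]
```

Only Z₁ enters the response together with X₅, yet the other Z's are strongly correlated with Z₁, which is why several of them can rank highly as partners of X₅.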

Our final goal was to derive a gene signature to predict survival for all stage patients and, especially, for stage I patients. Our strategy was therefore to incorporate the LA pair that most improved the prediction power of the derived signature. For the gene signature derived by modified sliced inverse regression, the prediction power was presented in two ways. First, we used the gene signature as a continuous risk score r = β^′x, where β was the significant SIR direction, to fit the Cox proportional hazards model, λ^◦(y^◦ | r) = λ^◦₀(y^◦)e^(γr), and calculated the p-value for testing the null hypothesis that the hazard ratio e^γ was equal to 1. To present the prediction power, we calculated the concordance probability estimate (CPE) of the probability that the survival outcome agreed with the signature, P(Y₁^◦ > Y₂^◦ | γr₁ ≤ γr₂). Gönen and Heller (2005) showed that, under the Cox proportional hazards model, the concordance probability can be expressed by

\[
P(Y_1^\circ > Y_2^\circ \mid \gamma r_1 \le \gamma r_2)
= \frac{P(Y_1^\circ > Y_2^\circ,\ \gamma r_1 \le \gamma r_2)}{P(\gamma r_1 \le \gamma r_2)}
= \frac{E\left[ I(\gamma r_1 \le \gamma r_2)\,\left\{1 + e^{\gamma (r_1 - r_2)}\right\}^{-1} \right]}{P(\gamma r_1 \le \gamma r_2)},
\]

where the last equality follows from the proportional hazards assumption. Then the concordance probability could be estimated by

\[
\mathrm{CPE}(\hat\gamma) = \frac{2}{n(n-1)} \sum_{i<j} \left[ \frac{I\{\hat\gamma (r_i - r_j) < 0\}}{1 + e^{\hat\gamma (r_i - r_j)}} + \frac{I\{\hat\gamma (r_j - r_i) < 0\}}{1 + e^{\hat\gamma (r_j - r_i)}} \right],
\]

where n is the number of patients. We noted that the concordance probability estimate ranges from 0.5 to 1. A CPE close to 1 indicated good prediction power of the signature, and a CPE close to 0.5 indicated poor prediction power. Second, we used the signature to separate the patients into two groups, high risk and low risk, by cutting at the median of the signature. Then we used the categorical classifier to fit the Cox proportional hazards model and evaluated the hazard ratio, the corresponding p-value and the concordance probability estimate. For the survival prediction of stage I patients, we used the same genes and the same coefficients β estimated from the all stage samples to construct the signature for stage I patients. The two risk groups were separated by cutting at the median of the signature in the samples of stage I patients only.
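As a minimal sketch of the two summaries described above, the following Python code computes the concordance probability estimate of Gönen and Heller (2005) for a given Cox coefficient, and splits patients at the median score. The function names are hypothetical, the coefficient `gamma` is assumed to have been estimated elsewhere by a Cox fit, and assigning scores equal to the median to the high-risk group is an assumption (it matches the group sizes 79 and 80 reported for the stage I samples).

```python
import math
from statistics import median


def cpe(risk, gamma):
    """Concordance probability estimate for the linear predictor gamma * r.

    Each pair with different predictors contributes 1 / (1 + exp(-|d|)),
    where d is the difference of the two linear predictors; tied pairs
    contribute nothing, as with the strict indicator in the estimator.
    """
    n = len(risk)
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = gamma * (risk[i] - risk[j])
            if d != 0:
                total += 1.0 / (1.0 + math.exp(-abs(d)))
    return 2.0 * total / (n * (n - 1))


def split_at_median(risk):
    """Label each patient 'high' or 'low' risk by cutting at the median score."""
    cut = median(risk)
    # Assumption: a score equal to the median goes to the high-risk group.
    return ["high" if r >= cut else "low" for r in risk]


scores = [-1.2, -0.3, 0.1, 0.8, 1.5]   # hypothetical risk scores
print(cpe(scores, gamma=0.8))          # gamma would come from a Cox fit
print(split_at_median(scores))
```

A coefficient near zero makes every pairwise term close to 0.5, while a large coefficient pushes the terms towards 1, which is the behaviour behind the stated range of the CPE.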

In practice, we first started from only the five genes selected by the correlation method. We noted that we did not perform the normal quantile transformation on any variable in this part, since it would be somewhat unrealistic in the test set.

The expression profiles preprocessed by the MAS 5.0 Statistical algorithm were used as our raw data. Then we took the log-2 transformation and centered each gene expression profile at its sample mean. For the five genes selected by the correlation method, we implemented the modified sliced inverse regression method and selected the only significant (p-value < 0.05) SIR direction by the large sample chi-squared test. We projected the expression profiles on this SIR direction to obtain the gene signature. The Kaplan-Meier survival functions for the two risk groups separated by our signature, the corresponding p-values and the concordance probability estimates are given in Figure 5.1.

[Figure 5.1 panels; x-axis: Time (months), 0 to 60; y-axis: Proportion alive, 0.0 to 1.0.
All stage - 5 correlation genes: Low score (n=128), High score (n=128); Cat. p = 0, CPE = 0.61; Score p = 0, CPE = 0.67.
Stage I - 5 correlation genes: Low score (n=79), High score (n=80); Cat. p = 0.0618, CPE = 0.56; Score p = 0.00093, CPE = 0.63.]

Figure 5.1: Kaplan-Meier survival curves for all stage and stage I samples in the training data set, separated by the gene signature constructed from only the five correlation genes.
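For readers who want to reproduce curves of the kind shown in Figure 5.1, the Kaplan-Meier (product-limit) estimate for one risk group can be sketched as follows. This is a generic textbook estimator, not code from the thesis; the function name and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (event time, survival probability).

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if censored
    """
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


# Toy example: the patient censored at month 2 lowers the risk set without a drop.
low_risk_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Applying the function separately to the low-score and high-score groups gives the two step functions that are compared in each panel.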

Figure 5.1 showed that the signature constructed from only the five correlation genes had significant prediction power for all stage patients, but for stage I patients alone the categorical classifier was not significant (p = 0.0618). Since survival prediction for early stage patients is the more important issue, we wanted to incorporate an LA gene pair to improve the prediction power for the stage I samples. Each LA pair was incorporated into the dimension reduction model by adding its interaction term and main effect terms. Specifically, we used X = (X₁, X₂, ..., X₅, X₆, X_i, X₆X_i)^′ as the regressors in the dimension reduction model, where i = 7, ..., 10. The interaction term was added for the nonlinear association of the LA pair with the survival time, and the main effect terms were added to adjust for mis-centering of the interaction term. We applied the modified sliced inverse regression to each regressor X to find the SIR direction. In each case, there was exactly one significant SIR direction selected by the large sample chi-squared test. The results are given in Figures 5.2 and 5.3.

[Figure 5.2 panels; x-axis: Time (months), 0 to 60; y-axis: Proportion alive, 0.0 to 1.0; every panel: Low score (n=128), High score (n=128), Cat. p = 0, Score p = 0.
All Stage - (SRP54,SART3) added: Cat. CPE = 0.63, Score CPE = 0.69.
All Stage - (SRP54,NR2C1) added: Cat. CPE = 0.63, Score CPE = 0.69.
All Stage - (SRP54,CROP) added: Cat. CPE = 0.65, Score CPE = 0.68.
All Stage - (SRP54,PAWR) added: Cat. CPE = 0.62, Score CPE = 0.69.]

Figure 5.2: Kaplan-Meier survival curves for all stage samples in the training data, separated by the gene signature constructed from the five correlation genes and one LA pair.
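The regressor vector with one LA pair, X = (X₁, ..., X₅, X₆, X_i, X₆X_i)^′, can be assembled as in the sketch below. The function name and the numbers are hypothetical. One way to read the mis-centering remark is the identity (X₆ + a)(X_i + b) = X₆X_i + bX₆ + aX_i + ab: a shift in the centering of either gene only adds main-effect terms and a constant, so keeping the main effects among the regressors lets a linear combination absorb such a shift.

```python
def build_regressors(corr_genes, hub, partner):
    """Append hub, partner and their product to the correlation-gene profiles.

    corr_genes : one row of five centered expression values per sample
    hub        : centered expression of the LA hub gene, one value per sample
    partner    : centered expression of the paired gene, one value per sample
    """
    rows = []
    for genes, h, p in zip(corr_genes, hub, partner):
        rows.append(list(genes) + [h, p, h * p])
    return rows


# Hypothetical centered log-2 expression values for two samples.
corr_genes = [[0.1, -0.4, 0.3, 0.0, -0.2],
              [-0.3, 0.2, 0.1, 0.5, 0.4]]
hub = [0.6, -0.1]
partner = [-0.5, 0.7]
X = build_regressors(corr_genes, hub, partner)
```

Running the dimension reduction step once per candidate partner (i = 7, ..., 10) then amounts to calling this function four times with a different `partner` column.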

Figure 5.3 showed that incorporating an LA pair did improve the prediction power for stage I patients. We chose the best-performing gene signature, constructed from the five correlation genes and the LA pair (SRP54, PAWR), as our final gene signature. To combine our signature with the clinical covariates (TNM stage, sex, age) for survival prediction, we fitted them with the multivariate Cox proportional hazards model, where the TNM tumor stage was coded as a four-level factor as in Chapter 3. After a stepwise selection, our gene signature, age and TNM stage III were still significant in the multivariate Cox proportional hazards model. The estimated sliced inverse regression direction is given in Table 5.1, and the details of the hazard ratios for the univariate and multivariate Cox proportional hazards models are given in Tables 5.2 and 5.3.

[Figure 5.3 panels; x-axis: Time (months), 0 to 60; y-axis: Proportion alive, 0.0 to 1.0; every panel: Low score (n=79), High score (n=80).
Stage I - (SRP54,SART3) added: Cat. p = 0.01003, CPE = 0.59; Score p = 0.00053, CPE = 0.64.
Stage I - (SRP54,NR2C1) added: Cat. p = 0.02409, CPE = 0.58; Score p = 0.00026, CPE = 0.64.
Stage I - (SRP54,CROP) added: Cat. p = 0.0032, CPE = 0.6; Score p = 0.00037, CPE = 0.64.
Stage I - (SRP54,PAWR) added: Cat. p = 0.00095, CPE = 0.61; Score p = 0.00021, CPE = 0.65.]

Figure 5.3: Kaplan-Meier survival curves for stage I samples in the training data, separated by the gene signature constructed from the five correlation genes and one LA pair.

The genes with negative coefficients are called protect genes, because an increase in their expression is associated with a decrease in the hazard ratio. On the other hand, the genes with positive coefficients are called risk genes, because an increase in their expression is associated with an increase in the hazard ratio. Table 5.1 showed that these coefficients agreed with our results in the gene selection part. All the protect genes had a positive correlation with the imputed survival time, and all the risk genes had a negative correlation with the imputed survival time. The coefficient of the interaction term of the LA pair was negative, which also agreed with its positive LA score. The absolute values of the coefficients, together with the standard deviations, also reflected how strongly each gene affected our signature. Table 5.1 showed that the products of the coefficient and the standard deviation of the regressors were close to one another in magnitude (0.19 to 0.41), except for the main effect terms of the LA pair (0.01 and 0.07). Thus, all these genes had important effects on our gene signature. Table 5.2 showed the significance of our gene signature in the Cox proportional hazards model. Furthermore, the p-values of the multivariate Cox proportional hazards model (Table 5.3) showed that our gene signature was still significant even when we incorporated the TNM tumor stage and age.

Table 5.1: The estimated coefficients of the SIR direction

Variable      SIR dir. coefficient   S.D.     SIR dir. coefficient × S.D.
TMEM66        -0.6457 (Protect)      0.3862   -0.2494
CSRP1         -0.4606 (Protect)      0.4738   -0.2182
BECN1         -1.1296 (Protect)      0.3671   -0.4147
FOSL2          0.2288 (Risk)         0.8292    0.1897
ERO1L          0.3333 (Risk)         0.9534    0.3178
(SRP54)        0.0253 ( - )          0.5447    0.0138
(PAWR)         0.1045 ( - )          0.6669    0.0697
SRP54*PAWR    -0.7517 (Protect)      0.4647   -0.3493
p-value        0.032
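The last column of Table 5.1 is the coefficient multiplied by the standard deviation of the regressor, that is, the change in the signature for a one standard deviation change in that regressor. The short Python check below recomputes this column from the first two; the coefficient and S.D. values are copied from Table 5.1, while the variable and function names are hypothetical.

```python
# (SIR direction coefficient, standard deviation) copied from Table 5.1
TABLE_5_1 = {
    "TMEM66": (-0.6457, 0.3862),
    "CSRP1": (-0.4606, 0.4738),
    "BECN1": (-1.1296, 0.3671),
    "FOSL2": (0.2288, 0.8292),
    "ERO1L": (0.3333, 0.9534),
    "SRP54": (0.0253, 0.5447),
    "PAWR": (0.1045, 0.6669),
    "SRP54*PAWR": (-0.7517, 0.4647),
}


def standardized_weights(table):
    """Coefficient times S.D. for every regressor."""
    return {name: coef * sd for name, (coef, sd) in table.items()}


weights = standardized_weights(TABLE_5_1)
# List the regressors from the largest to the smallest absolute contribution.
for name, w in sorted(weights.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:12s} {w:+.4f}")
```

Sorted this way, the two main effect terms of the LA pair come last, while the interaction term ranks among the largest contributions.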

Table 5.2: Hazard ratios with the corresponding 95% confidence intervals, p-values and the CPE of our gene signature

UM+MICH - All stage   Hazard ratio   95% C.I.       p-value    CPE
Risk score            2.22           (1.77, 2.79)   1.50e-13   0.688
Categorical           2.86           (1.96, 4.18)   1.20e-08   0.621

UM+MICH - Stage I     Hazard ratio   95% C.I.       p-value    CPE
Risk score            1.83           (1.31, 2.57)   0.0002     0.648
Categorical           2.55           (1.43, 4.52)   0.0009     0.610

Table 5.3: Hazard ratios with the corresponding 95% confidence intervals and p-values of our gene signature and clinical covariates

Multivariate   Hazard ratio   95% C.I.       p-value
Risk score     1.85           (1.48, 2.32)   6.98e-08
age            1.02           (1.01, 1.04)   9.66e-03
Stage IB       1.28           (0.73, 2.25)   3.84e-01
Stage II       2.62           (1.47, 4.69)   1.14e-03
Stage III      4.72           (2.68, 8.34)   8.41e-08

Multivariate   Hazard ratio   95% C.I.       p-value
Categorical    2.24           (1.51, 3.31)   5.60e-05
age            1.03           (1.01, 1.05)   7.38e-03
Stage IB       1.37           (0.78, 2.41)   2.74e-01
Stage II       2.79           (1.55, 5.02)   5.94e-03
Stage III      5.33           (3.04, 9.36)   5.57e-09

Chapter 6 Signature validation

6.1 Validation procedure

To test the prediction power of the gene signature derived from the training data, we reconstructed it in two independent testing data sets, CAN/DF and MSK. First, we applied the MAS 5.0 Statistical algorithm to get the raw data in the test sets. We chose the same probe sets selected from the training data and took the log-2 transformation as in the training data. Second, we centered the testing data sets at the sample means of the training data. Then, we used the same coefficients derived from the training data to combine the expression profiles into one gene signature as a risk score. We also separated the patients into two groups, high risk and low risk, by cutting at the median risk score in the testing set. To present the prediction power, both the continuous risk score and the categorical classifier were used to fit the Cox model, and the hazard ratios with the corresponding p-values and the CPE were evaluated. For the samples of stage I patients only, we applied the same procedure with the same probe sets and linear combination coefficients to get the same risk score. The median risk score of the stage I patients in each testing set was used as the cutoff for the categorical classifier.
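The validation steps above can be sketched as follows. Everything specific in the code is an assumption made for illustration: the function names, the toy probe-set names "A" and "B", the numbers, and the choice to form the interaction term as the product of the two centered profiles without re-centering it. The point of the sketch is that nothing is re-estimated in the testing set except the median cutoff.

```python
import math
from statistics import mean, median


def training_means(raw_train):
    """Mean log-2 expression of each probe set in the training samples."""
    return {g: mean([math.log2(v) for v in vals]) for g, vals in raw_train.items()}


def risk_scores(raw, means, coef, interaction=None):
    """Risk score per sample, using training means and training coefficients.

    interaction : optional ((gene_a, gene_b), weight) for one product term;
                  both genes must also appear in coef (hypothetical layout).
    """
    centered = {g: [math.log2(v) - means[g] for v in raw[g]] for g in coef}
    n = len(next(iter(centered.values())))
    scores = [sum(coef[g] * centered[g][k] for g in coef) for k in range(n)]
    if interaction is not None:
        (a, b), weight = interaction
        scores = [s + weight * centered[a][k] * centered[b][k]
                  for k, s in enumerate(scores)]
    return scores


def classify(scores):
    """High/low risk by the median score of the data set being classified."""
    cut = median(scores)
    # Assumption: a score equal to the median goes to the high-risk group.
    return ["high" if s >= cut else "low" for s in scores]


# Toy raw intensities (hypothetical): two probe sets, two samples each.
train = {"A": [2.0, 8.0], "B": [4.0, 4.0]}
test = {"A": [2.0, 8.0], "B": [2.0, 8.0]}
means = training_means(train)
test_scores = risk_scores(test, means, {"A": 1.0, "B": 0.5})
test_groups = classify(test_scores)
```

For the stage I analysis, the same `risk_scores` output would be restricted to the stage I patients before calling `classify`, so that the cutoff is the stage I median.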
