CHAPTER 6. SIGNATURE VALIDATION
All testing results are summarized in Table 6.1. For comparison with our signature, Table 6.1 also includes the testing results obtained by using TNM tumor stage as a classifier and by the best-performing method (method A) from Shedden et al. (2008). Here the TNM tumor stage classifier simply separated the all-stage patients into two groups, stage I and late stage (II~IV), and separated the stage I patients into stage IA and IB. Method A gave a continuous risk score constructed by using the average expression profiles of 100 clusters to fit a ridge-penalized Cox proportional hazards model. Note that the results of method A were obtained from expression profiles preprocessed by applying the dChip algorithm to the entire training and testing data sets.
In the two testing sets, CAN/DF and MSK, all four methods performed well in predicting survival for all-stage patients. Only our gene signature, in both its continuous and categorical forms, had hazard ratios significantly greater than 1 for both all-stage and stage I patients in both data sets. The hazard ratio of method A was not significantly greater than 1 for stage I patients in the CAN/DF data. The hazard ratio of TNM tumor stage IA versus IB was not significantly greater than 1 in either testing set; furthermore, it was smaller than 1 in the CAN/DF data set. These results do not support using the IA/IB distinction as a classifier for stage I patients. In the external validation cohort from Duke University, the hazard ratios of our signature were also significantly greater than 1 for both the all-stage and the early-stage samples. We noticed that, in this data set, our signature performed better for stage I patients than for all patients. The Duke data set contained five stage IV and fifteen stage IIIB patients, whereas the training data set contained only 11 stage IIIB patients and no stage IV patients. Furthermore, five of these late-stage patients died within half a year. This may be one reason why the prediction for all patients was not as good as that of the TNM tumor stage in this data set.
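The statement "hazard ratio significantly greater than 1" can be made concrete with a short sketch. It assumes that the intervals in Table 6.1 are Wald intervals on the log hazard scale, which the text does not state; the function names are hypothetical and introduced only for illustration. Under that assumption, the hazard ratio is exp(beta), the interval is exp(beta ± 1.96 se), and the point estimate is the geometric mean of the two interval bounds.

```python
import math


def wald_interval(beta, se, z=1.96):
    """Hazard ratio and Wald interval from a Cox coefficient and its standard error."""
    hazard_ratio = math.exp(beta)
    return hazard_ratio, (math.exp(beta - z * se), math.exp(beta + z * se))


def coefficient_from_interval(lower, upper, z=1.96):
    """Recover (beta, se) from a reported interval, ASSUMING it is a Wald interval."""
    beta = (math.log(lower) + math.log(upper)) / 2.0
    se = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return beta, se


def excludes_one(lower, upper):
    """True when the interval lies entirely above or entirely below 1."""
    return lower > 1.0 or upper < 1.0


# Interval reported for the continuous risk score, CAN/DF all stage.
beta, se = coefficient_from_interval(1.17, 2.31)
hazard_ratio, interval = wald_interval(beta, se)
print(round(hazard_ratio, 2), excludes_one(*interval))

# Interval reported for TNM stage IA vs IB, CAN/DF stage I.
print(excludes_one(0.17, 1.80))
```

An interval that contains 1, such as the one for TNM stage IA versus IB in CAN/DF, corresponds to a hazard ratio that is not significantly different from 1 at the 5% level. The p-values in Table 6.1 may come from a different test (for example a likelihood ratio or log-rank test), so they need not match a Wald calculation exactly.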
Kaplan-Meier curves are given to illustrate the difference in survival functions between the high-risk and low-risk groups. Kaplan-Meier curves for classifying patients by TNM tumor stage are also given. Because of its small sample size, high censoring rate and relatively homogeneous samples, classifying CAN/DF into different risk groups was much harder than for the other data sets. The Kaplan-Meier curves showed that our signature had reasonably good prediction power even in such a data set. The significant p-value for the Duke validation set showed that our gene signature, derived from adenocarcinoma patients only, had the potential to predict survival for patients with different tumor types.
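The construction of such curves can be sketched as follows. This is a minimal illustration, not the code used for the figures: it assumes that the high-risk and low-risk groups are formed by cutting the continuous risk score at its median (the text does not state the cutoff), and the function names and toy data are hypothetical.

```python
def median_split(scores):
    """Label a sample 'high' if its score exceeds the median, otherwise 'low'."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2.0
    return ["high" if s > median else "low" for s in scores]


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (event time, survival probability).

    times  : follow-up time of each patient (e.g. in months)
    events : 1 if the patient died at that time, 0 if censored
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy data, invented for illustration only.
scores = [0.1, 0.9, 0.5, 0.3, 0.8, 0.2]
times = [50, 12, 30, 60, 8, 45]
events = [0, 1, 1, 0, 1, 0]

groups = median_split(scores)
for label in ("low", "high"):
    t = [x for x, g in zip(times, groups) if g == label]
    e = [x for x, g in zip(events, groups) if g == label]
    print(label, kaplan_meier(t, e))
```

Censored patients stay in the risk set until their last follow-up time but never trigger a step in the curve, which is why a high censoring rate, as in CAN/DF, leaves few steps with which to separate the groups.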
We concluded that our gene signature had good prediction power for both all-stage and early-stage non-small cell lung cancer patients.
Table 6.1: Validation results in CAN/DF, MSK and Duke data sets

Data set / model      Hazard ratio   95% C.I.         p-value   CPE
CAN/DF, all stage
  Risk score          1.65           (1.17, 2.31)     0.002     0.662
  Categorical         3.96           (1.68, 9.34)     0.001     0.651
  TNM stage           3.25           (1.54, 6.84)     0.002     0.616
  Method A            0.57           (1.20, 2.60)     0.003     0.623
CAN/DF, stage I
  Risk score          1.59           (1.01, 2.51)     0.036     0.666
  Categorical         3.78           (1.04, 13.74)    0.027     0.648
  TNM stage           0.55           (0.17, 1.80)     0.347     0.546
  Method A            1.29           (0.84, 1.98)     0.243     0.574
MSK, all stage
  Risk score          1.68           (1.13, 2.51)     0.012     0.614
  Categorical         2.65           (1.29, 5.45)     0.006     0.614
  TNM stage           3.87           (1.91, 7.85)     <0.001    0.642
  Method A            1.83           (1.24, 2.70)     0.002     0.627
MSK, stage I
  Risk score          2.23           (1.14, 4.35)     0.023     0.654
  Categorical         11.89          (1.53, 92.16)    0.001     0.715
  TNM stage           2.60           (0.70, 9.63)     0.127     0.611
  Method A            2.10           (1.15, 3.84)     0.014     0.656
Duke, all stage
  Risk score          1.22           (1.02, 1.47)     0.032     0.580
  Categorical         1.71           (1.01, 2.87)     0.043     0.566
  TNM stage           2.17           (1.29, 3.63)     0.004     0.589
Duke, stage I
  Risk score          1.44           (1.10, 1.87)     0.007     0.635
  Categorical         2.93           (1.36, 6.34)     0.005     0.625
  TNM stage           1.97           (0.95, 4.10)     0.070     0.580
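
The CPE column can be illustrated with a small sketch. It assumes that CPE denotes the concordance probability estimate of Gönen and Heller for the Cox model, which this section does not define; the function name and the example values are hypothetical. Under that assumption, the estimate depends only on the fitted linear predictors (coefficient times covariate) and averages, over all pairs of patients, the model-based probability that the patient with the lower predicted risk survives longer.

```python
import math
from itertools import combinations


def concordance_probability(linear_predictors):
    """Concordance probability estimate from Cox linear predictors.

    Each pair with a non-zero difference d = |lp_i - lp_j| contributes
    1 / (1 + exp(-d)); tied pairs contribute 0, as in the estimator's
    strict-inequality indicator. The sum is averaged over all pairs.
    """
    pairs = list(combinations(linear_predictors, 2))
    if not pairs:
        raise ValueError("at least two samples are required")
    total = 0.0
    for a, b in pairs:
        diff = abs(a - b)
        if diff > 0.0:
            total += 1.0 / (1.0 + math.exp(-diff))
    return total / len(pairs)


# Hypothetical linear predictors for three patients.
print(concordance_probability([0.0, 1.0, 2.0]))
```

Values close to 0.5 indicate that the model barely separates patients, and values close to 1 indicate strong separation. Because the estimate uses only the linear predictors and not the observed survival times, it is not affected by the amount of censoring in a data set.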
[Figure: four panels (Training, Duke, CAN/DF and MSK, all stage); x-axis: Time (months), 0 to 60; y-axis: Proportion alive, 0.0 to 1.0; curves: low score vs. high score.]
Figure 6.1: Kaplan-Meier curves for all stage samples separated by gene signature
[Figure: four panels (Training, Duke, CAN/DF and MSK, stage I); x-axis: Time (months), 0 to 60; y-axis: Proportion alive, 0.0 to 1.0; curves: low score vs. high score.]
Figure 6.2: Kaplan-Meier curves for stage I samples separated by gene signature
[Figure: four panels (Training, Duke, CAN/DF and MSK, all stage); x-axis: Time (months), 0 to 60; y-axis: Proportion alive, 0.0 to 1.0; curves: stage I vs. stage II~IV.]
Figure 6.3: Kaplan-Meier curves for all stage samples separated by TNM stage
[Figure: four panels (Training, Duke, CAN/DF and MSK, stage I); x-axis: Time (months), 0 to 60; y-axis: Proportion alive, 0.0 to 1.0; curves: stage IA vs. stage IB.]
Figure 6.4: Kaplan-Meier curves for stage I samples separated by TNM stage
Chapter 7
Summary and discussion
7.1 Summary
We reanalyzed a large adenocarcinoma data set from Shedden et al. (2008) and derived a gene signature from seven gene expression profiles. Tested in two independent validation data sets, our gene signature had significant prediction power for the survival of all-stage patients and of stage I patients only. Furthermore, our signature also had significant prediction power in an external NSCLC data set that contained two different tumor cell types.
Most available analysis procedures comprise three components: initial data filtering, feature selection and signature construction. Our analysis procedure also contains three steps: gene filtering, gene selection and signature construction. Compared with other methods, our analysis procedure has several advantages. First, we proposed a new criterion to filter out inconsistent genes. Second, the gene selection and the signature construction are both supervised by patient survival. Third, some non-linear interaction structures are considered in our procedure. Fourth, the gene selection and signature construction parts of our procedure require the fewest model assumptions. Furthermore, the interpretation of our signature is easy and clear. However, some issues remain unsolved. Although our LA hub gene selection procedure shows the significance of the selected LA hub gene, the significance of its paired genes is not shown. A method to screen the selected paired genes is needed; in practice, however, we may rely on biological knowledge to choose the paired genes. In addition, only one LA pair is used in the signature construction part of our analysis procedure. If more LA pairs are selected, a better model for incorporating the LA pairs into the dimension reduction model will be needed.