
Real Examples

5.3 Breast Tissue Data


The third data set contains electrical impedance spectroscopy measurements of 106 breast tissue samples, which are originally divided into six categories. In Silva et al. (2000), all subjects are classified into two classes: 52 are non-diseased tissues and 54 are diseased tissues.

Here we consider the same classification. Nine biomarkers are available for disease detection. Full data information and records can be found at http://archive.ics.uci.edu/ml/datasets/Breast+Tissue.

The dimension of this example is too large for a grid search. Hence, the upper panel of Table 5.5 presents only the optimal linear combinations obtained via our algorithm, together with the linear combinations of Liu et al. (2005) for comparison. In Table 5.5, the linear combinations at t = 0.1 and t = 0.2 are almost the same, and the major contributing variables are the first, fourth, sixth, seventh, eighth and ninth markers. For larger t, the second and third biomarkers contribute the most to the linear combination. Overall, our linear combination performs better than that of Liu et al. (2005), and the difference in pAUC increases with t. From these results, we conclude that the optimal linear combination maximizing the pAUC can be sensitive to the specificity range.

In contrast, when using the AUC as the target criterion or the solution proposed by Liu et al. (2005), we always obtain a single solution that does not depend on t. Hence, the use of the pAUC criterion, together with a proper numerical algorithm for its calculation, can provide a more refined evaluation.

From the raw data of the breast tissue example, we find that some biomarkers take large values with correspondingly large variances, while other biomarkers do not have these characteristics. Hence, so that the ordering of the coefficients in the best linear combination does not merely reflect the variances of the biomarkers, we suggest standardizing the data before applying the proposed test-based biomarker selection. Different standardizations lead to different coefficients of the best linear combination and to different orderings of those coefficients. In the following step, every biomarker in the raw data is simply divided by its pooled sample standard deviation from the two groups, which puts the biomarkers on a more comparable scale. The marker selection analysis is therefore based on the standardized data.
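The standardization step described above can be sketched as follows. This is a minimal illustration, not the thesis's code: the function names, the toy data, and the use of the conventional pooled-variance formula with denominator n0 + n1 − 2 are all assumptions, since the text only states that each biomarker is divided by its pooled sample standard deviation.

```python
import numpy as np

def pooled_sd(x0, x1):
    """Pooled sample SD of each column (biomarker) across two groups.

    Assumption: the usual pooled variance with denominator n0 + n1 - 2.
    """
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    n0, n1 = x0.shape[0], x1.shape[0]
    ss0 = (n0 - 1) * x0.var(axis=0, ddof=1)  # within-group sum of squares, group 0
    ss1 = (n1 - 1) * x1.var(axis=0, ddof=1)  # within-group sum of squares, group 1
    return np.sqrt((ss0 + ss1) / (n0 + n1 - 2))

def standardize(x0, x1):
    """Divide every biomarker by its pooled SD (no centering)."""
    s = pooled_sd(x0, x1)
    return np.asarray(x0, dtype=float) / s, np.asarray(x1, dtype=float) / s

# Hypothetical toy data: two biomarkers on very different scales,
# with group sizes matching the example (52 non-diseased, 54 diseased).
rng = np.random.default_rng(0)
x0 = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(52, 2))
x1 = rng.normal(loc=[1.0, 150.0], scale=[2.0, 80.0], size=(54, 2))
z0, z1 = standardize(x0, x1)
```

After this rescaling, every biomarker has a pooled standard deviation of one, so the magnitudes of the coefficients in a linear combination become more directly comparable.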


Table 5.5: The coefficients of the optimal linear combination and the corresponding pAUC value for the specificity range (1−t, 1) in the breast tissue example.

I. Finding the optimal linear combination

t    Method                        I0   PA500     HFS      DA    AREA    A/DA  MAX IP      DR       P  pAUC (est.)
0.1  Multiple-initial         -0.2617  0.0047  0.0047 -0.3314 -0.0015  0.6763  0.2267  0.5223  0.2003       0.0482
     Liu et al. (2005)         0.0000  0.9915  0.1299  0.0001  0.0000  0.0004 -0.0005 -0.0001  0.0000       0.0480
0.2  Multiple-initial         -0.2415  0.0050  0.0039 -0.3591 -0.0014  0.6692  0.2529  0.5185  0.1787       0.1264
     Liu et al. (2005)         0.0000  0.9915  0.1299  0.0001  0.0000  0.0004 -0.0005 -0.0001  0.0000       0.1107
0.3  Multiple-initial         -0.0061  0.6455 -0.7634 -0.0115  0.0000  0.0099  0.0091  0.0140  0.0042       0.2037
     Liu et al. (2005)         0.0000  0.9915  0.1299  0.0001  0.0000  0.0004 -0.0005 -0.0001  0.0000       0.1755
0.4  Multiple-initial         -0.0061  0.6455 -0.7634 -0.0115  0.0000  0.0099  0.0091  0.0140  0.0042       0.3260
     Liu et al. (2005)         0.0000  0.9915  0.1299  0.0001  0.0000  0.0004 -0.0005 -0.0001  0.0000       0.2440
0.5  Multiple-initial         -0.0061  0.6455 -0.7634 -0.0115  0.0000  0.0099  0.0091  0.0140  0.0042       0.4461
     Liu et al. (2005)         0.0000  0.9915  0.1299  0.0001  0.0000  0.0004 -0.0005 -0.0001  0.0000       0.3156

II. Biomarker selection

t    Method                        I0   PA500     HFS      DA    AREA    A/DA  MAX IP      DR       P  pAUC (est.)
0.1  Full set (raw)           -0.2617  0.0047  0.0047 -0.3314 -0.0015  0.6763  0.2267  0.5223  0.2003       0.0482
     Full set (standardized)  -0.5722  0.2835  0.0284 -0.2962 -0.1643  0.0911 -0.0378  0.3914  0.5601       0.0591
     Forward Selection         0.0000  0.8213  0.0000  0.0000 -0.3577  0.3843 -0.2234  0.0000  0.0000       0.0575
     Backward Selection       -0.7310  0.0000  0.0000 -0.1085 -0.0883  0.0597  0.0000  0.2616  0.6117       0.0466
     LASSO(λmin)              -0.5722  0.2835  0.0284 -0.2962 -0.1643  0.0911 -0.0378  0.3914  0.5601       0.0591
     LASSO(λ1SE)              -0.0877  0.9916  0.0000  0.0000  0.0000 -0.0949  0.0000  0.0000  0.0000       0.0513
0.2  Full set (raw)           -0.2415  0.0050  0.0039 -0.3591 -0.0014  0.6692  0.2529  0.5185  0.1787       0.1264
     Full set (standardized)  -0.6213  0.1277  0.0015 -0.3396 -0.1115  0.0485  0.0529  0.4206  0.5365       0.1337
     Forward Selection        -0.3809  0.9246  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000       0.1213
     Backward Selection       -0.5883  0.1158  0.0000 -0.4038 -0.1238  0.0656  0.0830  0.4688  0.4808       0.1338
     LASSO(λmin)              -0.6213  0.1277  0.0015 -0.3396 -0.1115  0.0485  0.0529  0.4206  0.5365       0.1337


Table 5.6: The Forward and the Backward selections for the specificity range (0.9,1) in the breast tissue example.

I. Forward selection

Step  Biomarker entries  Test statistic  Test value  p-value  Biomarker selected
1     I0                 pAUC (est.)         0.0002    1.000
2     P                  pAUC (est.)         0.0001    1.000
3     DR                 pAUC (est.)         0.0002    1.000
4     DA                 pAUC (est.)         0.0004    1.000
5     PA500              pAUC (est.)         0.0490    0.000  PA500
6     AREA               â_AREA             -0.2470    0.038  PA500, AREA
7     A/DA               â_A/DA              0.2196    0.044  PA500, AREA, A/DA
8     MAX IP             â_MAX IP           -0.2234    0.026  PA500, AREA, A/DA, MAX IP
9     HFS                â_HFS               0.0873    0.372  PA500, AREA, A/DA, MAX IP

II. Backward selection

Step  Biomarker assessed  Test statistic  Test value  p-value  Biomarker selected
1     All                 pAUC (est.)         0.0591    0.000  I0, PA500, HFS, DA, AREA, A/DA, MAX IP, DR, P
2     HFS                 â_HFS               0.0284    0.332  I0, PA500, DA, AREA, A/DA, MAX IP, DR, P
3     MAX IP              â_MAX IP           -0.0322    0.300  I0, PA500, DA, AREA, A/DA, DR, P
4     A/DA                â_A/DA              0.2859    0.004  I0, PA500, DA, AREA, A/DA, DR, P
5     AREA                â_AREA             -0.3082    0.010  I0, PA500, DA, AREA, A/DA, DR, P
6     PA500               â_PA500             0.5200    0.060  I0, DA, AREA, A/DA, DR, P
7     DA                  â_DA               -0.1085    0.028  I0, DA, AREA, A/DA, DR, P
8     DR                  â_DR                0.2616    0.012  I0, DA, AREA, A/DA, DR, P
9     P                   â_P                 0.6117    0.000  I0, DA, AREA, A/DA, DR, P
10    I0                  â_I0               -0.7310    0.002  I0, DA, AREA, A/DA, DR, P

Note: significance is assessed at α = 5%.


The biomarker selection results for t = 0.1 and t = 0.2 are presented in the lower panel of Table 5.5, which contains the estimated best linear combination and the corresponding pAUC for the specificity range (1 − t, 1). First, we find that using the standardized full data set not only produces a different optimal linear combination but also attains a greater pAUC value. Since rescaling individual biomarkers does not change the set of attainable linear combinations, the two maxima should coincide; the discrepancy shows that our proposed algorithm fails to find the global maximum on the raw data when the dimension becomes large. After data standardization, the number of major contributing variables, namely those whose coefficients are further away from zero, increases. Both the second and the fifth biomarkers are now contained in the set of major contributing variables.

Next, we apply the proposed biomarker selection procedures to the standardized data set. From the lower panel of Table 5.5, we find that the Forward method performs better than the Backward method when t = 0.1, whereas the Backward method is superior to the Forward method when t = 0.2. In addition, the sets of significant biomarkers selected by the two methods differ substantially. Specifically, at t = 0.1, the Forward method discards the top four biomarkers (in terms of the magnitude of the corresponding coefficient in the optimal linear combination of the full data set), namely I0, P, DR and DA, whereas the Backward method selects these four biomarkers. Surprisingly, the stepwise details at t = 0.1 given in Table 5.6 show that all four of these biomarkers are insignificant when their marginal pAUCs are tested.

To compare our biomarker selection methods with the LASSO method, the optimal linear combinations of the reduced biomarker sets selected via the LASSO are also presented in the lower panel of Table 5.5. Two different λ values are used: one achieves the minimum mean cross-validation error, denoted λmin; the other is the largest value of λ whose error is within one standard error of the minimum, denoted λ1SE. From the lower panel of Table 5.5, we find that the LASSO with λmin gives the most conservative result, in which all biomarkers are selected.
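The two choices of λ described above can be sketched from a cross-validation curve as follows. This is an illustration of the selection rule only; the function name and the numbers in the toy curve are hypothetical and are not taken from the breast tissue analysis.

```python
import numpy as np

def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from a cross-validation curve.

    lambda_min minimizes the mean CV error; lambda_1se is the largest
    lambda whose mean CV error is within one standard error of that minimum.
    """
    lambdas = np.asarray(lambdas, dtype=float)
    cv_mean = np.asarray(cv_mean, dtype=float)
    cv_se = np.asarray(cv_se, dtype=float)
    i_min = int(np.argmin(cv_mean))
    threshold = cv_mean[i_min] + cv_se[i_min]   # minimum error plus one SE
    within = cv_mean <= threshold               # lambdas that are "good enough"
    return lambdas[i_min], lambdas[within].max()

# Hypothetical CV curve (mean deviance and its standard error per lambda).
lambdas = [1.0, 0.5, 0.1, 0.05, 0.01]
cv_mean = [1.30, 1.05, 0.98, 0.95, 0.97]
cv_se = [0.05, 0.05, 0.04, 0.04, 0.04]
lam_min, lam_1se = select_lambdas(lambdas, cv_mean, cv_se)
```

Because λ1SE is never smaller than λmin, it applies at least as much shrinkage and typically retains fewer biomarkers, which is consistent with the sparser LASSO(λ1SE) row in Table 5.5.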

The biomarker set selected via the LASSO with λ1SE differs from the biomarker sets selected via our methods. In terms of the optimal sample pAUC, when t = 0.1, the LASSO with λ1SE is better than the Backward method but worse than the Forward method. When t = 0.2, it is better than the Forward method but worse than the Backward method. The analyses are performed using the cv.glmnet function of the R package glmnet with deviance loss and 10-fold cross-validation.

Table 5.7: The distributions of the two populations, their corresponding pAUC for the specificity range (0.9, 1), and two measures of heterogeneity in the breast tissue example with standardization.

I. Individual distribution

                 D = 0              D = 1
Biomarker     a*ᵀµ0       Q0     a*ᵀµ1       Q1  pAUC (est.)  √(Q0/Q1)  |(a*ᵀµ1 − a*ᵀµ0)/√Q1|
I0           0.0000   2.0024   -1.6187   0.0354       0.0002      7.52                   8.60
P            0.0000   1.9912   -1.4715   0.0462       0.0001      6.57                   6.85
DR           0.0000   1.8654   -0.8923   0.1673       0.0002      3.34                   2.18
DA           0.0000   1.8721   -1.0244   0.1608       0.0004      3.41                   2.55
PA500        0.0000   0.4252    1.1153   1.5531       0.0490      0.52                   0.89
AREA         0.0000   2.0071   -0.5167   0.0309       0.0000      8.06                   2.94
A/DA         0.0000   1.7155   -0.3562   0.3115       0.0001      2.35                   0.64
MAX IP       0.0000   1.9371   -0.9185   0.0982       0.0000      4.44                   2.93
HFS          0.0000   0.8450    0.2780   1.1491       0.0120      0.86                   0.26

II. Optimal linear combination after marker selection

                 D = 0              D = 1
Method        a*ᵀµ0       Q0     a*ᵀµ1       Q1  pAUC (est.)  √(Q0/Q1)  |(a*ᵀµ1 − a*ᵀµ0)/√Q1|
Forward      0.0000   0.2949    1.1690   1.2116       0.0575      0.49                   1.06
Backward     0.0000   0.0125    0.1852   0.0093       0.0466      1.16                   1.92

To investigate the relationship between the pAUC and the marginal distributions of the individual biomarkers in the breast tissue example, we report the sample mean and the sample variance of every biomarker within the two groups in the top panel of Table 5.7. The biomarkers are listed in descending order of the absolute value of their coefficients in the optimal linear combination of the full data set. Two measures of the heterogeneity between the two groups are also reported in Table 5.7. One is the ratio of the standard deviation of the non-diseased population to that of the diseased population, $\sqrt{Q_0/Q_1}$. The other is the absolute value of a standardized mean difference, in which the mean difference is divided by the standard deviation of the diseased group, $|(a^{*T}\mu_1 - a^{*T}\mu_0)/\sqrt{Q_1}|$.
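As a quick check of how the two measures are formed, the short sketch below computes them from a group mean and variance pair. The function name is hypothetical; the inputs are the I0 row of Table 5.7 (means 0.0000 and −1.6187, variances 2.0024 and 0.0354), for which the results agree with the reported 7.52 and 8.60 up to rounding.

```python
import math

def heterogeneity_measures(mu0, q0, mu1, q1):
    """Return (sqrt(Q0/Q1), |mu1 - mu0| / sqrt(Q1)).

    mu0, q0: mean and variance in the non-diseased group (D = 0)
    mu1, q1: mean and variance in the diseased group (D = 1)
    """
    sd_ratio = math.sqrt(q0 / q1)                 # SD ratio, non-diseased over diseased
    std_diff = abs(mu1 - mu0) / math.sqrt(q1)     # mean difference in diseased-SD units
    return sd_ratio, std_diff

# I0 row of Table 5.7.
sd_ratio, std_diff = heterogeneity_measures(0.0, 2.0024, -1.6187, 0.0354)
print(round(sd_ratio, 2), round(std_diff, 2))
```

Both quantities use the diseased-group standard deviation as the yardstick, so a very homogeneous diseased group inflates both of them at once.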

Additionally, the corresponding density plots of each biomarker within the two groups are given in Figure 5.1, where the density function of the diseased population is in red and that of the non-diseased population is in blue. Consider the following diagnostic rule for a single biomarker: the subject has a positive diagnosis if the observed value of the biomarker exceeds some critical value. In each panel of Figure 5.1, the vertical line x = c is the cutoff point corresponding to the upper limit t = 0.1 of 1 − specificity. Consequently, the marginal pAUC of each biomarker is obtained by integrating the right-tail probabilities of the diseased distribution over all cutoffs at or beyond x = c. Similarly, Table 5.7 and Figure 5.2 give the distributions of the optimal linear combinations of the reduced biomarker sets obtained by the two biomarker selection methods.
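The integration described above can be sketched numerically. The code below assumes, purely for illustration, that the biomarker is normally distributed in each group (a binormal model); the function name and the midpoint-rule integration are choices made here and are not claimed to be the estimator used to produce Table 5.7.

```python
from statistics import NormalDist

def binormal_pauc(mu0, q0, mu1, q1, t=0.1, n=2000):
    """pAUC over false-positive rates (0, t) under an assumed binormal model.

    mu0, q0: mean and variance of the non-diseased group
    mu1, q1: mean and variance of the diseased group
    Rule: a subject is positive when the biomarker exceeds the cutoff.
    """
    d0 = NormalDist(mu0, q0 ** 0.5)
    d1 = NormalDist(mu1, q1 ** 0.5)
    h = t / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * h              # midpoint of the k-th false-positive-rate slice
        c = d0.inv_cdf(1.0 - u)        # cutoff whose false-positive rate is u
        total += (1.0 - d1.cdf(c)) * h # right-tail probability in the diseased group
    return total

# Identical groups give the chance line ROC(u) = u, so pAUC = t^2 / 2.
chance = binormal_pauc(0.0, 1.0, 0.0, 1.0, t=0.1)

# Same means, different variance ratios: a more heterogeneous diseased
# group (small Q0/Q1) yields the larger pAUC in this high-specificity range.
wide_diseased = binormal_pauc(0.0, 1.0, 0.0, 4.0, t=0.1)
narrow_diseased = binormal_pauc(0.0, 1.0, 0.0, 0.25, t=0.1)
```

As u runs from 0 to t, the cutoff c moves from far in the right tail down to the vertical line x = c shown in the figure, which is why only the diseased-group mass to the right of that line contributes.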

From these results, we find that $\sqrt{Q_0/Q_1}$ is critical in determining the contribution of a biomarker. When the diseased population is relatively more heterogeneous than the non-diseased population, i.e. $\sqrt{Q_0/Q_1} \approx 0$, the right tail of the diseased distribution beyond the cutoff carries more mass, which yields a larger pAUC value. Conversely, a more homogeneous diseased group ($\sqrt{Q_0/Q_1} \to \infty$) tends to generate a smaller pAUC value. Hence, although the top four biomarkers have clearly distinct characteristics between the two groups and strong associations with the disease, they are insignificant when their marginal pAUCs are tested. However, no matter which selection method we use, too small a value of $|(a^{*T}\mu_1 - a^{*T}\mu_0)/\sqrt{Q_1}|$ (for example, for HFS) may lead to insignificance.
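One way to see why exactly these two measures matter is to write out the ROC curve under a binormal model. The derivation below is a sketch under that assumption, with $m_d = a^{*T}\mu_d$ denoting the group means; the cutoff notation $c_u$ is introduced here for illustration.

```latex
% Assumption: X_0 ~ N(m_0, Q_0) (non-diseased), X_1 ~ N(m_1, Q_1) (diseased),
% and a subject is positive when the biomarker exceeds the cutoff.
\begin{align*}
c_u &= m_0 + \sqrt{Q_0}\,\Phi^{-1}(1-u)
      && \text{(cutoff with false-positive rate } u\text{)} \\
\mathrm{ROC}(u) &= P(X_1 > c_u)
      = \Phi\!\left(\frac{m_1 - c_u}{\sqrt{Q_1}}\right) \\
    &= \Phi\!\left(\frac{m_1 - m_0}{\sqrt{Q_1}}
      + \sqrt{\frac{Q_0}{Q_1}}\,\Phi^{-1}(u)\right)
      && \text{(using } -\Phi^{-1}(1-u) = \Phi^{-1}(u)\text{)} \\
\mathrm{pAUC}(t) &= \int_0^t \mathrm{ROC}(u)\,du .
\end{align*}
```

For $u < 0.5$ the term $\Phi^{-1}(u)$ is negative, so a larger $\sqrt{Q_0/Q_1}$ pulls $\mathrm{ROC}(u)$ down over the whole range $(0, t)$, while the standardized mean difference enters with its sign: a biomarker whose diseased-group mean lies below the non-diseased mean is penalized further under the "exceeds the cutoff" rule.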

Finally, from the bottom panel of Table 5.7, although it has a smaller $|(a^{*T}\mu_1 - a^{*T}\mu_0)/\sqrt{Q_1}|$, the optimal linear combination obtained via the Forward method has a greater pAUC than that obtained via the Backward method, owing to its lower $\sqrt{Q_0/Q_1}$.
