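National Science Council, Executive Yuan: Research Project Final Report
Nonparametric Regression Analysis of the Area Under the Receiver Operating Characteristic Curve
Research Results Report (Condensed Version)
Project type: Individual
Project number: NSC 98-2118-M-004-004-
Project period: August 1, 2009 to July 31, 2010 (ROC years 98 to 99)
Institution: Department of Statistics, National Chengchi University
Principal investigator: 薛慧敏
Co-principal investigator: 張源俊
Project personnel: Master's student, part-time assistant: 劉世凰
Doctoral student, part-time assistant: 許嫚荏
Attachments: Report on attendance at an international conference, and the paper presented
Disclosure: This project involves patents or other intellectual property rights; the report is open to public search after 2 years
October 28, 2010 (ROC year 99)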
PARTIAL AREA UNDER THE ROC CURVE (PAUC)
Abstract. When multiple markers are associated with a disease, a combination of these markers often performs better than any individual marker, except in trivial cases. Here we focus on the class of linear combinations, which are easy to interpret. The AUC (area under the receiver operating characteristic curve), originally proposed for evaluating medical diagnostic tools, has become increasingly popular for assessing the discriminating power of a binary classification rule based on a continuous-scale score. In some applications, however, only a limited range of specificity is of interest, and the partial AUC (pAUC) is then a more appropriate criterion. The goal of this study is to find the linear combination that maximizes the pAUC. Under the normality assumption, the partial derivative of the pAUC is obtained, but it has a complex form, so finding the maximizer(s) is difficult. We develop an algorithm for this purpose and assess it in intensive numerical studies. The algorithm is found to have adequate empirical performance.
1. Method
Notation and Basic Results
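Consider p biomarkers used as diagnostic tools for a specific disease, where a large value of a biomarker favors a positive diagnosis. Let $X$ and $Y$ be the vectors of the p biomarkers for the non-diseased and diseased populations, respectively. Suppose
\[ X \sim MN(\mu_x, \Sigma_x), \qquad Y \sim MN(\mu_y, \Sigma_y), \]
where $\mu_x, \mu_y \in R^p$ and $\Sigma_x$ and $\Sigma_y$ are $p \times p$ covariance matrices. Let $a \in R^p$ be a vector of coefficients. Then, given $a$, the linear combinations $a^T X$ and $a^T Y$ follow
\[ V_{\bar D} \equiv a^T X \sim N(a^T\mu_x,\; a^T\Sigma_x a), \tag{1.1} \]
\[ V_D \equiv a^T Y \sim N(a^T\mu_y,\; a^T\Sigma_y a). \tag{1.2} \]
Define $\Delta\mu = \mu_y - \mu_x$, $Q_x = a^T\Sigma_x a$ and $Q_y = a^T\Sigma_y a$. Let $F_{\bar D}$ and $F_D$ be the distribution functions of $V_{\bar D}$ and $V_D$, and let $S = 1 - F$ denote the corresponding survival functions. It follows that, for fixed $a$, the ROC curve is
\[ ROC(u; a) = S_D[S_{\bar D}^{-1}(u)] = 1 - \Phi\left[\frac{S_{\bar D}^{-1}(u) - a^T\mu_y}{\sqrt{Q_y}}\right] = 1 - \Phi\left[\frac{c(u)\sqrt{Q_x} - a^T\Delta\mu}{\sqrt{Q_y}}\right], \tag{1.3} \]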
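where $u$ is a given false positive rate and $c(u) = \Phi^{-1}(1-u)$. Su and Liu (1993) showed that the coefficients of the best linear combination with respect to the AUC are
\[ a^* \propto \Sigma_x^{-1/2}\left(I + \Sigma_x^{-1/2}\Sigma_y\Sigma_x^{-1/2}\right)^{-1}\Sigma_x^{-1/2}\Delta\mu = (\Sigma_x + \Sigma_y)^{-1}\Delta\mu, \]
and the maximal AUC is
\[ \Phi\left(\sqrt{\Delta\mu^T(\Sigma_x + \Sigma_y)^{-1}\Delta\mu}\right). \]
Further, the partial area under the ROC curve of the linear combination, for a given $t \in (0, 1)$, is
\[ pAUC_t(a) = \int_0^t ROC(u; a)\,du = \int_0^t \left(1 - \Phi\left[\frac{c(u)\sqrt{Q_x} - a^T\Delta\mu}{\sqrt{Q_y}}\right]\right) du. \tag{1.4} \]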
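The quantities above can be evaluated numerically. The following is a minimal sketch, not the authors' implementation: the helper names `su_liu`, `roc` and `pauc`, the two-marker parameters, and the use of Gauss-Legendre quadrature for (1.4) are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

import numpy as np
from numpy.polynomial.legendre import leggauss

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}


def su_liu(mu_x, mu_y, sigma_x, sigma_y):
    """Return a* = (Sigma_x + Sigma_y)^{-1} dmu and the maximal AUC."""
    dmu = mu_y - mu_x
    a_star = np.linalg.solve(sigma_x + sigma_y, dmu)
    max_auc = STD_NORMAL.cdf(math.sqrt(float(dmu @ a_star)))
    return a_star, max_auc


def roc(u, a, mu_x, mu_y, sigma_x, sigma_y):
    """ROC(u; a) from Eq. (1.3) for a false positive rate 0 < u < 1."""
    q_x = float(a @ sigma_x @ a)
    q_y = float(a @ sigma_y @ a)
    c_u = STD_NORMAL.inv_cdf(1.0 - u)
    z = (c_u * math.sqrt(q_x) - float(a @ (mu_y - mu_x))) / math.sqrt(q_y)
    return 1.0 - STD_NORMAL.cdf(z)


def pauc(a, t, mu_x, mu_y, sigma_x, sigma_y, n_nodes=200):
    """pAUC_t(a) from Eq. (1.4), by Gauss-Legendre quadrature on (0, t)."""
    nodes, weights = leggauss(n_nodes)
    u = 0.5 * t * (nodes + 1.0)  # map the nodes from (-1, 1) to (0, t)
    values = [roc(float(ui), a, mu_x, mu_y, sigma_x, sigma_y) for ui in u]
    return 0.5 * t * float(np.dot(weights, values))


# Hypothetical two-marker example (not taken from the report).
mu_x = np.array([0.0, 0.0])
mu_y = np.array([1.0, 0.5])
sigma_x = np.eye(2)
sigma_y = np.array([[2.0, 0.3], [0.3, 1.0]])

a_star, max_auc = su_liu(mu_x, mu_y, sigma_x, sigma_y)
full_auc = pauc(a_star, 1.0, mu_x, mu_y, sigma_x, sigma_y)  # t = 1 gives the AUC
partial = pauc(a_star, 0.2, mu_x, mu_y, sigma_x, sigma_y)
print(f"max AUC (closed form) = {max_auc:.4f}, AUC by quadrature = {full_auc:.4f}")
print(f"pAUC at t = 0.2 for a* = {partial:.4f}")
```

Setting t = 1 in the quadrature should reproduce the closed-form maximal AUC up to quadrature error, which gives a quick consistency check between (1.3), (1.4) and the Su and Liu result.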
Optimal Linear Combination
To find the linear combination of biomarkers that maximizes the pAUC for a given t, we look for a vector $a$ that satisfies the first-order condition
\[ \frac{\partial\, pAUC_t(a)}{\partial a} = 0. \tag{1.5} \]
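Theorem 1.1. The coefficient vector $a_0$ of the best linear combination of the p biomarkers is proportional to
\[ (w_1\Sigma_x + w_2\Sigma_y)^{-1}\Delta\mu, \tag{1.6} \]
where
\[ w_1 = c_1\frac{a_0^T\Delta\mu}{Q_x + Q_y} + c_2 Q_y, \qquad w_2 = c_1\frac{a_0^T\Delta\mu}{Q_x + Q_y} - c_2 Q_x, \]
with
\[ c_1 = \sqrt{2\pi}\,\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right), \qquad c_2 = \sigma^2\exp\left[-\frac{(c(t) - \nu)^2}{2\sigma^2}\right]\cdot\frac{1}{\sqrt{Q_x}\,Q_y}, \]
and $\nu = a_0^T\Delta\mu\sqrt{Q_x}/(Q_x + Q_y)$, $\sigma^2 = Q_y/(Q_x + Q_y)$, where $Q_x$ and $Q_y$ are evaluated at $a_0$.

Proof. Let
\[ A = \frac{a^T\Delta\mu}{Q_y}\Sigma_y a - \Delta\mu, \qquad B = \frac{\Sigma_x a}{\sqrt{Q_x}} - \frac{\sqrt{Q_x}}{Q_y}\Sigma_y a, \]
and $\nu = a^T\Delta\mu\sqrt{Q_x}/(Q_x + Q_y)$, $\sigma^2 = Q_y/(Q_x + Q_y)$. Then, after the change of variable $y = c(u)$ in (1.4), Eq. (1.5) can be shown to have the following form:
\[ A\int_{c(t)}^{\infty}\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy + B\int_{c(t)}^{\infty} y\cdot\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy = 0. \tag{1.7} \]
Note that
\[ \int_{c(t)}^{\infty}\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy = \sqrt{2\pi}\,\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right), \]
and
\[ \int_{c(t)}^{\infty} y\cdot\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy = \sigma^2\exp\left[-\frac{(c(t) - \nu)^2}{2\sigma^2}\right] + \sqrt{2\pi}\,\nu\,\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right). \]
It follows that (1.7) becomes
\[ c_1(A + B\nu) + c_2 B\sqrt{Q_x}\,Q_y = 0, \tag{1.8} \]
where
\[ c_1 = \sqrt{2\pi}\,\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right), \qquad c_2 = \sigma^2\exp\left[-\frac{(c(t) - \nu)^2}{2\sigma^2}\right]\cdot\frac{1}{\sqrt{Q_x}\,Q_y}. \]
Because
\[ A + B\nu = \frac{a^T\Delta\mu}{Q_x + Q_y}(\Sigma_x + \Sigma_y)a - \Delta\mu, \qquad B\sqrt{Q_x}\,Q_y = Q_y\Sigma_x a - Q_x\Sigma_y a, \]
(1.8) becomes
\[ c_1\left[\frac{a^T\Delta\mu}{Q_x + Q_y}(\Sigma_x + \Sigma_y)a - \Delta\mu\right] + c_2\left[Q_y\Sigma_x a - Q_x\Sigma_y a\right] = 0, \]
which implies that
\[ c_1(\mu_y - \mu_x) = (w_1\Sigma_x + w_2\Sigma_y)a, \tag{1.9} \]
where
\[ w_1 = c_1\frac{a^T\Delta\mu}{Q_x + Q_y} + c_2 Q_y, \qquad w_2 = c_1\frac{a^T\Delta\mu}{Q_x + Q_y} - c_2 Q_x. \]
It is known that the pAUC is invariant to the scale of $a$, so the best linear combination is
\[ a_0 \propto (w_1\Sigma_x + w_2\Sigma_y)^{-1}(\mu_y - \mu_x). \]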
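The stationarity condition (1.9) can be checked numerically. The sketch below is illustrative only: it evaluates the pAUC after the substitution y = c(u), computes the residual c1 Δµ − (w1 Σx + w2 Σy)a, and compares its direction with a finite-difference gradient of the pAUC. The function names, the three-marker parameters, the reading of c2 with the factor 1/(√Qx Qy), and the truncation of the integral at c(t) + 12 are assumptions of this sketch.

```python
import math
from statistics import NormalDist

import numpy as np
from numpy.polynomial.legendre import leggauss

STD_NORMAL = NormalDist()


def pauc(a, t, dmu, sigma_x, sigma_y, n_nodes=200):
    """pAUC_t(a) after the substitution y = c(u): an integral over (c(t), inf)."""
    q_x, q_y = float(a @ sigma_x @ a), float(a @ sigma_y @ a)
    d = float(a @ dmu)
    c_t = STD_NORMAL.inv_cdf(1.0 - t)
    upper = c_t + 12.0  # assumed cut-off; the normal density is negligible beyond it
    nodes, weights = leggauss(n_nodes)
    total = 0.0
    for x, w in zip(nodes, weights):
        y = c_t + 0.5 * (upper - c_t) * (float(x) + 1.0)
        z = (d - y * math.sqrt(q_x)) / math.sqrt(q_y)
        total += float(w) * STD_NORMAL.cdf(z) * STD_NORMAL.pdf(y)
    return 0.5 * (upper - c_t) * total


def theorem_weights(a, t, dmu, sigma_x, sigma_y):
    """Return (w1, w2, c1) as defined in Theorem 1.1, evaluated at a."""
    q_x, q_y = float(a @ sigma_x @ a), float(a @ sigma_y @ a)
    d = float(a @ dmu)
    c_t = STD_NORMAL.inv_cdf(1.0 - t)
    nu = d * math.sqrt(q_x) / (q_x + q_y)
    sigma = math.sqrt(q_y / (q_x + q_y))
    c1 = math.sqrt(2.0 * math.pi) * sigma * STD_NORMAL.cdf((nu - c_t) / sigma)
    c2 = sigma**2 * math.exp(-((c_t - nu) ** 2) / (2.0 * sigma**2)) / (math.sqrt(q_x) * q_y)
    w1 = c1 * d / (q_x + q_y) + c2 * q_y
    w2 = c1 * d / (q_x + q_y) - c2 * q_x
    return w1, w2, c1


def stationarity_residual(a, t, dmu, sigma_x, sigma_y):
    """c1 * dmu - (w1 Sigma_x + w2 Sigma_y) a, which is zero when (1.9) holds."""
    w1, w2, c1 = theorem_weights(a, t, dmu, sigma_x, sigma_y)
    return c1 * dmu - (w1 * sigma_x + w2 * sigma_y) @ a


def numerical_gradient(a, t, dmu, sigma_x, sigma_y, h=1e-5):
    """Central finite-difference gradient of the pAUC with respect to a."""
    grad = np.zeros_like(a)
    for i in range(a.size):
        step = np.zeros_like(a)
        step[i] = h
        grad[i] = (pauc(a + step, t, dmu, sigma_x, sigma_y)
                   - pauc(a - step, t, dmu, sigma_x, sigma_y)) / (2.0 * h)
    return grad


# Hypothetical three-marker example (not taken from the report).
dmu = np.array([1.0, 0.5, 0.8])
sigma_x = np.eye(3)
sigma_y = np.array([[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.5]])
a = np.array([1.0, -0.5, 0.2])  # an arbitrary, non-optimal coefficient vector
t = 0.2

residual = stationarity_residual(a, t, dmu, sigma_x, sigma_y)
gradient = numerical_gradient(a, t, dmu, sigma_x, sigma_y)
cosine = float(residual @ gradient / (np.linalg.norm(residual) * np.linalg.norm(gradient)))
print(f"a^T residual = {float(a @ residual):.2e}, cosine(residual, gradient) = {cosine:.6f}")
```

If the derivation holds, the residual should be a positive multiple of the gradient, so the cosine should be close to 1, and the residual should be orthogonal to a, which reflects the scale invariance of the pAUC.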
2. Algorithm
From Theorem 1.1, solving for the optimal coefficient vector is equivalent to a fixed-point problem,
\[ a_0 = f(a_0), \]
where $f(a)$ denotes the right-hand side of (1.6) with $w_1$ and $w_2$ evaluated at $a$.
We consider the following "naive" iterative algorithm:
Step 0. Calculate the coefficients $a^*$ of the best linear combination with respect to the AUC,
\[ a^* = (\Sigma_x + \Sigma_y)^{-1}\Delta\mu. \]
Step 1. Use $a^*$ as the initial value, $a_0^{(0)} = a^*$, and compute the corresponding pAUC by (1.4). Denote this pAUC by $pAUC^{(0)}$.
Step 2. Calculate $f(a_0^{(0)})$. If $f(a_0^{(0)}) \cdot \Delta\mu > 0$, then set $a_0^{(1)} = f(a_0^{(0)})$. Otherwise, set $a_0^{(1)} = -f(a_0^{(0)})$.
Step 3. Normalize $a_0^{(1)}$ to have unit norm and compute the corresponding pAUC, $pAUC^{(1)}$.
Step 4. Calculate the increment $\delta^{(1)} = pAUC^{(1)} - pAUC^{(0)}$. Then
a. If $\delta^{(1)} < -\epsilon$, find the first two significant biomarkers according to the absolute magnitudes of their coefficients in $a_0^{(1)}$, and record these coefficients as the $2 \times 1$ vector $b^{(1)}$. Define $b^{(0)}$ from $a_0^{(0)}$ in the same way. Find the angle between $b^{(0)}$ and $b^{(1)}$. Rotate $b^{(1)}$ from $b^{(0)}$ by the same angle in the reverse direction. Update $a_0^{(1)}$ by combining the rotated vector with the original $a_0^{(1)}$. Then recalculate $pAUC^{(1)}$ and $\delta^{(1)}$.
b. If, at the first stage, $|\delta^{(1)}| < \epsilon$, again find $b^{(1)}$ and $b^{(0)}$. Rotate $b^{(0)}$ counterclockwise by some angle $\theta_1$. Update $a_0^{(1)}$ by combining the rotated vector with the original $a_0^{(0)}$. Then recalculate $pAUC^{(1)}$ and $\delta^{(1)}$, and repeat Steps 2-4.
• If $\delta^{(1)} > \epsilon$, repeat Steps 2-4. Otherwise, stop and let the last $a_0$ be the final solution.
• Note: Here $\epsilon = 10^{-8}$ and $\theta_1 = \pi/8$.
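The core of the iteration (Steps 0 to 4) can be sketched as follows. This is a simplified illustration, not the authors' code: it omits the rotation safeguards of Steps 4a and 4b, it keeps the iterate with the larger pAUC when the increment falls to ε or below, and the function names, the quadrature rule, the `max_iter` cap and the two-marker parameters are assumptions.

```python
import math
from statistics import NormalDist

import numpy as np
from numpy.polynomial.legendre import leggauss

STD_NORMAL = NormalDist()
NODES, WEIGHTS = leggauss(200)


def pauc(a, t, dmu, sigma_x, sigma_y):
    """pAUC_t(a) from Eq. (1.4), by Gauss-Legendre quadrature on (0, t)."""
    q_x, q_y = float(a @ sigma_x @ a), float(a @ sigma_y @ a)
    d = float(a @ dmu)
    total = 0.0
    for x, w in zip(NODES, WEIGHTS):
        u = 0.5 * t * (float(x) + 1.0)
        z = (STD_NORMAL.inv_cdf(1.0 - u) * math.sqrt(q_x) - d) / math.sqrt(q_y)
        total += float(w) * (1.0 - STD_NORMAL.cdf(z))
    return 0.5 * t * total


def f(a, t, dmu, sigma_x, sigma_y):
    """Fixed-point map: (w1 Sigma_x + w2 Sigma_y)^{-1} dmu, with w1, w2 evaluated at a."""
    q_x, q_y = float(a @ sigma_x @ a), float(a @ sigma_y @ a)
    d = float(a @ dmu)
    c_t = STD_NORMAL.inv_cdf(1.0 - t)
    nu = d * math.sqrt(q_x) / (q_x + q_y)
    sigma = math.sqrt(q_y / (q_x + q_y))
    c1 = math.sqrt(2.0 * math.pi) * sigma * STD_NORMAL.cdf((nu - c_t) / sigma)
    c2 = sigma**2 * math.exp(-((c_t - nu) ** 2) / (2.0 * sigma**2)) / (math.sqrt(q_x) * q_y)
    w1 = c1 * d / (q_x + q_y) + c2 * q_y
    w2 = c1 * d / (q_x + q_y) - c2 * q_x
    return np.linalg.solve(w1 * sigma_x + w2 * sigma_y, dmu)


def naive_iteration(t, dmu, sigma_x, sigma_y, eps=1e-8, max_iter=100):
    """Simplified Steps 0-4; returns (coefficients, pAUC, iterations used)."""
    a = np.linalg.solve(sigma_x + sigma_y, dmu)      # Step 0: Su and Liu start
    a = a / np.linalg.norm(a)
    best = pauc(a, t, dmu, sigma_x, sigma_y)         # Step 1
    for iteration in range(1, max_iter + 1):
        cand = f(a, t, dmu, sigma_x, sigma_y)        # Step 2
        if float(cand @ dmu) <= 0.0:
            cand = -cand
        cand = cand / np.linalg.norm(cand)           # Step 3
        value = pauc(cand, t, dmu, sigma_x, sigma_y)
        if value - best <= eps:                      # Step 4 without rotations
            return a, best, iteration
        a, best = cand, value
    return a, best, max_iter


# Hypothetical two-marker example (not taken from the report).
dmu = np.array([1.0, 0.5])
sigma_x = np.eye(2)
sigma_y = np.array([[2.0, 0.3], [0.3, 1.0]])
t = 0.2

a_start = np.linalg.solve(sigma_x + sigma_y, dmu)
start_pauc = pauc(a_start, t, dmu, sigma_x, sigma_y)
a_hat, best_pauc, n_iter = naive_iteration(t, dmu, sigma_x, sigma_y)
print(f"pAUC at a* = {start_pauc:.6f}, pAUC after {n_iter} iterations = {best_pauc:.6f}")
print("coefficients:", np.round(a_hat, 4))
```

Because this sketch only accepts a candidate when the pAUC increases by more than ε, the returned pAUC can never fall below the value at the Su and Liu starting point; the rotation steps in the text address the cases where the plain iteration stalls or decreases.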
For a simulated data set with an equi-correlation structure, the algorithm is robust to the choice of the initial value, and the convergence is monotone upward in all runs. The algorithm performs well with p up to 100; see Case 1. However, for the two examples in Liu et al. (2005), the point to which the algorithm converges is sensitive to the initial value, and the convergence is no longer monotone. In practice, we suggest using the coefficients $a^*$ of the best linear combination for the AUC by Su and Liu (1993) as the initial value; see the results of Cases 2-3.
Case 1: Slightly positive equi-correlation
Data: p = 100
• Mean: $\mu_x = 0$, $\mu_y = (0, \cdots, 0, 1.02, 1.04, \cdots, 2)^T$ (50 effective biomarkers).
• Variance: $\Sigma_x = I_p$. In $\Sigma_y$, $\sigma_{i,i} = 1$ and $\sigma_{i,j} = \rho$ for $i \neq j$. Here $\rho = \pm 0.2$.
• The cutoff for pAUC: t = 0.2.
• $\epsilon = 10^{-6}$.
Results:
Initial   Iterations for convergence   pAUC
V(1)      3                            0.2
V(2)      3                            0.2
V(p)      3                            0.2
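The Case 1 configuration can be built directly from its description. The sketch below is illustrative and uses ρ = 0.2 only, because for p = 100 an equi-correlation matrix with ρ = −0.2 has the eigenvalue 1 + (p − 1)ρ < 0 and is therefore not a valid covariance matrix. It also evaluates the Su and Liu maximal AUC as a consistency check: a reported pAUC equal to the cutoff t = 0.2 is what one would expect when the two populations are almost perfectly separated. Variable names are assumptions.

```python
import math
from statistics import NormalDist

import numpy as np

p, rho = 100, 0.2

mu_x = np.zeros(p)
# 50 null markers followed by 50 effective markers with means 1.02, 1.04, ..., 2.00
mu_y = np.concatenate([np.zeros(50), 1.0 + 0.02 * np.arange(1, 51)])
sigma_x = np.eye(p)
sigma_y = (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))  # unit variances, correlation rho

# Smallest eigenvalue of the equi-correlation matrix; must be positive
min_eig = float(np.linalg.eigvalsh(sigma_y).min())

# Su and Liu (1993) coefficients and the corresponding maximal AUC
dmu = mu_y - mu_x
a_star = np.linalg.solve(sigma_x + sigma_y, dmu)
max_auc = NormalDist().cdf(math.sqrt(float(dmu @ a_star)))

print(f"smallest eigenvalue of Sigma_y = {min_eig:.4f}")
print(f"maximal AUC = {max_auc:.10f}")
```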
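Case 2: Liu et al. (Statistics in Medicine, 2005)
Data: p = 4
• Mean: $\mu_x = (14.46, 23.89, 7.29, 17.68)^T$, $\mu_y = (15.61, 25.83, 8.2, 19.21)^T$.
• Variance:
\[ \Sigma_x = \begin{pmatrix} 1.88 & -3.20 & -2.10 & -1.85 \\ -3.20 & 13.33 & 3.31 & 7.39 \\ -2.10 & 3.31 & 4.67 & 4.10 \\ -1.85 & 7.39 & 4.10 & 10.33 \end{pmatrix}, \qquad \Sigma_y = \begin{pmatrix} 13.56 & -7.28 & -5.79 & -6.95 \\ -7.28 & 23.23 & 7.12 & 6.60 \\ -5.79 & 7.12 & 5.86 & 3.85 \\ -6.95 & 6.60 & 3.85 & 13.71 \end{pmatrix}. \]
• The cutoff for pAUC: t = 0.2.
• $\epsilon = 10^{-6}$.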
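The marginal pAUC of each single marker in Case 2 follows from (1.4) with a set to a unit coordinate vector, so only the means and the diagonal entries of the covariance matrices are needed. The sketch below is illustrative: it reads the first entry of µy as 15.61 (an assumption, since p = 4), and the function name and the quadrature rule are also assumptions.

```python
import math
from statistics import NormalDist

import numpy as np
from numpy.polynomial.legendre import leggauss

STD_NORMAL = NormalDist()

mu_x = np.array([14.46, 23.89, 7.29, 17.68])
mu_y = np.array([15.61, 25.83, 8.2, 19.21])    # first entry read as 15.61 (assumption)
var_x = np.array([1.88, 13.33, 4.67, 10.33])   # diagonal of Sigma_x
var_y = np.array([13.56, 23.23, 5.86, 13.71])  # diagonal of Sigma_y
t = 0.2


def marginal_pauc(dmu, v_x, v_y, t, n_nodes=200):
    """pAUC_t of a single marker: Eq. (1.4) with Q_x = v_x and Q_y = v_y."""
    nodes, weights = leggauss(n_nodes)
    total = 0.0
    for x, w in zip(nodes, weights):
        u = 0.5 * t * (float(x) + 1.0)  # map the node from (-1, 1) to (0, t)
        z = (STD_NORMAL.inv_cdf(1.0 - u) * math.sqrt(v_x) - dmu) / math.sqrt(v_y)
        total += float(w) * (1.0 - STD_NORMAL.cdf(z))
    return 0.5 * t * total


marginals = [
    marginal_pauc(float(mu_y[i] - mu_x[i]), float(var_x[i]), float(var_y[i]), t)
    for i in range(4)
]
print([round(m, 4) for m in marginals])
```

Under these assumptions the four values should be close to the marginal pAUCs reported for this case (0.0838, 0.0537, 0.0421, 0.0459).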
Results:
1. Marginal pAUC: 0.0838, 0.0537, 0.0421, 0.0459.
2. Not robust to the initial value, and the convergence is not stable.

Initial   Iterations   pAUC     a0
V(1)      3            0.1150   (0.884, 0.217, 0.373, -0.175)
V(2)      2            0.0537   (0.000, 1.000, 0.000, 0.000)
V(4)      3            0.0454   (-0.096, 0.081, -0.052, 0.991)
a∗        3            0.1194   (0.880, 0.217, 0.417, -0.067)

Case 3: Coronary Heart Disease Example (Liu et al., 2005)
Data: p = 4
• Variance:
\[ \Sigma_x = \begin{pmatrix} 0.0034 & -0.0004 & -0.0002 & -0.0051 \\ -0.0004 & 0.0285 & 0.0029 & 0.0417 \\ -0.0002 & 0.0039 & 0.0488 & 0.0268 \\ -0.0051 & 0.0417 & 0.0268 & 0.2846 \end{pmatrix}, \qquad \Sigma_y = \begin{pmatrix} 0.0043 & -0.0004 & -0.0002 & -0.0051 \\ 0.0033 & 0.0415 & 0.0019 & 0.0426 \\ 0.0006 & 0.0019 & 0.0389 & 0.0010 \\ 0.0067 & 0.0426 & 0.0010 & 0.1504 \end{pmatrix}. \]
• The cutoff for pAUC: t = 0.2.
• $\epsilon = 10^{-6}$.
Results:
1. Marginal pAUC: 0.0331, 0.0392, 0.0230, 0.0176.
2. Not robust to the initial value.

Initial                    Iterations   pAUC              a0
V(1)                       4            0.0480            (0.9475, 0.3150, 0.0431, 0.0320)
V(2)                       3            0.0440            (0.8927, 0.4395, -0.0992, -0.0029)
V(4)                       2            0.0230            (0.0000, 0.0000, 1.0000, 0.0000)
a∗                         3            0.0480            (0.9476, 0.3150, 0.0432, 0.0321)
Su and Liu                 -            0.0384            (1.4600, 0.3400, 0.4117, 0.2216)
Su and Liu (scaled)        -            -                 (0.9298, 0.2165, 0.2622, 0.1411)
Liu et al.                 -            0.0470 (0.019?)   (-0.8436, -3.2269, 0.2918, -0.1181)
Liu et al. (scaled)        -            -                 (-0.2518, -0.9632, 0.0871, -0.0352)
Liu et al. (change sign)   -            0.0402            -
References
[1] Su, J. Q. and Liu, J. S. (1993) Linear combinations of multiple diagnostic markers, Journal of the American Statistical Association, 88, 1350-1355.
[2] Liu, A., Schisterman, E. F. and Zhu, Y. (2005) On linear combinations of biomarkers to improve diagnostic accuracy, Statistics in Medicine, 24, 37-47.
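National Science Council, Executive Yuan: Report on Subsidized Attendance of Domestic Experts and Scholars at International Academic Conferences
August 24, 2009 (ROC year 98)
Name of reporter: 薛慧敏
Affiliation and title: Department of Statistics, National Chengchi University
Conference dates: August 2 to 5, 2009 (ROC year 98)
Conference location: Washington, D.C., USA
NSC approval number: NSC 98-2118-M-004-004
Conference name: 2009 Joint Statistical Meetings (Chinese name: 美國統計研討會)
Paper title: Unconditional exact test for comparison of two Poisson means (Chinese title: 針對兩卜瓦松分佈平均數的比較問題之非條件確實統計檢定方法)
Attachment 3
Form Y04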