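Nonparametric Regression Analysis of the Area Under the Receiver Operating Characteristic Curve

National Science Council, Executive Yuan: Research Project Report

Research Report (Condensed Version)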

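Project type: Individual

Project number: NSC 98-2118-M-004-004-

Project period: August 1, 2009 to July 31, 2010

Institution: Department of Statistics, National Chengchi University

Principal investigator: 薛慧敏

Co-principal investigator: 張源俊

Project participants: Master's student, part-time assistant: 劉世凰

Doctoral student, part-time assistant: 許嫚荏

Report attachments: Report on attendance at an international conference, and the paper presented

Handling: This project involves patents or other intellectual property rights; open to public inquiry after 2 years

October 28, 2010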


PARTIAL AREA UNDER THE ROC CURVE (pAUC)

Abstract. When multiple markers are associated with a disease, a combination of the markers often performs better than any individual marker, except in trivial cases. We focus on the class of linear combinations because they are easy to interpret. The AUC (area under the receiver operating characteristic curve), originally proposed for evaluating medical diagnostic tools, has become increasingly popular for assessing the discriminating power of a binary classification rule based on a continuous-scale score. In some applications, however, only a limited range of specificity is of interest, and the partial AUC (pAUC) is then a more appropriate criterion. The goal of this study is to find the linear combination that maximizes the pAUC. Under the normality assumption, the partial derivative of the pAUC is obtained, but it has a complex form, so finding the maximizer(s) is difficult. We develop an algorithm for this purpose and assess it through intensive numerical studies. The algorithm is found to have adequate empirical performance.

1. Method

Notation and Basic Results

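Consider $p$ biomarkers used as diagnostic tools for a specific disease, where a large value of a biomarker favors a positive diagnosis. Let $X$ and $Y$ be the vectors of the $p$ biomarkers for the non-diseased and diseased populations, respectively. Suppose

$$X \sim MN(\mu_x, \Sigma_x), \qquad Y \sim MN(\mu_y, \Sigma_y),$$

where $\mu_x, \mu_y \in \mathbb{R}^p$ and $\Sigma_x$ and $\Sigma_y$ are $p \times p$ covariance matrices. Let $a \in \mathbb{R}^p$ be a vector of coefficients. Then, given $a$, the linear combinations $a^T X$ and $a^T Y$ follow

$$V_{\bar D} \equiv a^T X \sim N(a^T\mu_x,\ a^T\Sigma_x a), \tag{1.1}$$

$$V_D \equiv a^T Y \sim N(a^T\mu_y,\ a^T\Sigma_y a). \tag{1.2}$$

Define $\Delta\mu = \mu_y - \mu_x$, $Q_x = a^T\Sigma_x a$ and $Q_y = a^T\Sigma_y a$. Let $F_{\bar D}$ and $F_D$ be the distribution functions of $V_{\bar D}$ and $V_D$, and let $S = 1 - F$ denote the corresponding survival functions. It follows that, at fixed $a$, the ROC curve is

$$ROC(u; a) = S_D\left[S_{\bar D}^{-1}(u)\right] = 1 - \Phi\left[\frac{S_{\bar D}^{-1}(u) - a^T\mu_y}{\sqrt{Q_y}}\right] = 1 - \Phi\left[\frac{c(u)\sqrt{Q_x} - a^T\Delta\mu}{\sqrt{Q_y}}\right], \tag{1.3}$$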


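where $u$ is a given false positive rate and $c(u) = \Phi^{-1}(1 - u)$. Su and Liu (1993) showed that the coefficients of the best linear combination with respect to the AUC are

$$a^* \propto \Sigma_x^{-1/2}\left(I + \Sigma_x^{-1/2}\Sigma_y\Sigma_x^{-1/2}\right)^{-1}\Sigma_x^{-1/2}\Delta\mu = (\Sigma_x + \Sigma_y)^{-1}\Delta\mu,$$

and the maximal AUC is

$$\Phi\left(\sqrt{\Delta\mu^T(\Sigma_x + \Sigma_y)^{-1}\Delta\mu}\right).$$

Further, the partial area under the ROC curve of the linear combination, for a given $t \in (0, 1)$, is

$$pAUC_t(a) = \int_0^t ROC(u; a)\,du = \int_0^t \left(1 - \Phi\left[\frac{c(u)\sqrt{Q_x} - a^T\Delta\mu}{\sqrt{Q_y}}\right]\right) du. \tag{1.4}$$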
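The ROC curve (1.3) and the partial area (1.4) can be evaluated numerically once the three scalar summaries a^T Δμ, Qx and Qy are known. The following is a minimal sketch using only the Python standard library; the function names (`summaries`, `roc`, `pauc`), the midpoint integration rule and the grid size `n` are illustrative choices, not part of the report.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # cdf is Phi, inv_cdf is Phi^{-1}


def summaries(a, dmu, Sx, Sy):
    """Return (a' dmu, Qx, Qy) for a coefficient vector a (plain lists)."""
    p = len(a)
    d = sum(a[i] * dmu[i] for i in range(p))
    qx = sum(a[i] * Sx[i][j] * a[j] for i in range(p) for j in range(p))
    qy = sum(a[i] * Sy[i][j] * a[j] for i in range(p) for j in range(p))
    return d, qx, qy


def roc(u, d, qx, qy):
    """ROC(u; a) as in (1.3), with c(u) = Phi^{-1}(1 - u) and 0 < u < 1."""
    c = STD_NORMAL.inv_cdf(1.0 - u)
    return 1.0 - STD_NORMAL.cdf((c * qx ** 0.5 - d) / qy ** 0.5)


def pauc(t, d, qx, qy, n=2000):
    """Midpoint-rule approximation of (1.4); never evaluates u = 0 or u = t."""
    h = t / n
    return h * sum(roc((k + 0.5) * h, d, qx, qy) for k in range(n))


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    d, qx, qy = summaries([1.0, 1.0], [1.0, 0.5], identity, identity)
    print("pAUC at t = 0.2:", pauc(0.2, d, qx, qy))
    print("AUC  (t = 1)   :", pauc(1.0, d, qx, qy))
```

Two sanity checks follow from the formulas: with Δμ = 0 and Qx = Qy the ROC curve is the diagonal, so pAUC_t = t²/2; and multiplying a by a positive constant leaves the pAUC unchanged.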

Optimal Linear Combination

To find the linear combination of biomarkers that maximizes the pAUC for a given $t$, we look for a coefficient vector $a$ that satisfies the first-order condition

$$\frac{\partial\, pAUC(a)}{\partial a} = 0. \tag{1.5}$$

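Theorem 1.1. The coefficient vector $a_0$ of the best linear combination of the $p$ biomarkers is proportional to

$$(w_1\Sigma_x + w_2\Sigma_y)^{-1}\Delta\mu, \tag{1.6}$$

where

$$w_1 = c_1\frac{a_0^T\Delta\mu}{Q_x + Q_y} + c_2 Q_y, \qquad w_2 = c_1\frac{a_0^T\Delta\mu}{Q_x + Q_y} - c_2 Q_x,$$

with

$$c_1 = \sqrt{2\pi}\,\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right), \qquad c_2 = \sigma^2\exp\left[-\frac{(c(t) - \nu)^2}{2\sigma^2}\right]\cdot\frac{1}{\sqrt{Q_x}\,Q_y},$$

and

$$\nu = \frac{a_0^T\Delta\mu\,\sqrt{Q_x}}{Q_x + Q_y}, \qquad \sigma^2 = \frac{Q_y}{Q_x + Q_y}.$$

Proof. Let

$$A = \frac{a^T\Delta\mu}{Q_y}\Sigma_y a - \Delta\mu, \qquad B = \frac{\Sigma_x a}{\sqrt{Q_x}} - \frac{\sqrt{Q_x}}{Q_y}\Sigma_y a,$$

and $\nu = a^T\Delta\mu\,\sqrt{Q_x}/(Q_x + Q_y)$, $\sigma^2 = Q_y/(Q_x + Q_y)$. Then Eq. (1.5) can be shown to have the following form:

$$A\int_{c(t)}^{\infty}\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy + B\int_{c(t)}^{\infty} y\cdot\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy = 0. \tag{1.7}$$

Note that

$$\int_{c(t)}^{\infty}\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy = \sqrt{2\pi}\,\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right),$$

and

$$\int_{c(t)}^{\infty} y\cdot\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy = \sigma^2\exp\left[-\frac{(c(t) - \nu)^2}{2\sigma^2}\right] + \sqrt{2\pi}\,\nu\,\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right).$$

It follows that (1.7) becomes

$$c_1(A + B\nu) + c_2 B\sqrt{Q_x}\,Q_y = 0, \tag{1.8}$$

where

$$c_1 = \sqrt{2\pi}\,\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right), \qquad c_2 = \sigma^2\exp\left[-\frac{(c(t) - \nu)^2}{2\sigma^2}\right]\cdot\frac{1}{\sqrt{Q_x}\,Q_y}.$$

Because

$$A + B\nu = \frac{a^T\Delta\mu}{Q_x + Q_y}(\Sigma_x + \Sigma_y)a - \Delta\mu, \qquad B\sqrt{Q_x}\,Q_y = Q_y\Sigma_x a - Q_x\Sigma_y a,$$

(1.8) becomes

$$c_1\left[\frac{a^T\Delta\mu}{Q_x + Q_y}(\Sigma_x + \Sigma_y)a - \Delta\mu\right] + c_2\left[Q_y\Sigma_x a - Q_x\Sigma_y a\right] = 0,$$

which implies that

$$c_1(\mu_y - \mu_x) = (w_1\Sigma_x + w_2\Sigma_y)a, \tag{1.9}$$

where

$$w_1 = c_1\frac{a^T\Delta\mu}{Q_x + Q_y} + c_2 Q_y, \qquad w_2 = c_1\frac{a^T\Delta\mu}{Q_x + Q_y} - c_2 Q_x.$$

It is known that the pAUC is invariant to the scale of $a$, so the best linear combination is

$$a_0 \propto (w_1\Sigma_x + w_2\Sigma_y)^{-1}(\mu_y - \mu_x).$$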
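To make the map in Theorem 1.1 concrete, the sketch below evaluates w1 and w2 at a given coefficient vector a and returns (w1 Σx + w2 Σy)^{-1} Δμ. It is an illustration under stated assumptions rather than the authors' code: the names `solve`, `weights` and `fixed_point_map` are hypothetical, the small Gaussian-elimination solver stands in for any linear solver, and c2 is read with the factor 1/(√Qx · Qy), which is the reading under which (1.8) and (1.9) agree.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal


def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [list(M[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for k in range(col, n + 1):
                A[r][k] -= factor * A[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (A[i][n] - tail) / A[i][i]
    return z


def weights(a, dmu, Sx, Sy, t):
    """w1 and w2 of Theorem 1.1, evaluated at the vector a."""
    p = len(a)
    d = sum(a[i] * dmu[i] for i in range(p))
    qx = sum(a[i] * Sx[i][j] * a[j] for i in range(p) for j in range(p))
    qy = sum(a[i] * Sy[i][j] * a[j] for i in range(p) for j in range(p))
    c_t = PHI.inv_cdf(1.0 - t)                      # c(t)
    nu = d * sqrt(qx) / (qx + qy)
    sigma = sqrt(qy / (qx + qy))
    c1 = sqrt(2.0 * pi) * sigma * PHI.cdf((nu - c_t) / sigma)
    # Assumed reading of c2: the factor is 1 / (sqrt(Qx) * Qy).
    c2 = sigma ** 2 * exp(-(c_t - nu) ** 2 / (2.0 * sigma ** 2)) / (sqrt(qx) * qy)
    base = c1 * d / (qx + qy)
    return base + c2 * qy, base - c2 * qx


def fixed_point_map(a, dmu, Sx, Sy, t):
    """Return (w1 Sx + w2 Sy)^{-1} dmu with w1, w2 evaluated at a."""
    w1, w2 = weights(a, dmu, Sx, Sy, t)
    p = len(a)
    M = [[w1 * Sx[i][j] + w2 * Sy[i][j] for j in range(p)] for i in range(p)]
    return solve(M, dmu)


if __name__ == "__main__":
    S = [[2.0, 0.5], [0.5, 1.0]]
    print(fixed_point_map([0.4, 0.9], [1.0, 0.3], S, S, 0.2))
```

When Σx = Σy = Σ, the matrix w1 Σx + w2 Σy reduces to (w1 + w2) Σ, so the output is proportional to Σ^{-1} Δμ for any starting a with a^T Δμ > 0, which gives a simple way to check an implementation.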


2. Algorithm

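From Theorem 1.1, solving for the optimal coefficient vector is equivalent to solving the fixed-point problem

$$a_0 = f(a_0),$$

where $f(a) = (w_1\Sigma_x + w_2\Sigma_y)^{-1}\Delta\mu$ with $w_1$ and $w_2$ evaluated at $a$, and the equality is understood up to scale.

We consider the following "naive" iterative algorithm: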

Step 0. Calculate the coefficients $a^*$ of the best linear combination with respect to the AUC,

$$a^* = (\Sigma_x + \Sigma_y)^{-1}\Delta\mu.$$

Step 1. Use $a^*$ as the initial value, $a_0^{(0)} = a^*$, and compute the corresponding pAUC by (1.4). Denote this pAUC by $pAUC^{(0)}$.

Step 2. Calculate $f(a_0^{(0)})$. If $f(a_0^{(0)})\cdot\Delta\mu > 0$, then $a_0^{(1)} = f(a_0^{(0)})$. Otherwise, $a_0^{(1)} = -f(a_0^{(0)})$.

Step 3. Normalize $a_0^{(1)}$ to have unit norm and compute the corresponding pAUC, $pAUC^{(1)}$.

Step 4. Calculate the increment $\delta^{(1)} = pAUC^{(1)} - pAUC^{(0)}$. Then

a. If $\delta^{(1)} < -\epsilon$, find the two most significant biomarkers according to the absolute magnitudes of their coefficients in $a_0^{(1)}$, and record the corresponding $2 \times 1$ subvector as $b^{(1)}$. Define $b^{(0)}$ from $a_0^{(0)}$ in the same way. Find the angle between $b^{(0)}$ and $b^{(1)}$. Rotate $b^{(1)}$ from $b^{(0)}$ by the same angle in the reverse direction. Renew $a_0^{(1)}$ by combining the rotated vector with the original $a_0^{(1)}$. Then recalculate $pAUC^{(1)}$ and $\delta^{(1)}$.

b. If, at the first stage, $|\delta^{(1)}| < \epsilon$, again find $b^{(1)}$ and $b^{(0)}$. Rotate $b^{(0)}$ counterclockwise by some $\theta_1$. Renew $a_0^{(1)}$ by combining the rotated vector with the original $a_0^{(0)}$. Then recalculate $pAUC^{(1)}$ and $\delta^{(1)}$. Repeat Steps 2-4.

• If $\delta^{(1)} > \epsilon$, repeat Steps 2-4. Otherwise, stop and let the last $a_0$ be the final solution.

• Note: Here $\epsilon = 10^{-8}$ and $\theta_1 = \pi/8$.
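The core loop of the steps above (Steps 0-3 and the stopping rule in Step 4) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the rotation adjustments of Steps 4a and 4b are omitted, a candidate is accepted only when it raises the pAUC by more than `eps`, and the names `naive_iteration`, `fixed_point_map`, `pauc`, `solve` and `unit`, the midpoint integration grid and the `max_iter` cap are all assumptions made for this example.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

PHI = NormalDist()


def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [list(M[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for k in range(col, n + 1):
                A[r][k] -= factor * A[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (A[i][n] - sum(A[i][k] * z[k] for k in range(i + 1, n))) / A[i][i]
    return z


def terms(a, dmu, Sx, Sy):
    """Return (a' dmu, Qx, Qy)."""
    p = len(a)
    d = sum(a[i] * dmu[i] for i in range(p))
    qx = sum(a[i] * Sx[i][j] * a[j] for i in range(p) for j in range(p))
    qy = sum(a[i] * Sy[i][j] * a[j] for i in range(p) for j in range(p))
    return d, qx, qy


def pauc(a, dmu, Sx, Sy, t, n=1000):
    """Midpoint-rule approximation of (1.4)."""
    d, qx, qy = terms(a, dmu, Sx, Sy)
    h = t / n
    total = 0.0
    for k in range(n):
        c = PHI.inv_cdf(1.0 - (k + 0.5) * h)
        total += 1.0 - PHI.cdf((c * sqrt(qx) - d) / sqrt(qy))
    return h * total


def fixed_point_map(a, dmu, Sx, Sy, t):
    """(w1 Sx + w2 Sy)^{-1} dmu, with c2 read as in Theorem 1.1 (assumed)."""
    d, qx, qy = terms(a, dmu, Sx, Sy)
    c_t = PHI.inv_cdf(1.0 - t)
    nu, sigma = d * sqrt(qx) / (qx + qy), sqrt(qy / (qx + qy))
    c1 = sqrt(2.0 * pi) * sigma * PHI.cdf((nu - c_t) / sigma)
    c2 = sigma ** 2 * exp(-(c_t - nu) ** 2 / (2.0 * sigma ** 2)) / (sqrt(qx) * qy)
    w1, w2 = c1 * d / (qx + qy) + c2 * qy, c1 * d / (qx + qy) - c2 * qx
    p = len(a)
    M = [[w1 * Sx[i][j] + w2 * Sy[i][j] for j in range(p)] for i in range(p)]
    return solve(M, dmu)


def unit(v):
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def naive_iteration(dmu, Sx, Sy, t, a_init=None, eps=1e-8, max_iter=50):
    """Return (a, pAUC, iterations); rotation steps 4a/4b are not implemented."""
    p = len(dmu)
    if a_init is None:  # Steps 0-1: start from the AUC-optimal a*
        a_init = solve([[Sx[i][j] + Sy[i][j] for j in range(p)] for i in range(p)], dmu)
    a = unit(a_init)
    best = pauc(a, dmu, Sx, Sy, t)
    for iteration in range(1, max_iter + 1):
        cand = fixed_point_map(a, dmu, Sx, Sy, t)          # Step 2
        if sum(c * m for c, m in zip(cand, dmu)) <= 0:
            cand = [-c for c in cand]
        cand = unit(cand)                                  # Step 3
        value = pauc(cand, dmu, Sx, Sy, t)
        if value - best <= eps:                            # Step 4: no gain, stop
            return a, best, iteration
        a, best = cand, value
    return a, best, max_iter


if __name__ == "__main__":
    Sx = [[1.0, 0.0], [0.0, 1.0]]
    Sy = [[2.0, 0.3], [0.3, 1.0]]
    print(naive_iteration([1.0, 0.5], Sx, Sy, t=0.2))
```

Because a candidate is accepted only when it improves the pAUC, the returned value is never below the pAUC of the starting vector; whether the loop reaches the global maximizer depends on the starting vector.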

For a simulated data set with an equi-correlation structure, the algorithm is robust to the choice of the initial value, and the pAUC increases monotonically over the iterations in every run. The algorithm performs well with p up to 100; see Case 1. For the two examples in Liu et al. (2005), however, the limit point of the algorithm is sensitive to the initial value, and the convergence is no longer monotone. In practice, we suggest using the coefficients a∗ of the best linear combination for the AUC by Su and Liu (1993) as the initial value; see the results of Cases 2 and 3.


Case 1: Slightly positive equi-correlation

Data: p = 100

• Mean: $\mu_x = 0$, $\mu_y = (0, \cdots, 0, 1.02, 1.04, \cdots, 2)^T$ (50 effective biomarkers).

• Variance: $\Sigma_x = I_p$. In $\Sigma_y$, $\sigma_{i,i} = 1$ and $\sigma_{i,j} = \rho$ for $i \neq j$. Here $\rho = \pm 0.2$.

• The cutoff for pAUC: $t = 0.2$.

• $\epsilon = 10^{-6}$.

Results:

Initial   Iterations for convergence   pAUC
V(1)      3                            0.2
V(2)      3                            0.2
V(p)      3                            0.2
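The pAUC of 0.2 reported for Case 1 equals the cutoff t, which is the value attained by a perfect classifier. A short check, under the assumption ρ = +0.2 (the text lists ±0.2), shows that the AUC-optimal combination a∗ already comes very close to this bound. Because Σx + Σy = (2 − ρ)I + ρJ here, with J the all-ones matrix, the inverse is available in closed form by the Sherman-Morrison formula, so no matrix solver is needed. The variable names and the midpoint integration rule are choices made for this sketch.

```python
from statistics import NormalDist

PHI = NormalDist()
p, rho, t = 100, 0.2, 0.2  # rho = +0.2 is an assumption

# Delta mu: 50 zeros followed by 1.02, 1.04, ..., 2.00
dmu = [0.0] * 50 + [1.0 + 0.02 * k for k in range(1, 51)]

# (Sx + Sy)^{-1} dmu with Sx + Sy = (2 - rho) I + rho J (Sherman-Morrison)
shift = rho * sum(dmu) / ((2.0 - rho) + p * rho)
a_star = [(x - shift) / (2.0 - rho) for x in dmu]

d = sum(ai * xi for ai, xi in zip(a_star, dmu))      # a*' dmu
qx = sum(ai * ai for ai in a_star)                   # a*' I a*
qy = (1.0 - rho) * qx + rho * sum(a_star) ** 2       # a*' Sy a*

auc = PHI.cdf(d / (qx + qy) ** 0.5)

# Midpoint-rule approximation of the pAUC (1.4) at a*
n = 2000
h = t / n
pauc = h * sum(
    1.0 - PHI.cdf((PHI.inv_cdf(1.0 - (k + 0.5) * h) * qx ** 0.5 - d) / qy ** 0.5)
    for k in range(n)
)

print("AUC at a*       :", auc)
print("pAUC(0.2) at a* :", pauc)
```

With 50 informative markers the two populations are almost perfectly separated along a∗, so every starting vector that reaches this region yields a pAUC that rounds to 0.2.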

Case 2: Liu et al. (SIM, 2005)

Data: p = 4

• Mean: $\mu_x = (14.46, 23.89, 7.29, 17.68)^T$, $\mu_y = (15.61, 25.83, 8.2, 19.21)^T$.

• Variance:

$$\Sigma_x = \begin{pmatrix} 1.88 & -3.20 & -2.10 & -1.85 \\ -3.20 & 13.33 & 3.31 & 7.39 \\ -2.10 & 3.31 & 4.67 & 4.10 \\ -1.85 & 7.39 & 4.10 & 10.33 \end{pmatrix}, \qquad \Sigma_y = \begin{pmatrix} 13.56 & -7.28 & -5.79 & -6.95 \\ -7.28 & 23.23 & 7.12 & 6.60 \\ -5.79 & 7.12 & 5.86 & 3.85 \\ -6.95 & 6.60 & 3.85 & 13.71 \end{pmatrix}$$

Results:

1. Marginal pAUC: 0.0838, 0.0537, 0.0421, 0.0459.

2. Not robust to the initial value, and the convergence is not stable.

Initial   Iterations   pAUC     a0
V(1)      3            0.1150   (0.884, 0.217, 0.373, -0.175)
V(2)      2            0.0537   (0.000, 1.000, 0.000, 0.000)
V(4)      3            0.0454   (-0.096, 0.081, -0.052, 0.991)
a∗        3            0.1194   (0.880, 0.217, 0.417, -0.067)

Case 3: Coronary Heart Disease Example (Liu et al., 2005)

Data: p = 4


• Variance:

$$\Sigma_x = \begin{pmatrix} 0.0034 & -0.0004 & -0.0002 & -0.0051 \\ -0.0004 & 0.0285 & 0.0029 & 0.0417 \\ -0.0002 & 0.0039 & 0.0488 & 0.0268 \\ -0.0051 & 0.0417 & 0.0268 & 0.2846 \end{pmatrix}, \qquad \Sigma_y = \begin{pmatrix} 0.0043 & -0.0004 & -0.0002 & -0.0051 \\ 0.0033 & 0.0415 & 0.0019 & 0.0426 \\ 0.0006 & 0.0019 & 0.0389 & 0.0010 \\ 0.0067 & 0.0426 & 0.0010 & 0.1504 \end{pmatrix}$$

Results:

1. Marginal pAUC: 0.0331, 0.0392, 0.0230, 0.0176.

2. Not robust to the initial value.

Initial                    Iterations   pAUC              a0
V(1)                       4            0.0480            (0.9475, 0.3150, 0.0431, 0.0320)
V(2)                       3            0.0440            (0.8927, 0.4395, -0.0992, -0.0029)
V(4)                       2            0.0230            (0.0000, 0.0000, 1.0000, 0.0000)
a∗                         3            0.0480            (0.9476, 0.3150, 0.0432, 0.0321)
Su and Liu                 -            0.0384            (1.4600, 0.3400, 0.4117, 0.2216)
Su and Liu (scaled)        -            -                 (0.9298, 0.2165, 0.2622, 0.1411)
Liu et al.                 -            0.0470 (0.019?)   (-0.8436, -3.2269, 0.2918, -0.1181)
Liu et al. (scaled)        -            -                 (-0.2518, -0.9632, 0.0871, -0.0352)
Liu et al. (change sign)   -            0.0402            -
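The marginal pAUC values listed for Case 2 can be approximately reproduced from the means and the diagonal entries of the covariance matrices alone, since a single marker corresponds to a unit coefficient vector in (1.4). The sketch below rests on two assumptions that the report does not state explicitly: the cutoff is t = 0.2, as in Case 1, and the first entry of μy is 15.61. The function name `marginal_pauc` and the midpoint integration rule are illustrative.

```python
from statistics import NormalDist

PHI = NormalDist()

mu_x = [14.46, 23.89, 7.29, 17.68]
mu_y = [15.61, 25.83, 8.20, 19.21]    # first entry read as 15.61 (assumption)
var_x = [1.88, 13.33, 4.67, 10.33]    # diagonal of Sigma_x (Case 2)
var_y = [13.56, 23.23, 5.86, 13.71]   # diagonal of Sigma_y (Case 2)


def marginal_pauc(delta, vx, vy, t=0.2, n=4000):
    """pAUC (1.4) of one marker: a'dmu = delta, Qx = vx, Qy = vy."""
    h = t / n
    total = 0.0
    for k in range(n):
        c = PHI.inv_cdf(1.0 - (k + 0.5) * h)   # c(u) at the cell midpoint
        total += 1.0 - PHI.cdf((c * vx ** 0.5 - delta) / vy ** 0.5)
    return h * total


marginals = [
    marginal_pauc(my - mx, vx, vy)
    for mx, my, vx, vy in zip(mu_x, mu_y, var_x, var_y)
]
reported = [0.0838, 0.0537, 0.0421, 0.0459]  # values listed for Case 2

for j, (est, rep) in enumerate(zip(marginals, reported), start=1):
    print(f"marker {j}: computed {est:.4f}, reported {rep:.4f}")
```

Agreement to within rounding under these two assumptions is consistent with t = 0.2 having been used for Case 2 as well, although the report does not say so.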


References

[1] Su, J. Q. and Liu, J. S. (1993). Linear combinations of multiple diagnostic markers. Journal of the American Statistical Association, 88, 1350-1355.

[2] Liu, A., Schisterman, E. F. and Zhu, Y. (2005). On linear combinations of biomarkers to improve diagnostic accuracy. Statistics in Medicine, 24, 37-47.


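National Science Council, Executive Yuan: Report on Subsidized Attendance at an International Academic Conference by a Domestic Expert or Scholar

August 24, 2009

Name of reporter: 薛慧敏

Affiliation and title: Department of Statistics, National Chengchi University

Conference dates: August 2 to 5, 2009

Conference location: Washington, D.C., USA

NSC approval document number: NSC 98-2118-M-004-004

Conference name: 2009 Joint Statistical Meetings

Title of paper presented: Unconditional exact test for comparison of two Poisson means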


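I. Conference attendance

The first day was for registration. The following three days were spent attending conference sessions. Each day had four time slots, with sessions running concurrently in multiple lecture halls. In addition, poster presentations were held in a designated area every morning and afternoon. I presented my poster on the afternoon of August 4. At other times I went to the lecture halls to hear the participating scholars present their papers, learn about the latest academic developments, and take part in discussions where appropriate.

II. Reflections on the conference

About 5,000 people attended this year's meeting, with enthusiastic participation from experts and scholars in statistics and in computer science and information engineering. The conference provided a good opportunity for scholars from different fields to exchange and share theory, methods, and practical applications in statistical computing and computational statistics. I gained a great deal from this conference.

III. Study visits (omit if none)

None.

IV. Suggestions

In recent years, for economic reasons, the travel and living expenses of attending conferences in Europe and the United States have risen year after year, and the National Science Council subsidy is usually insufficient. I suggest that the subsidy be raised appropriately in response to these conditions, or that more flexibility be allowed in reallocating project funds, so as to increase willingness to attend conferences.

V. Materials brought back

One printed volume and one CD-ROM. The printed volume is the conference program; the CD-ROM contains the abstracts of the presented papers.

VI. Other

None.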


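Summary of Research Results, Fiscal Year 98 (2009) Research Projects

Principal investigator: 薛慧敏

Project number: 98-2118-M-004-004-

Project title: Nonparametric Regression Analysis of the Area Under the Receiver Operating Characteristic Curve

Column key: Achieved = number actually achieved (accepted or published); Expected = expected total (including the number achieved); Share = percentage actually contributed by this project; Remarks = qualitative notes, such as results shared among several projects or results featured as a journal cover story.

Domestic

Category                                  Item                          Achieved   Expected   Share   Unit
Publications                              Journal papers                1          1          100%    papers
Publications                              Research/technical reports    0          0          100%    papers
Publications                              Conference papers             0          0          100%    papers
Publications                              Books                         0          0          100%
Patents                                   Applications pending          0          0          100%    items
Patents                                   Granted                       0          0          100%    items
Technology transfer                       Number of cases               0          0          100%    items
Technology transfer                       Royalties                     0          0          100%    NT$ thousand
Project personnel (domestic nationals)    Master's students             1          1          100%    person-times
Project personnel (domestic nationals)    Doctoral students             1          1          100%    person-times
Project personnel (domestic nationals)    Postdoctoral researchers      0          0          100%    person-times
Project personnel (domestic nationals)    Full-time assistants          0          0          100%    person-times

International

Category                                  Item                          Achieved   Expected   Share   Unit
Publications                              Journal papers                1          1          20%     papers
Publications                              Research/technical reports    0          0          100%    papers
Publications                              Conference papers             1          1          100%    papers
Publications                              Books                         0          0          100%    chapters/books
Patents                                   Applications pending          0          0          100%    items
Patents                                   Granted                       0          0          100%    items
Technology transfer                       Number of cases               0          0          100%    items
Technology transfer                       Royalties                     0          0          100%    NT$ thousand
Project personnel (foreign nationals)     Master's students             0          0          100%    person-times
Project personnel (foreign nationals)     Doctoral students             0          0          100%    person-times
Project personnel (foreign nationals)     Postdoctoral researchers      0          0          100%    person-times
Project personnel (foreign nationals)     Full-time assistants          0          0          100%    person-times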


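Other results (results that cannot be expressed quantitatively, such as organizing academic events, awards received, major international collaborations, international impact of the research results, and other concrete benefits in assisting the development of industrial technology; please describe in text):

Additional items for Department of Science Education projects

Item                                                                 Quantity   Name or brief description
Assessment tools (qualitative and quantitative)                      0
Courses/modules                                                      0
Computer and network systems or tools                                0
Teaching materials                                                   0
Activities/competitions held                                         0
Seminars/workshops                                                   0
E-newsletters, websites                                              0
Number of participants (audience) in outreach of project results     0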
