Nonparametric Regression Analysis of the Area Under the Receiver Operating Characteristic Curve


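National Science Council, Executive Yuan: Research Project Final Report

Nonparametric Regression Analysis of the Area Under the Receiver Operating Characteristic Curve

Research Report (Condensed Version)

Project type: Individual

Project number: NSC 98-2118-M-004-004-

Project period: August 1, 2009 to July 31, 2010

Host institution: Department of Statistics, National Chengchi University

Principal investigator: 薛慧敏

Co-principal investigator: 張源俊

Project personnel: Master's student, part-time assistant: 劉世凰

Doctoral student, part-time assistant: 許嫚荏

Report attachments: Report on attending an international conference, and the paper presented

Handling: This project involves patents or other intellectual property rights; open to public inquiry after 2 years

October 28, 2010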


PARTIAL AREA UNDER THE ROC CURVE (PAUC)

Abstract. When multiple markers are associated with a disease, a combination of these markers often performs better than any individual marker, except in trivial cases. Here we focus on the class of linear combinations, which are easy to interpret. The AUC (area under the receiver operating characteristic curve) criterion, proposed for evaluating medical diagnostic tools, has become increasingly popular for assessing the discriminating power of a binary classification rule based on a continuous-scale score. However, in some real applications only a limited range of specificity is of research interest, and the partial AUC (pAUC) is then a more suitable criterion. The goal of this study is to find the linear combination that maximizes the pAUC. Under the normality assumption, the partial derivative of the pAUC is obtained and has a complex form, so finding the maximizer(s) is difficult. In this study, an algorithm is developed, and intensive numerical studies are conducted to assess it. The algorithm is found to have adequate empirical performance.

1. Method

Notation and Basic Results

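Consider p biomarkers used as diagnostic tools for a specific disease, where a large value of a biomarker favors a positive diagnosis. Let X and Y be the vectors of the p variables for the non-diseased and diseased populations, respectively. Suppose

$$X \sim MN(\mu_x, \Sigma_x), \qquad Y \sim MN(\mu_y, \Sigma_y),$$

where $\mu_x, \mu_y \in \mathbb{R}^p$ and $\Sigma_x$ and $\Sigma_y$ are $p \times p$ covariance matrices. Let $a \in \mathbb{R}^p$ be a vector of coefficients. Then, given $a$, the linear combinations $a^T X$ and $a^T Y$ follow

$$V_{\bar D} \equiv a^T X \sim N(a^T\mu_x,\ a^T\Sigma_x a), \tag{1.1}$$

$$V_D \equiv a^T Y \sim N(a^T\mu_y,\ a^T\Sigma_y a). \tag{1.2}$$

Define $\Delta\mu = \mu_y - \mu_x$, $Q_x = a^T\Sigma_x a$ and $Q_y = a^T\Sigma_y a$. Let $F_{\bar D}$ and $F_D$ be the distribution functions of $V_{\bar D}$ and $V_D$, and let $S = 1 - F$ denote the corresponding survival functions. It follows that, at fixed $a$, the ROC curve is

$$ROC(u; a) = S_D\left[S_{\bar D}^{-1}(u)\right] = 1 - \Phi\left[\frac{S_{\bar D}^{-1}(u) - a^T\mu_y}{\sqrt{Q_y}}\right] = 1 - \Phi\left[\frac{c(u)\sqrt{Q_x} - a^T\Delta\mu}{\sqrt{Q_y}}\right], \tag{1.3}$$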


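where $u$ is a given false positive rate and $c(u) = \Phi^{-1}(1-u)$. Su and Liu (1993) showed that the coefficients of the best linear combination with respect to the AUC are

$$a^* \propto \Sigma_x^{-1/2}\left(I + \Sigma_x^{-1/2}\Sigma_y\Sigma_x^{-1/2}\right)^{-1}\Sigma_x^{-1/2}\Delta\mu = (\Sigma_x + \Sigma_y)^{-1}\Delta\mu,$$

and the maximal AUC is

$$\Phi\left(\sqrt{\Delta\mu^T(\Sigma_x + \Sigma_y)^{-1}\Delta\mu}\right).$$

Further, the partial area under the ROC curve of the linear combination, for a given $t \in (0, 1)$, is

$$pAUC_t(a) = \int_0^t ROC(u; a)\,du = \int_0^t \left(1 - \Phi\left[\frac{c(u)\sqrt{Q_x} - a^T\Delta\mu}{\sqrt{Q_y}}\right]\right)du. \tag{1.4}$$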
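As an illustration of Eqs. (1.3) and (1.4), the pAUC of a given coefficient vector can be approximated by numerical integration. The sketch below is not part of the report: the helper names `dot`, `quad_form`, `roc` and `pauc`, the nested-list representation of matrices, and the choice of a midpoint rule with `n` panels are all assumptions made for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # cdf is Phi, inv_cdf is Phi^{-1}


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def quad_form(a, S):
    """Quadratic form a^T S a, with S given as a list of rows."""
    return dot(a, [dot(row, a) for row in S])


def roc(u, a, mu_x, mu_y, Sx, Sy):
    """ROC(u; a) from Eq. (1.3) under the binormal model."""
    d_mu = [my - mx for mx, my in zip(mu_x, mu_y)]
    qx, qy = quad_form(a, Sx), quad_form(a, Sy)
    c_u = STD_NORMAL.inv_cdf(1.0 - u)
    return 1.0 - STD_NORMAL.cdf((c_u * qx ** 0.5 - dot(a, d_mu)) / qy ** 0.5)


def pauc(a, mu_x, mu_y, Sx, Sy, t, n=2000):
    """pAUC_t(a) from Eq. (1.4), approximated by the midpoint rule on (0, t)."""
    h = t / n
    return h * sum(roc((k + 0.5) * h, a, mu_x, mu_y, Sx, Sy) for k in range(n))
```

For a single marker with $\mu_x = 0$, $\mu_y = 1$ and unit variances, `pauc([1.0], [0.0], [1.0], [[1.0]], [[1.0]], 1.0)` should be close to the full AUC $\Phi(1/\sqrt{2}) \approx 0.760$, which gives a quick sanity check of the integration.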

Optimal Linear Combination

To find the linear combination of biomarkers that maximizes the pAUC for a given $t$, we look for $a$ satisfying the first-order condition

$$\frac{\partial\, pAUC_t(a)}{\partial a} = 0. \tag{1.5}$$

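Theorem 1.1. The coefficient vector of the best linear combination of the p biomarkers, $a_0$, is proportional to

$$(w_1\Sigma_x + w_2\Sigma_y)^{-1}\Delta\mu, \tag{1.6}$$

where

$$w_1 = c_1\frac{a_0^T\Delta\mu}{Q_x + Q_y} + c_2 Q_y, \qquad w_2 = c_1\frac{a_0^T\Delta\mu}{Q_x + Q_y} - c_2 Q_x.$$

Here

$$c_1 = \sqrt{2\pi}\,\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right), \qquad c_2 = \sigma^2\exp\left[-\frac{(c(t) - \nu)^2}{2\sigma^2}\right]\cdot\frac{1}{\sqrt{Q_x}\,Q_y},$$

with $\nu = a_0^T\Delta\mu\,\sqrt{Q_x}/(Q_x + Q_y)$ and $\sigma^2 = Q_y/(Q_x + Q_y)$, where $Q_x$ and $Q_y$ are evaluated at $a_0$.

Proof. Let

$$A = \frac{a^T\Delta\mu}{Q_y}\Sigma_y a - \Delta\mu, \qquad B = \frac{\Sigma_x a}{\sqrt{Q_x}} - \frac{\sqrt{Q_x}}{Q_y}\Sigma_y a,$$

and $\nu = a^T\Delta\mu\,\sqrt{Q_x}/(Q_x + Q_y)$, $\sigma^2 = Q_y/(Q_x + Q_y)$. Then Eq. (1.5) can be shown to have the following form:

$$A\int_{c(t)}^{\infty}\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy + B\int_{c(t)}^{\infty} y\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy = 0. \tag{1.7}$$

Note that

$$\int_{c(t)}^{\infty}\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy = \sqrt{2\pi}\,\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right)$$

and

$$\int_{c(t)}^{\infty} y\exp\left[-\frac{(y - \nu)^2}{2\sigma^2}\right]dy = \sigma^2\exp\left[-\frac{(c(t) - \nu)^2}{2\sigma^2}\right] + \sqrt{2\pi}\,\nu\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right).$$

It follows that (1.7) becomes

$$c_1(A + B\nu) + c_2 B\sqrt{Q_x}\,Q_y = 0, \tag{1.8}$$

where

$$c_1 = \sqrt{2\pi}\,\sigma\,\Phi\left(\frac{\nu - c(t)}{\sigma}\right), \qquad c_2 = \sigma^2\exp\left[-\frac{(c(t) - \nu)^2}{2\sigma^2}\right]\cdot\frac{1}{\sqrt{Q_x}\,Q_y}.$$

Because

$$A + B\nu = \frac{a^T\Delta\mu}{Q_x + Q_y}(\Sigma_x + \Sigma_y)a - \Delta\mu, \qquad B\sqrt{Q_x}\,Q_y = Q_y\Sigma_x a - Q_x\Sigma_y a,$$

(1.8) becomes

$$c_1\left[\frac{a^T\Delta\mu}{Q_x + Q_y}(\Sigma_x + \Sigma_y)a - \Delta\mu\right] + c_2\left[Q_y\Sigma_x a - Q_x\Sigma_y a\right] = 0,$$

which implies that

$$c_1(\mu_y - \mu_x) = (w_1\Sigma_x + w_2\Sigma_y)a, \tag{1.9}$$

where

$$w_1 = c_1\frac{a^T\Delta\mu}{Q_x + Q_y} + c_2 Q_y, \qquad w_2 = c_1\frac{a^T\Delta\mu}{Q_x + Q_y} - c_2 Q_x.$$

It is known that the pAUC is invariant to the scale of $a$, so the best linear combination is

$$a_0 \propto (w_1\Sigma_x + w_2\Sigma_y)^{-1}(\mu_y - \mu_x).$$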
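The proof relies on two closed-form Gaussian tail integrals over $(c(t), \infty)$. A quick numerical check of these identities can be sketched as follows. The function names, the truncation of the upper limit at ten standard deviations, and the midpoint rule are assumptions made for illustration, and the integrand is read as $\exp[-(y-\nu)^2/(2\sigma^2)]$, which is the form obtained by completing the square in the product of the two normal densities.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf


def tail_integrals(c, nu, sigma):
    """Closed forms of int_c^inf w(y) dy and int_c^inf y*w(y) dy,
    where w(y) = exp(-(y - nu)^2 / (2 sigma^2))."""
    i0 = math.sqrt(2.0 * math.pi) * sigma * PHI((nu - c) / sigma)
    i1 = sigma ** 2 * math.exp(-(c - nu) ** 2 / (2.0 * sigma ** 2)) + nu * i0
    return i0, i1


def tail_integrals_numeric(c, nu, sigma, n=20000):
    """Midpoint-rule approximation of the same two integrals.
    The infinite upper limit is truncated (an assumption) at max(c, nu) + 10*sigma."""
    upper = max(c, nu) + 10.0 * sigma
    h = (upper - c) / n
    i0 = i1 = 0.0
    for k in range(n):
        y = c + (k + 0.5) * h
        w = math.exp(-(y - nu) ** 2 / (2.0 * sigma ** 2))
        i0 += w * h
        i1 += y * w * h
    return i0, i1
```

Calling both functions with, for example, `c = 0.8416` (approximately $c(0.2)$), `nu = 0.5` and `sigma = 0.7` should give pairs that agree to several decimal places.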


2. Algorithm

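From Theorem 1.1, solving for the optimal coefficient vector is equivalent, up to the scale of $a_0$, to a fixed-point problem,

$$a_0 = f(a_0),$$

where $f(a)$ denotes the right-hand side of (1.6) with the weights $w_1$ and $w_2$ evaluated at $a$. We consider the following "naive" iterative algorithm: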

Step 0. Calculate the coefficients $a^*$ of the best linear combination with respect to the AUC,

$$a^* = (\Sigma_x + \Sigma_y)^{-1}\Delta\mu.$$

Step 1. Use $a^*$ as the initial value, $a_0^{(0)} = a^*$, and compute the corresponding pAUC by (1.4). Denote this pAUC by $pAUC^{(0)}$.

Step 2. Calculate $f(a_0^{(0)})$. If $f(a_0^{(0)}) \cdot \Delta\mu > 0$, then $a_0^{(1)} = f(a_0^{(0)})$. Otherwise, $a_0^{(1)} = -f(a_0^{(0)})$.

Step 3. Normalize $a_0^{(1)}$ to have unit norm and compute the corresponding pAUC, $pAUC^{(1)}$.

Step 4. Calculate the increment $\delta^{(1)} = pAUC^{(1)} - pAUC^{(0)}$. Then

a. If $\delta^{(1)} < -\epsilon$, find the two most significant biomarkers according to their absolute magnitudes in $a_0^{(1)}$ and record their coefficients as the $2 \times 1$ vector $b^{(1)}$. Also define $b^{(0)}$ from $a_0^{(0)}$. Find the angle between $b^{(0)}$ and $b^{(1)}$. Rotate $b^{(1)}$ from $b^{(0)}$ by the same angle in the reverse direction. Renew $a_0^{(1)}$ by combining the rotated vector with the original $a_0^{(1)}$. Moreover, recalculate $pAUC^{(1)}$ and $\delta^{(1)}$.

b. If, at the first stage, $|\delta^{(1)}| < \epsilon$, again find $b^{(1)}$ and $b^{(0)}$. Rotate $b^{(0)}$ counterclockwise by some $\theta_1$. Renew $a_0^{(1)}$ by combining the rotated vector with the original $a_0^{(0)}$. Moreover, recalculate $pAUC^{(1)}$ and $\delta^{(1)}$. Repeat Steps 2-4.

• If $\delta^{(1)} > \epsilon$, repeat Steps 2-4. Otherwise, stop and let the last $a_0$ be the final solution.

• Note: Here $\epsilon = 10^{-8}$ and $\theta_1 = \pi/8$.
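The core of Steps 0-4 can be sketched in a few lines. This is a simplified illustration rather than the report's implementation: it omits the rotation corrections of Steps 4a and 4b, keeps the previous iterate when the pAUC does not improve by more than `eps`, and uses hypothetical helper names (`solve`, `f_map`, `naive_iteration`) together with a midpoint-rule pAUC. The weights follow Theorem 1.1, read with $\nu = a^T\Delta\mu\sqrt{Q_x}/(Q_x+Q_y)$ and $c_2$ scaled by $1/(\sqrt{Q_x}\,Q_y)$; that reading of the formulas is itself an assumption.

```python
import math
from statistics import NormalDist

N01 = NormalDist()


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def matvec(S, a):
    return [dot(row, a) for row in S]


def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [list(row) + [bi] for row, bi in zip(M, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        for r in range(i + 1, n):
            factor = A[r][i] / A[i][i]
            for k in range(i, n + 1):
                A[r][k] -= factor * A[i][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (A[i][n] - sum(A[i][k] * z[k] for k in range(i + 1, n))) / A[i][i]
    return z


def pauc(a, d_mu, Sx, Sy, t, n=2000):
    """Eq. (1.4) approximated by the midpoint rule on (0, t)."""
    qx, qy, m = dot(a, matvec(Sx, a)), dot(a, matvec(Sy, a)), dot(a, d_mu)
    h = t / n
    total = 0.0
    for k in range(n):
        c_u = N01.inv_cdf(1.0 - (k + 0.5) * h)
        total += 1.0 - N01.cdf((c_u * math.sqrt(qx) - m) / math.sqrt(qy))
    return h * total


def f_map(a, d_mu, Sx, Sy, t):
    """Right-hand side of (1.6), with w1 and w2 evaluated at a."""
    qx, qy, m = dot(a, matvec(Sx, a)), dot(a, matvec(Sy, a)), dot(a, d_mu)
    c_t = N01.inv_cdf(1.0 - t)
    nu = m * math.sqrt(qx) / (qx + qy)
    sigma = math.sqrt(qy / (qx + qy))
    c1 = math.sqrt(2.0 * math.pi) * sigma * N01.cdf((nu - c_t) / sigma)
    c2 = sigma ** 2 * math.exp(-(c_t - nu) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(qx) * qy)
    w1 = c1 * m / (qx + qy) + c2 * qy
    w2 = c1 * m / (qx + qy) - c2 * qx
    M = [[w1 * sx + w2 * sy for sx, sy in zip(rx, ry)] for rx, ry in zip(Sx, Sy)]
    return solve(M, d_mu)


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def naive_iteration(d_mu, Sx, Sy, t, eps=1e-8, max_iter=100):
    """Steps 0-4 without the rotation corrections of Steps 4a and 4b."""
    S_sum = [[sx + sy for sx, sy in zip(rx, ry)] for rx, ry in zip(Sx, Sy)]
    a = unit(solve(S_sum, d_mu))          # Steps 0-1: start from the AUC-optimal a*
    best = pauc(a, d_mu, Sx, Sy, t)
    for _ in range(max_iter):
        cand = f_map(a, d_mu, Sx, Sy, t)  # Step 2: apply f and fix the sign
        if dot(cand, d_mu) <= 0:
            cand = [-x for x in cand]
        cand = unit(cand)                 # Step 3: normalize
        value = pauc(cand, d_mu, Sx, Sy, t)
        if value - best <= eps:           # Step 4: stop when there is no further gain
            break
        a, best = cand, value
    return a, best
```

When $\Sigma_x = \Sigma_y$, every $f(a)$ is parallel to $\Sigma_x^{-1}\Delta\mu$, so the sketch stops immediately at the AUC-optimal direction; with unequal covariances the omitted rotation steps may matter, and the returned vector should be read as a candidate rather than a guaranteed maximizer.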

For a simulated data set with an equi-correlation structure, the algorithm is robust to the choice of initial value, and the pAUC increases monotonically over the iterations in every run. The algorithm performs well with p up to 100; see Case 1. However, for the two examples in Liu et al. (2005), the point to which the algorithm converges is sensitive to the initial value, and convergence is no longer monotone. In practice, we suggest using the coefficients $a^*$ of the best linear combination for the AUC of Su and Liu (1993) as the initial value; see the results of Cases 2 and 3.


Case 1: Slightly positive equi-correlation

Data: p = 100

• Mean: $\mu_x = 0$, $\mu_y = (0, \cdots, 0, 1.02, 1.04, \cdots, 2)^T$ (50 effective biomarkers).

• Variance: $\Sigma_x = I_p$. In $\Sigma_y$, $\sigma_{i,i} = 1$ and $\sigma_{i,j} = \rho$ for $i \neq j$. Here $\rho = \pm 0.2$.

• The cutoff for pAUC: t = 0.2.

• $\epsilon = 10^{-6}$.

Results:

Initial   Iterations for convergence   pAUC
V(1)      3                            0.2
V(2)      3                            0.2
V(p)      3                            0.2
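The Case 1 setting is easy to rebuild, which makes it a convenient check on the starting value $a^* = (\Sigma_x + \Sigma_y)^{-1}\Delta\mu$. The sketch below is an illustration only: it assumes $\rho = +0.2$ (the text lists $\pm 0.2$), uses a hypothetical pure-Python `solve` helper, and approximates the pAUC in (1.4) by a midpoint rule.

```python
import math
from statistics import NormalDist

N01 = NormalDist()
p, rho, t = 100, 0.2, 0.2  # rho = +0.2 is an assumption (the source lists +/-0.2)

# Mean difference (mu_x = 0): 50 zeros followed by 1.02, 1.04, ..., 2.00
d_mu = [0.0] * 50 + [1.0 + 0.02 * k for k in range(1, 51)]
Sx = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
Sy = [[1.0 if i == j else rho for j in range(p)] for i in range(p)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [list(row) + [bi] for row, bi in zip(M, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        for r in range(i + 1, n):
            factor = A[r][i] / A[i][i]
            for k in range(i, n + 1):
                A[r][k] -= factor * A[i][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (A[i][n] - sum(A[i][k] * z[k] for k in range(i + 1, n))) / A[i][i]
    return z


S_sum = [[sx + sy for sx, sy in zip(rx, ry)] for rx, ry in zip(Sx, Sy)]
a_star = solve(S_sum, d_mu)                      # a* = (Sx + Sy)^{-1} d_mu
max_auc = N01.cdf(math.sqrt(dot(d_mu, a_star)))  # Su and Liu's maximal AUC

# pAUC of a* from Eq. (1.4), midpoint rule on (0, t)
qx = dot(a_star, [dot(row, a_star) for row in Sx])
qy = dot(a_star, [dot(row, a_star) for row in Sy])
m = dot(a_star, d_mu)
n = 2000
h = t / n
pauc_star = h * sum(
    1.0 - N01.cdf((N01.inv_cdf(1.0 - (k + 0.5) * h) * math.sqrt(qx) - m) / math.sqrt(qy))
    for k in range(n)
)
print(f"max AUC = {max_auc:.6f}, pAUC at a* = {pauc_star:.4f}")
```

Under these assumptions the two populations are almost perfectly separated, so the pAUC at $a^*$ is already very close to its upper bound $t = 0.2$, in line with the pAUC column of the table.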

Case 2: Liu et al. (SIM, 2005)

Data: p = 4

• Mean: $\mu_x = (14.46, 23.89, 7.29, 17.68)^T$, $\mu_y = (15.61, 25.83, 8.2, 19.21)^T$.

• Variance:

$$\Sigma_x = \begin{pmatrix} 1.88 & -3.20 & -2.10 & -1.85 \\ -3.20 & 13.33 & 3.31 & 7.39 \\ -2.10 & 3.31 & 4.67 & 4.10 \\ -1.85 & 7.39 & 4.10 & 10.33 \end{pmatrix}, \qquad \Sigma_y = \begin{pmatrix} 13.56 & -7.28 & -5.79 & -6.95 \\ -7.28 & 23.23 & 7.12 & 6.60 \\ -5.79 & 7.12 & 5.86 & 3.85 \\ -6.95 & 6.60 & 3.85 & 13.71 \end{pmatrix}$$

• The cutoff for pAUC: t = 0.2.

• $\epsilon = 10^{-6}$.

Results:

1. Marginal pAUC: 0.0838, 0.0537, 0.0421, 0.0459.

2. Not robust to the initial value, and the convergence is not stable.

Initial   Iterations   pAUC     a0
V(1)      3            0.1150   (0.884, 0.217, 0.373, -0.175)
V(2)      2            0.0537   (0.000, 1.000, 0.000, 0.000)
V(4)      3            0.0454   (-0.096, 0.081, -0.052, 0.991)
a∗        3            0.1194   (0.880, 0.217, 0.417, -0.067)

Case 3: Coronary Heart Disease Example (Liu et al., 2005)

Data: p = 4

• Variance:

$$\Sigma_x = \begin{pmatrix} 0.0034 & -0.0004 & -0.0002 & -0.0051 \\ -0.0004 & 0.0285 & 0.0029 & 0.0417 \\ -0.0002 & 0.0039 & 0.0488 & 0.0268 \\ -0.0051 & 0.0417 & 0.0268 & 0.2846 \end{pmatrix}, \qquad \Sigma_y = \begin{pmatrix} 0.0043 & -0.0004 & -0.0002 & -0.0051 \\ 0.0033 & 0.0415 & 0.0019 & 0.0426 \\ 0.0006 & 0.0019 & 0.0389 & 0.0010 \\ 0.0067 & 0.0426 & 0.0010 & 0.1504 \end{pmatrix}$$

• The cutoff for pAUC: t = 0.2.

• $\epsilon = 10^{-6}$.

Results:

1. Marginal pAUC: 0.0331, 0.0392, 0.0230, 0.0176.

2. Not robust to the initial value.

Initial                    Iterations   pAUC             a0
V(1)                       4            0.0480           (0.9475, 0.3150, 0.0431, 0.0320)
V(2)                       3            0.0440           (0.8927, 0.4395, -0.0992, -0.0029)
V(4)                       2            0.0230           (0.0000, 0.0000, 1.0000, 0.0000)
a∗                         3            0.0480           (0.9476, 0.3150, 0.0432, 0.0321)
Su and Liu                              0.0384           (1.4600, 0.3400, 0.4117, 0.2216)
Su and Liu (scaled)                                      (0.9298, 0.2165, 0.2622, 0.1411)
Liu et al.                              0.0470 (0.019?)  (-0.8436, -3.2269, 0.2918, -0.1181)
Liu et al. (scaled)                                      (-0.2518, -0.9632, 0.0871, -0.0352)
Liu et al. (change sign)                0.0402


References

[1] Su, J. Q. and Liu, J. S. (1993). Linear combinations of multiple diagnostic markers. Journal of the American Statistical Association, 88, 1350-1355.

[2] Liu, A., Schisterman, E. F. and Zhu, Y. (2005). On linear combinations of biomarkers to improve diagnostic accuracy. Statistics in Medicine, 24, 37-47.


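National Science Council, Executive Yuan: Report on Attendance at an International Academic Conference by a Subsidized Domestic Scholar

Attachment 3 (Form Y04)

Report date: August 24, 2009

Reporter: 薛慧敏

Affiliation and title: Department of Statistics, National Chengchi University

Conference dates: August 2 to 5, 2009

Conference location: Washington, D.C., USA

NSC approval number: NSC 98-2118-M-004-004

Conference name: 2009 Joint Statistical Meetings

Title of paper presented: Unconditional exact test for comparison of two Poisson means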

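1. Conference participation

The first day was for registration. The following three days were spent attending conference sessions. Each day had four session slots, with sessions running in parallel in several lecture halls. In addition, poster presentations were held at the designated venue every morning and afternoon. I presented a poster on the afternoon of August 4; at other times I attended talks in the lecture halls, learned about the latest academic developments, and took part in discussions where appropriate.

2. Reflections on the conference

About 5,000 people attended this year's meeting, with active participation from experts and scholars in statistics and in computer science and information engineering. The conference offered a good opportunity for scholars from different fields to exchange and share theory, methods, and practical applications in statistical computing and computational statistics. I gained a great deal from this conference.

3. Site visits (omit if none)

None.

4. Suggestions

In recent years, for economic reasons, travel and living expenses for attending conferences in Europe and North America have risen year after year, and the NSC subsidy is usually insufficient. I suggest that the subsidy be raised appropriately in response to these circumstances, or that more flexibility be allowed in reallocating project funds, in order to encourage conference attendance.

5. Materials brought back

One printed volume and one CD. The printed volume is the conference program, and the CD contains the abstracts of the presented papers.

6. Other

None.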


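Summary Table of Research Outcomes, Research Project for Year 98 (2009)

Principal investigator: 薛慧敏

Project number: 98-2118-M-004-004-

Project title: Nonparametric Regression Analysis of the Area Under the Receiver Operating Characteristic Curve

Column definitions: "Achieved" is the number actually achieved (accepted or published); "Expected" is the expected total (including the number already achieved); "Contribution" is the actual percentage contribution of this project. The form also has a remarks column for qualitative notes (for example, results shared among several projects, or results featured as a journal cover story).

Scope     Category                                Item                          Achieved   Expected   Contribution   Unit
Domestic  Publications                            Journal papers                1          1          100%
Domestic  Publications                            Research/technical reports    0          0          100%
Domestic  Publications                            Conference papers             0          0          100%
Domestic  Publications                            Books                         0          0          100%
Domestic  Patents                                 Applications pending          0          0          100%
Domestic  Patents                                 Granted                       0          0          100%
Domestic  Technology transfer                     Number of cases               0          0          100%
Domestic  Technology transfer                     Royalties                     0          0          100%           NT$ thousand
Domestic  Project personnel (domestic nationals)  Master's students             1          1          100%           person-times
Domestic  Project personnel (domestic nationals)  Doctoral students             1          1          100%           person-times
Domestic  Project personnel (domestic nationals)  Postdoctoral researchers      0          0          100%           person-times
Domestic  Project personnel (domestic nationals)  Full-time assistants          0          0          100%           person-times
Foreign   Publications                            Journal papers                1          1          20%
Foreign   Publications                            Research/technical reports    0          0          100%
Foreign   Publications                            Conference papers             1          1          100%
Foreign   Publications                            Books                         0          0          100%           chapters/volumes
Foreign   Patents                                 Applications pending          0          0          100%
Foreign   Patents                                 Granted                       0          0          100%
Foreign   Technology transfer                     Number of cases               0          0          100%
Foreign   Technology transfer                     Royalties                     0          0          100%           NT$ thousand
Foreign   Project personnel (foreign nationals)   Master's students             0          0          100%           person-times
Foreign   Project personnel (foreign nationals)   Doctoral students             0          0          100%           person-times
Foreign   Project personnel (foreign nationals)   Postdoctoral researchers      0          0          100%           person-times
Foreign   Project personnel (foreign nationals)   Full-time assistants          0          0          100%           person-times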


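Other outcomes (outcomes that cannot be expressed quantitatively, such as organizing academic activities, receiving awards, important international collaborations, the international impact of the research results, and other concrete benefits in assisting industrial technology development; please describe in text.)

Item                                                         Quantity   Name or brief description
Assessment tools (qualitative and quantitative)              0
Courses/modules                                              0
Computer and network systems or tools                        0
Teaching materials                                           0
Activities/competitions held                                 0
Seminars/workshops                                           0
Newsletters, websites                                        0
Number of participants (audience) in outreach of results     0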


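National Science Council Subsidized Research Project Final Report: Self-Evaluation Form

Please give an overall assessment of how well the research content matches the original plan, whether the expected goals were achieved, the academic or applied value of the results (briefly describe their significance, value, impact, or potential for further development), whether they are suitable for publication in an academic journal or for a patent application, and the main findings or other relevant value.

1. Overall assessment of how well the research content matches the original plan and whether the expected goals were achieved

■ Goals achieved

□ Goals not achieved (please explain, within 100 characters)

□ Experiment failed

□ Experiment interrupted for some reason

□ Other reasons

Explanation:

2. Status of publication in academic journals, patent applications, and so on:

Papers: □ Published □ Unpublished manuscript ■ In preparation □ None

Patents: □ Granted □ Pending ■ None

Technology transfer: □ Transferred □ Under negotiation ■ None

Other: (within 100 characters)

3. Please assess the academic or applied value of the research results in terms of academic achievement, technical innovation, and social impact (briefly describe their significance, value, impact, or potential for further development) (within 500 characters)

Academic achievement: Provides suitable statistical methods for combining several feature variables in diagnosis or classification problems.

Technical innovation: Applicable to the statistical analysis of diagnostic data from clinical trials.

Social impact: Beneficial to the development of the nation's biotechnology-related industries.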
