• 沒有找到結果。

接受者操作特徵曲線之部份線下面積之極大化

N/A
N/A
Protected

Academic year: 2021

Share "接受者操作特徵曲線之部份線下面積之極大化"

Copied!
26
0
0

加載中.... (立即查看全文)

全文

(1)

行政院國家科學委員會專題研究計畫 期末報告

接受者操作特徵曲線之部份線下面積之極大化

計 畫 類 別 : 個別型

計 畫 編 號 : NSC 101-2118-M-004-004-

執 行 期 間 : 101 年 08 月 01 日至 102 年 07 月 31 日

執 行 單 位 : 國立政治大學統計學系

計 畫 主 持 人 : 薛慧敏

共 同 主 持 人 : 張源俊

計畫參與人員: 碩士班研究生-兼任助理人員:呂泓廷

碩士班研究生-兼任助理人員:蔡志旻

博士班研究生-兼任助理人員:許嫚荏

報 告 附 件 : 出席國際會議研究心得報告及發表論文

公 開 資 訊 : 本計畫涉及專利或其他智慧財產權,2 年後可公開查詢

中 華 民 國 102 年 09 月 05 日

(2)

中 文 摘 要 : 近年來由於生物技術上的重大進展,使得我們能夠以更低成

本蒐集到更高品質的資料。在科學研究中,研究人員可以同

時獲得多個變數的資料。如何有效率的捕捉資料中的重要資

訊以獲得其最佳表現為一重要課題。我們考慮某疾病診斷的

醫學研究,其中研究人員同時蒐集到一組疾病潛在因子變

數。此研究的目的是尋求此組因子的最佳線性組合,使得其

在該疾病的診斷上有最佳的診斷能力。在診斷能力上,我們

考慮的評估準則為接受者操作特徵函數之部分線下面積

(pAUC),其中評估的範圍僅限於在低偽陽率部分。過去在常

態假設下,給定母體參數值,我們曾提出此最佳線性組合的

充分條件,並發展其相關的計算方法。若母體參數為未知

時,我們考慮以樣本資料獲得的最大概似估計量(MLE)代入計

算中,則可獲得最佳線性組合的估計量。在此研究中,我們

將推導此估計量的統計性質,並發展統計檢定方法以檢定這

組潛在因子是否對該疾病具有診斷能力。我們將進行電腦模

擬以驗證理論推導結果,與評估所發展之檢定方法。最後也

將運用我們的方法在實際資料上,以判斷其實用性。

中文關鍵詞: 生物標記選取,最佳線性組合,ROC 曲線部分線下面積

英 文 摘 要 : A biomarker is usually used as a diagnostic or

assessment tool in medical

research. Finding a single ideal biomarker of a high

level of both sensitivity and

specificity is not an easy task; especially when a

high specificity is required for a

population screening tool. Combining multiple

biomarkers is a promising alternative

and can provide a better overall performance than the

use of a single biomarker.

It's known that the area under the receiver

operating characteristic (ROC) curve is

most popular for evaluation of a diagnostic tool. In

this study, we consider the criterion

of the partial area under the ROC curve (pAUC) for

the purpose of population

screening. Under the binormality assumption, we

obtain the optimal linear combination

of biomarkers in the sense of maximizing the pAUC

with a pre-specified

specificity level. Furthermore, statistical testing

procedures based on the optimal

(3)

discriminatory power of a biomarker

set and an individual biomarker, respectively.

Stepwise biomarker selections, by

embedding the proposed tests, are introduced to

identify those biomarkers of statistical

significance among a biomarker set. Rather than for

an exploratory study, our

methods, providing computationally intensive

statistical evidence, are more appropriate

for a confirmatory analysis, where the data has been

adequately filtered. The

applicability of the proposed methods are shown via

several real data sets with a

moderate number of biomarkers.

英文關鍵詞: Biomarker selection, optimal linear combination,

partial area under the ROC curve.

(4)

1

行政院國家科學委員會補助專題研究計畫

□期中進度報告

■期末報告

接受者操作特徵曲線之部份線下面積之極大化

計畫類別:■個別型計畫 □整合型計畫

計畫編號:NSC 101-2118-M-004 -004 -

執行期間: 101 年 8 月 1 日至 102 年 8 月 31 日

執行機構及系所:國立政治大學統計系、中央研究院統計科學研究所

計畫主持人:薛慧敏

共同主持人:張源俊

計畫參與人員:許嫚荏、呂泓廷

本計畫除繳交成果報告外,另含下列出國報告,共 一 份:

□移地研究心得報告

■出席國際學術會議心得報告

□國際合作研究計畫國外研究報告

處理方式:除列管計畫及下列情形者外,得立即公開查詢

□涉及專利或其他智慧財產權,□一年■二年後可公開查詢

中 華 民 國 102 年 9 月 5 日

(5)

Man-Jen Hsu, Yuan-Chin Ivan Chang and Huey-Miin Hsueh

Abstract A biomarker is usually used as a diagnostic or assessment tool in medical research. Finding a single ideal biomarker of a high level of both sensitivity and specificity is not an easy task; especially when a high specificity is required for a population screening tool. Combining multiple biomarkers is a promising alterna-tive and can provide a better overall performance than the use of a single biomarker. It’s known that the area under the receiver operating characteristic (ROC) curve is most popular for evaluation of a diagnostic tool. In this study, we consider the crite-rion of the partial area under the ROC curve (pAUC) for the purpose of population screening. Under the binormality assumption, we obtain the optimal linear com-bination of biomarkers in the sense of maximizing the pAUC with a pre-specified specificity level. Furthermore, statistical testing procedures based on the optimal linear combination are developed to assess the discriminatory power of a biomarker set and an individual biomarker, respectively. Stepwise biomarker selections, by embedding the proposed tests, are introduced to identify those biomarkers of statis-tical significance among a biomarker set. Rather than for an exploratory study, our methods, providing computationally intensive statistical evidence, are more appro-priate for a confirmatory analysis, where the data has been adequately filtered. The applicability of the proposed methods are shown via several real data sets with a moderate number of biomarkers.

Man-Jen Hsu

Department of Statistics, National ChengChi University, Taipei 11605, Taiwan, e-mail: 95354503@nccu.edu.tw

Yuan-Chin Ivan Chang

Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan; Department of Statistics, National ChengChi University, Taipei 11605, Taiwan. e-mail: ycchang@sinica.edu.tw

Huey-Miin Hsueh

Department of Statistics, National ChengChi University, Taipei 11605, Taiwan, e-mail: hsueh@nccu.edu.tw

(6)

1 Introduction

A biomarker is a biological indicator in showing absence, presence or the condition of a disease, and it can be used to determine the status of a subject, the effective-ness of a treatment, and so on. Ideally, a biomarker with both high sensitivity and specificity for accurate prediction is expected. However, it’s not easy to find such a biomarker in practice. Combining biomarkers provides an alternative to improve the performance of currently available ones. For example, the serum prostate-specific antigenPSA is a well-accepted prognostic biomarker to screen for prostate cancer. However, this test has a low specificity and therefore might lead to over-diagnosis. Besides PSA, several potentials are investigated, please see [11]. Nevertheless, no single biomarker among them outperforms PSA, and therefore, more investigators propose the use of a combination of PSA and others. Please see [1] and [9].

The ROC curve is the most popular graphical tool in evaluation of the diagnostic power of a biomarker. It provides an exhaustive look at the relationship between sensitivity and specificity of a biomarker. The area under the ROC curve (AUC) is proposed for an efficient summarization. In some applications, investigators place their emphasis only on a part of the curve. For example, a high level of specificity is required for a biomarker serving as a population screening tool. As a consequence, a biomarker is assessed on the partial area under the ROC curve (pAUC) in a region of specificity above some level. See [13] and the reviews by [4] and by [16].

This study focuses on combining multiple continuous-scaled biomarkers into one single diagnostic or predictive rule for disease with emphases on assessment of each biomarker. For better interpretability, we propose the use of a linear combination for information summarization. The discriminatory power of a linear combination of biomarkers is evaluated on the pAUC. The optimal linear combination, which provides the best discriminatory power among all combinations, is the target solu-tion of research interest. In addisolu-tional to the global predictability, some insights on the importance of an individual biomarker can be obtained from the coefficients. However, it needs to incorporate sampling variation for statistical significance.

In presence of multiple biomarkers, a traditional way is fitting a multiple logistic regression model to the data set for medical diagnosis. For example, see [19]. Alter-natively, seeking the maximal discriminatory power, [17] derived the explicit form of the best linear combination in terms of AUC under a binormal model. Following their study, [6] found a solution, that dominates any others in some scenarios. Nev-ertheless, the dominating scenarios are not universal. [12] and [14] proposed the use of empirical AUC estimates in finding the optimal linear combination. In our ear-lier study, we found that not only the analytical derivation, but also the computation became much more complicated under the pAUC criterion, please refer [2].

Recently, due to newer and better biotechnology, big data are generated easily and related analytical tools are demanding. In developing a binary classification, which is parallel to a diagnostic rule, several algorithm-based approaches have been proposed by directly using either AUC or pAUC as the objective function, such as [7], [8], [21], [5], [10], [15], [3], [20]. However, these algorithm-based methods are unable to accommodate statistical evidence into variable selection. It motivates

(7)

our study in developing some stepwise approaches, embedding adequate statistical tests, to identify important biomarkers for data sets of a moderate size. In which, a biomarker is discarded or selected based on the statistical evidence from data, not only on a computational prospect.

The paper is organized as follows. In Section 2, the sample version of the opti-mal linear combination will be defined. In Section 3, some testing procedures for the global and individual discriminatory power will be proposed. Furthermore, two biomarker selection approaches adopting the proposed tests will be developed. Real example analyses are given in Section 4. We then conclude this paper with a discus-sion in Section 5.

2 Strong Consistency of the Linear Combination Estimator

Maximizing the pAUC

Let X be a random vector of p biomarkers related to the disease of a subject, and Dbe the binary disease status, where D = 1 indicates a subject from the diseased population, D = 0 indicates a subject from the non-diseased population. Suppose that

X|D = d ∼ MV N(µd, Σd), d = 0, 1,

where Σ0and Σ1are positive definite. For any given real vector a ∈ Rp,

aTX|D = d ∼ N(aTµd, Qd),

where Qd= aTΣda, for d = 0, 1. Let Φ(·) denote the cumulative distribution

func-tion of N(0, 1) and Φ−1(·) be its inverse function. Let c(u) = Φ−1(1 − u) and ∆µ= µ1− µ0, then at specificity (1 − u), the sensitivity of aTX is equal to

F(a, u) = Φ a T µ− c(u) √ Q0 √ Q1  .

Therefore, for a false positive rate region (0,t) for some predetermined t ∈ (0, 1), the pAUC of aTX is equal to

pAUC(a) = Z t

0

F(a, u)du. (1)

Similar to AUC, the pAUC has the scale invariant property. For identification pur-poses, in this study the search for the optimal linear combination vector is restricted to the hyper-sphere with an unit radius. Let a∗be such a pAUC maximizer; that is,

a∗= arg maxa∈Ep pAUC(a), where Ep= {a| kak = 1, a ∈ Rp}.

(8)

When the population parameters are unknown, the maximum likelihood esti-mates (MLEs) are employed in a sample version of the optimization problem. As-sume two independent random samples of n0, n1are drawn from the non-diseased

and diseased populations, respectively. The estimated mean vectors and covariance matrices are respectively denoted as follows: ˆµ0, ˆµ1, and ˆΣ0, ˆΣ1. Moreover, let

ˆ

∆µ = ˆµ1− ˆµ0and ˆQd= aTΣˆda, for d = 0, 1. Replacing the unknown parameters

in (1) by their corresponding MLEs defined above, we have a sample version of pAUC below: d pAUCn(a) = Z t 0 ˆ Fn(a, u)du, (2) where ˆ Fn(a, u) = Φ aTˆ µ− c(u)p ˆQ0 p ˆQ1 ! .

Thus, a∗of the optimal linear combination are estimated by the maximizer of (2): ˆan= arg maxa∈Ep pAUCd n(a).

The theorem below shows that ˆanis a strong consistent estimator of a∗.

Theorem 1. Assume that pAUC(a) in (1) is a continuous function of a and has a unique maximizera∗in Ep. Thenˆan→ a∗with probability 1 as n0, n1→ ∞.

In real applications, some of the coefficients in the best linear combination were found to be nearly zero. Numerically, their corresponding biomarkers might have limited contribution to the combination and thus to the disease prediction. In the following section, we will discuss how to assess the significance of biomarkers ob-tained in our maximizing procedure in terms of their discriminatory power. The pro-posed testing procedures will be embedded into our biomarker selection approaches in order to find a compact biomarker set, which consists of only significant biomark-ers in disease diagnosis.

3 Hypothesis Testing and Biomarker Selection

3.1 Testing the Discriminatory Power

Considering only the class of linear combinations, we evaluate the global discrimi-natory power of a set of p ≥ 1 biomarkers, X, by testing the following hypotheses:

H0,g: The biomarker set has no discriminatory power to the disease

versus

(9)

The null hypothesis H0,g is true if and only if the optimal linear combination of

the biomarker set has no discriminatory power. Or equivalently, the maximal pAUC that the set can achieve through its linear combinations is not greater than the ref-erence limit t2/2, which is the pAUC value of the non-informative diagnosis with a

diagonal ROC curve. That is,

H0,g: pAUC(a∗) ≤

t2

2 versus H1,g: pAUC(a

) >t2

2.

By maximizing the sample pAUC defined in (2), we obtain the maximal sample pAUC and use it as the test statistic. That is,

Tg= max a∈Ep

d

pAUCn(a) = dpAUCn(ˆan) = Z t 0 Φ ˆa T n∆ˆµ− c(u)p ˆQ0 p ˆQ1 ! du.

The null hypothesis H0,gis rejected if Tgis sufficiently large.

Let XT = (XTi−, Xi), we consider to assess the contribution of Xigiven the

exis-tence of other biomarkers, Xi−. The following hypothesis is

H0,c: Given Xi−, Xihas no discriminatory power to the disease.

The coefficients of the optimal linear combination of X are written as a∗T = (a∗T

i−, a ∗

i), where a∗i is the corresponding coefficient of Xi. In this problem, we

pro-pose evaluating the biomarker Xifrom a∗i. That is, H0,cis equivalent to

H0,c: a∗i = 0.

The test statistic is the estimator of a∗i, denoted by Tc,i= ˆan,i. The null hypothesis

H0,cis then rejected if Tc,iis either too small or too large.

Due to the complex formulation of the test statistics, the null distribution and the critical values are estimated by a parametric bootstrapping method. Under the null hypothesis, the sampling distribution of the test statistic is estimated. Consider drawing two independent random samples of size n1 and n0 from the estimated

null distribution. Then using the bootstrap samples to find the test statistic. Repeat the sampling B times. The critical value(s) is(are) then equal to the correspondent percentile(s) among these values.

3.2 Biomarker Selection

We now turn to the biomarker selection problem. Assume that X is the vector of the full biomarker set and let ˆaT

n = ( ˆan,1, . . . , ˆan,p) be the estimate of the optimal

linear combination as before. First the biomarkers of X are ordered according to their corresponding | ˆan,i| values in an ascending order. Denoted the revised vector

(10)

standard-ization a priori. We consider two stepwise selection methods: the Forward and the Backward approaches. Define A as the set of biomarkers under consideration for the disease diagnosis in each step for convenience. After successive evaluations of every individual biomarker, we conclude that the biomarkers remaining in A have a significant contribution to the linear combination in terms of pAUC. The details are presented below:

Forward Method

Step 1. Set A = /0. Test the marginal effect of X(p)with respect to H0,(p): X(p)has no discriminatory power. If H0,(p)is rejected, add X(p)to A. Go to the next step. Step 2. Test the significance of X(p−1)with respect to

H0,(p−1): Given A, X(p−1)has no discriminatory power. If H0,(p−1)is rejected, add X(p−1)to A. Go to the next step.

.. .

Step p. Test the significance of X(1)with respect to

H0,(1): Given A, X(1)has no discriminatory power.

If H0,(1)is rejected, add X(1)to A. Stop.

Backward Method

Step 0. Set A = {X}. Test the global effect of A with respect to H0,(0): A has no discriminatory power.

If H0,(0)is rejected, go to the next step; otherwise, stop and conclude A = /0. Step 1. Assess X(1)by removing X(1)from A and test the hypothesis,

H0,(1): Given A, X(1)has no discriminatory power.

If H0,(1)is rejected, add X(1)to A. Go to the next step. Step 2. Assess X(2)by removing X(2)from A and test the hypothesis,

H0,(2): Given A, X(2)has no discriminatory power. If H0,(2)is rejected, add X(2)to A. Go to the next step.

.. .

(11)

Step p. Assess the effect of X(p). If A = {X(p)}, stop; otherwise, remove X(p)from A and

test the following null hypothesis,

H0,(p): Given A, X(p)has no discriminatory power.

If H0,(p)is rejected, add X(p)to A. Stop.

4 Applications to Real Data Sets

We apply our procedures to two real examples in [6] and [18]. To have a fair compar-ison among biomarkers, a standardization is conducted before biomarker selection. In the following, every biomarker in the raw data subtracts the non-diseased group mean and then divides by its pooled sample standard deviation from the two groups for a more unified unit across biomarkers.

The first example is a study of Duchenne Muscular Dystrophy (DMD).The DMD carriers generally are elevated by certain serum enzymes, not by physical symptoms. The measurements of 3 biomarkers of DMD of 87 normal and 38 carrier females were collected in this data set. The sample means of the three biomarkers in the normal and carrier groups are, respectively, ˆµ0= (3.3932, 4.5213, 2.4863)T, ˆµ1=

(4.7615, 4.5228, 3.0105)T; and the sample covariance matrices are

ˆ Σ0=   .0316 −.0039 .0024 −.0039 .0065 .0006 .0024 .0006 .0113  , ˆΣ1=   .7683 −.0050 .3054 −.0050 .0094 −.0064 .3054 −.0064 .2268  .

After data standardization, Table 1 presents the results of biomarker selection at stepwise α = 5%. Both Forward and Backward approaches select the first and the third biomarkers. We find that the decrement in the pAUC by removing the second biomarker is slim.

In another real example, we consider four biomarkers lutein, TBARS, HDL cholesterol, and uric acid for construction of a classification tool for atheroscle-rotic coronary heart disease. A cohort of 434 subjects were selected for the analysis yielding 72 cases and 362 controls. One obtains an insignificant conclusion in test-ing the null hypothesis of normality. For the non-diseased and diseased groups, the estimated means of the four markers are ˆµ0= (0.1275, 0.8845, 4.0766, 6.7724)T,

ˆ

µ1= (0.1402, 0.9337, 4.1225, 6.9112)T; and the two sample covariance matrices

are ˆ Σ0=     .0034 −.0004 −.0002 −.0051 −.0004 .0285 .0039 .0417 −.0002 .0039 .0488 .0268 −.0051 .0417 .0268 .2846     , ˆΣ1=     .0043 .0033 .0006 .0067 .0033 .0415 .0019 .0426 .0006 .0019 .0389 .0010 .0067 .0426 .0010 .1504     .

(12)

From Table 1, after data standardization, we obtain a different optimal linear com-bination of the full data set, in which the impact of the first biomarker lutein is downsized, while that of the other three are increased. Before the biomarker selec-tion, the first two biomarkers, lutein and TBARS, seem to be important to the disease from the magnitudes of their coefficients. However, after the biomarker selection, the two stepwise selections produce the same conclusion that only the biomarker lutein achieves statistical significance, as seen in Table 1.

Table 1 The estimated best linear combination and the corresponding pAUC for the specificity range (0.9,1) in DMD and atherosclerotic coronary heart disease examples.

Case Method a1ˆ a2ˆ a3ˆ a4ˆ pAUCd n

DMD Full set (raw) 0.8350 0.5116 0.2026 - 0.0888

Full set (Standardized) 0.9895 0.0653 0.1292 - 0.0888

Forward Selection 0.9657 0.0000 0.2597 - 0.0885

Backward Selection 0.9657 0.0000 0.2597 - 0.0885

Heart disease Full set (raw) 0.9447 0.3258 0.0265 0.0274 0.0165 Full set (Standardized) 0.7079 0.6754 0.0834 0.1890 0.0165 Forward Selection 1.0000 0.0000 0.0000 0.0000 0.0099 Backward Selection 1.0000 0.0000 0.0000 0.0000 0.0099

5 Discussion

In this study, we focus on disease diagnosis with the presence of multiple biomark-ers. We consider the class of linear combinations for an effective and easy-to-interpret summarization of the multiple biomarkers. The diagnostic power of a linear combination is evaluated upon its pAUC over a clinically relevant threshold region. In specific, we consider the requirement of a high specificity for the purpose of population screening.

Under the binormality assumption, the pAUC of a linear combination is esti-mated by employment of MLEs of the population parameters. In addition, the strong consistency of the estimated optimal linear combination is proved. We introduce a testing procedure to assess the overall diagnostic power of a set of biomarkers from the greatest pAUC it can achieve in the class of linear combinations. Further-more, a testing procedure for determining the conditional contribution of a single biomarker given the existence of other biomarkers is developed. The parametric bootstrap method is applied to find the critical value(s) of the tests. These proposed tests are then embedded in two biomarker selection approaches. The applicability of the proposed methods are illustrated by two real data sets.

Differing with other algorithm-based marker selection approaches, the proposed methods select or discard a biomarker based upon the evidence of statistical sig-nificance. As a trade-off, to acquire statistical evidence, our methods necessarily involve many computations. As such, it decreases the feasibility of these methods

(13)

for big data sets. Consequently, our methods are less appropriate in an exploratory study. We suggest the application of adequate data filtering for dimension reduc-tion prior to advanced statistical confirmatory analysis, such as the construcreduc-tion of a diagnostic rule.

References

1. ETZIONI, R., KOOPERBERG, C., PEPE, M., SMITH, R.ANDGANN, P. H. (2003). Combin-ing biomarkers to detect disease with application to prostate cancer. Biostatistics 4, 523-538. 2. HSU, M.-J. AND HSUEH, H.-M. (2012). The linear combinations of biomarkers which maximize the partial area under the ROC curves. Computational Statistics, to appear. DOI: 10.1007/s00180-012-0321-5.

3. KOMORI, O.ANDEGUCHI, S. (2010). A boosting method for maximizing the partial area under the ROC curve. BFC Bioinformatics 11, 314-330.

4. LASKO, T. A., BHAGWAT, J. G., ZOU, K. H.ANDOHNO-MACHADOL. (2005). The use of receiver operating characteristic curves in biomedical informatics. Journal of Biomedical Informatics38, 404-415.

5. LIN, H., ZHOU, L., PENG, H.AND ZHOU, X.-H. (2011). Selection and combination of biomarkers using ROC method for disease classification and prediction. The Canadian Jour-nal of Statistics39, 324-343.

6. LIU, A., SCHISTERMAN, E. F.ANDZHU, Y. (2005). On linear combinations of biomarkers to improve diagnostic accuracy. Statistics in Medicine 24, 37-47.

7. MA, S.AND HUANG, J. (2005). Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics 21, 4356-4362.

8. MA, S.ANDHUANG, J. (2007). Combining multiple markers for classification Using ROC. Biometrics63, 751-757.

9. MADU, C. O.ANDLU, Y. (2010). Novel diagnostic biomarkers for prostate cancer. Journal of Cancer1, 150-177.

10. MARROCCO, C., DUIN, R. P. W.ANDTORTORELLA, F. (2008). Maximizing the area under the ROC curve by pairwise feature combination. Pattern Recognition 41, 1961-1974. 11. National Cancer Institute: PDQ Prostate Cancer Screening. Bethesda, MD: National CancerR

Institute. Date last modified h06/08/2012i. Available at:

http://www.cancer.gov/cancertopics/pdq/screening/prostate/ HealthProfessional/Page3#Section_67

12. PEPE, M. S.ANDTHOMPSON, M. L. (2000). Combining diagnostic test results to increase accuracy. Biostatistics 1, 123-140.

13. PEPE, M. S., LONGTON, G., ANDERSON, G. L.AND SCHUMMER, M. (2003). Selecting differentially expressed genes from microarray experiments. Biometrics 59, 133-142. 14. PEPE, M. S., CAI, T.ANDLONGTON, G. (2006). Combining predictors for classification

using the area under the receiver operating characteristic curve. Biometrics 62, 221-229. 15. RICAMATO, M. T.ANDTORTORELLA, F. (2011). Partial AUC maximization in a linear

com-bination of dichotomizers. Pattern Recognition 44, 2669-2677.

16. ROBIN, X., TURCK, N., HAINARD, A., TIBERTI, N., LISACEK, F., SANCHEZ, J.-C.AND

MULLER, M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BFC Bioinformatics 12, 77-84.

17. SU, J. Q.ANDLIU, J. S. (1993). Linear combinations of multiple diagnostic markers. Jour-nal of the American Statistical Association88, 1350-1355.

18. TIAN, L. (2010). Confidence interval estimation of partial area under curve based on com-bined biomarkers. Computational Statistics & Data Analysis 54, 466-472.

19. TURCK, N., VUTSKITS, L., SANCHEZ-PENA, P., ROBIN, X., HAINARD, A., GEX -FABRY, M., FOUDA, C., BASSEM, H., MUELLER, M., LISACEK, F., PUYBASSET, L.AND

(14)

SANCHEZ, J.-C. (2010). A multiparameter panel method for outcome prediction following aneurysmal subarachnoid hemorrhage. Intensive Care Medicine 36, 107-115.

20. WANG, Z.ANDCHANG, Y.-C. I. (2011). Marker selection via maximizing the partial area under the ROC curve of linear risk scores. Biostatistics 12, 369-385.

21. ZHOU, X. H., CHEN, B., XIE, Y. M., TIAN, F., LIU, H.ANDLIANG, X. (2012). Variable selection using the optimal ROC curve: An application to a traditional Chinese medicine study on osteoporosis disease. Statistics in Medicine 31, 628-635.

(15)

1

國科會補助專題研究計畫出席國際學術會議心得報告

日期: 102 年 9 月 5 日

一、參加會議經過

本人於 8/27 至 8/30 期間赴香港參與此會議,大會安排每天上午、下午各有一至

二節約二小時的研討會。另外在此會議中,本人以張貼海報方式在 8/29 發表研究成

果。

二、與會心得

在本次會議中,本人參加了有關分配理論、天文統計、統計教學與地理資訊系統,

以及目前學者們在統計推論上的一些看法。總結來說,此次會議對本人收益良多,

期望能對未來的研究與教學上有所啟發。

計畫編號

NSC 101- 2118- M- 004- 004-

計畫名稱

接受者操作特徵曲線之部份線下面積之極大化

出國人員

姓名

薛慧敏

服務機構

及職稱

政治大學統計系副教授

會議時間

102 年 8 月 25 日

102 年 8 月 30 日

會議地點

香港

會議名稱

(中文)第五十九屆世界統計會議

(英文)59

th

World Statistics Congress

發表題目

(中文)運用 RNA-Seq 資料偵測差異表現的基因

(16)

2

Recently, the RNA-Seq experiment is developed for a high-throughput DNA sequencing

method for mapping and quantifying the transcriptomes. The gene expression level obtained

from a RNA-Seq experiment is of the count data type and is often fitted by a

Negative-Binomial distribution to account for over-dispersion. To find the differentially

expressed genes with a binary phenotypic response, we aim to develop a statistical test

for comparing the means of two Negative-Binomial distributions. A Wald test statistic

based on the pseudo maximum likelihood estimators is proposed. A numerical study is

performed for justification of the proposed test. The applicability of the proposed method

is demonstrated via the data analysis of two real example data sets.

全文另附。

四、建議

無。

五、攜回資料名稱及內容

大會議程。

六、其他

無。

(17)

Identifying the differentially expressed genes with RNA-Seq data

Hung-Ting Lu, Huey-Miin Hsueh

Department of Statistics, National Chengchi University, Taipei, Taiwan

Corresponding author:Huey-Miin Hsueh, email:hsueh@nccu.edu.tw

April 22, 2013

Abstract

Recently, the RNA-Seq experiment is developed for a high-throughput DNA sequencing method for mapping and quantifying the transcriptomes. The gene expression level obtained from a RNA-Seq experiment is of the count data type and is often fitted by a Negative-Binomial distribution to account for over-dispersion. To find the differentially expressed genes with a binary phenotypic response, we aim to develop a statistical test for comparing the means of two Negative-Binomial distributions. A Wald test statistic based on the pseudo maximum likelihood estimators is proposed. A numerical study is performed for justification of the proposed test. The applicability of the proposed method is demonstrated via the data analysis of two real example data sets.

Keywords: Differentially expressed genes, negative-binomial, overdispersion, pseudo

maxi-mum likelihood estimator, RNA-Seq experiment.

1

Introduction

To quantify the transcripts in a cell, the hybridization-based approaches, such as microarrays,

have become prominent due to the fact that they are high throughput and cost less. On the

other hand, although the traditional sequence-based approach can directly provide the cDNA

sequence, it is of limited use because it is relatively less effective and expensive. Recently, the

RNA-Seq experiment is developed for a high-throughput DNA sequencing method for mapping

and quantifying the transcriptomes. The method improves the existing microarray approaches

in terms of providing more accurate signals, being not limited to existing genomic sequence and

requiring less RNA sample. The RNA-Seq is believed to replace microarrays when the sequencing

cost reduces. Please see Wang, Gerstein and Snyder (2009) for a thorough description of the

experiment.

This study aims to develop a statistical testing procedure to determine whether a gene is

differentially expressed in different experimental/phenotypic conditions. For simplicity, we

con-sider a binary phenotypic response variable. It’s known that the gene expression level obtained

from a RNA-Seq experiment, differing with that from a microarray experiment, is the

num-ber of repeated reads and hence is of the count data type. The researchers assume a poisson

population, or a negative-binomial population to account for over-dispersion, for the expression

level of an individual gene. See Robinson and Smyth (2007, 2008), Robinson et al. (2010), Li

et al. (2012). Due to having simpler formulations, most of the existing methods consider the

likelihood ratio test or the score test. This study propose testing the hypothesis by directly

using the maximum likelihood estimators of the mean expression levels in the two groups under

negative-binomial distributions. However, with the existence of the dispersion parameter, the

(18)

lihood equation and the Stirling’s formula approximation are employed. See Piegorsch (1990)

and Nakashima (1997). The estimation provides a clear picture of the magnitude of the gene

expression. In addition, Nakashima (1997) proved that the resultant estimators have the

prop-erty of the asymptotic normality. As a consequence, we develop a chi-square test for identifying

the differentially expressed genes. Numerical studies, including simulations and real examples,

are performed to justify the proposed test.

2

Method

Denote the G

× 1 random vector, Y

i,j

= (Y

i,j,1

, . . . , Y

i,j,G

)

T

, as the G counts of the j-th subject

in the i-th phenotypic group, for j = 1, . . . , n

i

, i = 1, . . . , k. Here for simplicity, we consider a

binary phenotypic group, k = 2. Assume for each i = 1, 2, g = 1, . . . , G, Y

i,j,g

have the following

a negative-binomial distribution,

Y

i,j,g

∼ ind. NB(µ

i,j,g

, φ

i,g

), for j = 1, . . . , n

i

,

where µ

i,j,g

is the mean parameter and φ

i,g

is the over-dispersion parameter. Note that under

this distributional model,

E(Y

i,j,g

) = µ

i,j,g

, V ar(Y

i,j,g

) = µ

i,j,g

+ φ

i,g

µ

2i,j,g

.

As φ

i,g

tends to zero, a negative-binomial distribution becomes a Poisson distribution. Moreover,

by taking heterogeneous sequence-depths into account, assume for each i = 1, 2, j = 1, . . . , n

i

,

there exists m

i,j

such that

µ

i,j,g

= m

i,j

λ

i,g

,

where m

i,j

is the depth of the j-th sequence in the group i, and λ

i,g

is the mean relative

abundance of the g-th gene in the group i.

The research interest is to determine whether a gene has differential expression levels between

the two phenotypic groups. The following hypotheses are tested: For each g = 1, . . . , G,

H

0,g

: λ

1,g

= λ

2,g

v.s. H

1,g

: λ

1,g

6= λ

2,g

.

We propose using the Wald’s test statistic based on the pseudo maximum likelihood estimators

(PMLEs) to test the hypotheses. Let

Z

g

=

ˆ

λ

1,g

− ˆλ

2,g

ˆ

SE(ˆ

λ

1,g

− ˆλ

2,g

)

,

where ˆ

λ

i,g

are the PMLEs of λ

i,g

, and ˆ

SE(ˆ

λ

1,g

− ˆλ

2,g

) is a consistent estimate of the asymptotic

standard error of ˆ

λ

1,g

− ˆλ

2,g

for i = 1, 2, g = 1, . . . , G. Asymptotically, under H

0,g

, Z

g

∼ N(0, 1).

Consequently, given an observed Z-value, z

g0

the asymptotic p-value is found as 2(1

− Φ(|z

g0

|)),

where Φ(

·) is the distribution function of N(0, 1).

Note that in solving the PMLEs, a large observed transcript value of Y

i,j,g

results in a

computational difficulty. From Stirling’s formula, the following approximation is employed to

overcome the difficulty,

(

N

k

)

N

k

k!

, as N

→ ∞.

2

(19)

error rate for simultaneously testing a numerous number of hypotheses. In literature, the

ad-justed p-value proposed by Benjamini and Hochberg (1995) and the local false discovery rate

proposed by Storey (2003) are two of the most popular multiple testing approaches. In this

study, the two approaches are applied on the obtained asymptotic p-values of Z

g

’s.

3

Results

3.1

Simulations

We consider the simulation study in Auer and Doerge (2011). The transcript levels of ten

thousands genes are generated according to the following model:

Y

i,j,g

indep.

P oisson(λ

i,g

ν

i,j,g

), i = 1, 2, j = 1, . . . , n

i

, g = 1, . . . , 10, 000,

where λ

i,g

is the mean transcript of the g

−th gene in the group i. Suppose the first eight

thou-sands genes are not differentially expressed in the two groups, while the others are differentially

expressed. For each g = 1, . . . , 10, 000,

λ

1,g

i.i.d.

∼ exp(Pareto(location=3, shape=7)).

Furthermore, for g = 1, . . . , 8000, λ

2,g

= λ

1,g

; for g = 8001, . . . , 10000,

λ

2,g

i.i.d.

∼ exp(Pareto(location=3, shape=7)).

On the other hand, ν

i,j,g

is the overdispersion parameter. For each gene, we generate a Bernoulli

random variable D

g

with success probability p, p

∈ (0, 1) to determine the occurrence of the

overdispersion of the gene. The genes of D

g

= 1 are overdispersed. If D

g0

= 1 for some gene g

0

,

we then generate ν

i,j,g0

from the following Gamma distribution,

ν

i,j,g0 i.i.d.

∼ Gamma

(

λ

i,g0

φ

− 1

,

φ

− 1

λ

i,g0

)

, j = 1, . . . , n

i

,

for each i = 1, 2. If D

g

= 0, ν

i,j,g

= 1. We consider p = 0 for the scenario without overdispersion,

and p = 0.5 for the scenario with overdisperion.

Consider n

1

= n

2

= 10.

One hundred

replications are generated. The true FDR of the proposed method is estimated via taking an

average over the replications. Furthermore, the average over the 100 estimated FDR at each

successive rejection is reported as well. The performance of the edgeR method by Robinson et

al. (2010) and the PoissonSeq method by Li et al. (2012) are also presented for a comparison.

Robinson et al. (2010) suggested the use of the BH adjusted p-value by Benjamini and Hochberg

(1995). As well as the BH adjusted p-value, the q-value by Storey (2003) is also employed for

our PMLE method.

The results are plotted in Figure 1, in which the top panel considers the scenario without

overdispersion and the bottom panel considers the scenario with overdispersion. From (a) and

(c), we observe that our test has a slightly higher true FDR than the other two tests. From (b)

and (d), both multiplicity adjustments underestimate the true FDR and hence tend to produce

liberal results.

(20)

0 2000 4000 6000 8000 10000 0.0 0.2 0.4 0.6 0.8 Numbered Called FDR pMLE poissonseq edgeR (a) 0 2000 4000 6000 8000 10000 0.0 0.2 0.4 0.6 0.8 Numbered Called FDR True qvalue BH (b) 0 2000 4000 6000 8000 10000 0.0 0.2 0.4 0.6 0.8 1.0 Numbered Called FDR pMLE poissonseq edgeR (c) 0 2000 4000 6000 8000 10000 0.0 0.2 0.4 0.6 0.8 1.0 Numbered Called FDR True qvalue BH (d)

Figure 1: The top panel gives the FDR curves when the overdispersion does not exist, while the

bottom gives the FDR curves when there is an overdispersion. (a), (c): The FDR curves of the

three tests. (b), (d) : The average FDR curves of the PMLE test by using the BH adjustment

and the q-value (dashed) and the FDR curve (solid).

3.2

Real Examples

The proposed test is applied to the real examples in Marioni et al. (2008) and ’t Hoen et al.

(2008). Five technical replicates of the liver and of the kidney tissue samples of a single human

were sequenced via the Illumina NGS platform. In the example from ’t Hoen et al. (2008), four

biological replicates of the wild-type and of the transgenic mice were collected. Two estimated

FDRs, the BH adjusted p-value by Benjamini and Hochberg (1995) and the q-value by Storey

(2003), are employed for the edgeR and our PMLE method. The estimated FDR curves are

presented in Figure 2.

The upper panel of Figure 2 are the FDR curves from Marioni et al. (2008), while (a)

presents the adjusted p-values by Benjamini and Hochberg (1995), and (b) gives the results

(21)

et al. (2008). In general, the q-value, which adapts an estimate of the proportion of the null

hypotheses, is not more conservative than the BH adjusted p-value. We find that using the

BH adjustment, the proposed PMLE test and the edgeR provide a conservative result than the

PoissonSeq as seen in (a) and (c). However, using the q-value, the PMLE test is more liberal

than the edgeR. The PMLE test with the usage of the q-value is comparable with the PoissonSeq

in the first example and gives more discoveries in the second example.

0 5000 10000 15000 0.0 0.2 0.4 0.6 0.8 1.0 Number called FDR pMLE PoissonSeq edgeR,TMM norm (a) 0 5000 10000 15000 0.0 0.1 0.2 0.3 0.4 Number called FDR pMLE PoissonSeq edgeR,TMM norm (b)

0e+00 4e+04 8e+04

0.0 0.2 0.4 0.6 0.8 1.0 Number called FDR pMLE PoissonSeq edgeR,TMM nor (c)

0e+00 4e+04 8e+04

0.0 0.2 0.4 0.6 0.8 1.0 Number called FDR pMLE PoissonSeq edgeR,TMM nor (d)

Figure 2: The estimated FDR curves by PMLE, PoissonSeq and edgeR for two data sets: (up)

the data set from Marione et al. (2008) and (bottom) the data set from ’t Hoen et al. (2008).

Moreover, in (a) and (c), the BH-adjusted p-values are used for PMLE and edgeR. In (b) and

(d), the q-values are used for PMLE and edgeR.

4

Discussion

In this study, we propose a new test based on the pseudo maximum likelihood estimation to

identify the differentially expressed genes in a RNA-Seq data set.

(22)

[1]

Auer, P. L. and Doerge, R. W. (2011) A Two-stage Poisson Model for Testing RNA-Seq

Data, Statistical Applications in Genetics and Molecular Biology, 10, Iss. 1, Article 26.

[2]

Benjamini, Y. and Hochberg, Y. (1995) Controlling the False Discovery Rate: a Practical

and Powerful Approach to Multiple Testing, J. Roy. Statist. Soc. Ser. B, 57, 289-300.

[3]

Marioni, J. C., Mason, C.E., Mane, S. M., Stephens, M. and Gilad, Y. (2008) Rna-seq:

an Assessment of Technical Reproducibility and Comparison with Gene Expression Arrays,

Genome Res., 18, 1509-1517.

[4]

Nakashima, E. (1997) Some Methods for Estimation in a Negative-Binomial Model, Ann.

Inst. Statist. Math., 49, 101-105.

[5]

Li, J., Witten, D. M., Johnstone, I. M. and Tibshirani, R. (2012) Normalization, Testing,

and False Discovery Rate Estimation for RNA-sequencing Data, Biostatistics, 13, 523-538.

[6]

Piegorsch, W. W. (1990) Maximum Likelihood Estimation for the Negative Binomial

Dis-persion Parameter, Biometrics, 46, 863-867.

[7]

Robinson, M. D., McCarthy, D. J. and Smyth, G. K. (2010) edgeR: a Bioconductor Package

for Differential Expression Analysis of Digital Gene Expression Data, Bioinformatics, 26,

139-140.

[8]

Robinson, M. D. and Smyth, G. K. (2007) Moderated Statistical Tests for Assessing

Dif-ferences in Tag Abundance, Bioinformatics, 23, 2881-2887.

[9]

Robinson, M. D. and Smyth, G. K. (2008) Small-sample Estimation of Negative Binomial

Dispersion, with Applications to SAGE Data, Biostatistics, 9, 321-332.

[10]

Storey, J. D. (2003) The Positive False Discovery Rate: a Bayesian Interpretation and the

q-value, Annals of Statistics, 31, 2013-2035.

[11]

’t Hoen, P. A. C., Ariyurek, Y., Thygesen, H. H., Vreugdenhil, E., Vossen, R. H., De

Menezes, R. X., Boer, J. M., Van Ommen, G. J. and Den Dunnen, J. T. (2008) Deep

sequencing-based expression analysis shows major advances in robustness, resolution and

inter-lab portability over five microarray platforms. Nucleic Acids Research, 36, e141.

[12]

Wang, Z., Gerstein, M. and Snyder, M. (2009) RNA-Seq: a Revolutionary Tool for

Tran-scriptomics, Nat. Rev. Genet., 10, 57-63.

(23)

國科會補助計畫衍生研發成果推廣資料表

日期:2013/09/05

國科會補助計畫

計畫名稱: 接受者操作特徵曲線之部份線下面積之極大化 計畫主持人: 薛慧敏 計畫編號: 101-2118-M-004-004- 學門領域: 臨床試驗與製藥統計

無研發成果推廣資料

(24)

101 年度專題研究計畫研究成果彙整表

計畫主持人:

薛慧敏

計畫編號:

101-2118-M-004-004-計畫名稱:

接受者操作特徵曲線之部份線下面積之極大化

量化

成果項目

實際已達成

數(被接受

或已發表)

預期總達成

數(含實際已

達成數)

本計畫實

際貢獻百

分比

單位

備 註

質 化 說

明:如 數 個 計 畫

共 同 成 果、成 果

列 為 該 期 刊 之

封 面 故 事 ...

期刊論文

0

0

100%

研究報告/技術報告

0

0

100%

研討會論文

0

0

100%

論文著作

專書

0

0

100%

申請中件數

0

0

100%

專利

已獲得件數

0

0

100%

件數

0

0

100%

技術移轉

權利金

0

0

100%

千元

碩士生

2

2

100%

博士生

1

1

100%

博士後研究員

0

0

100%

國內

參與計畫人力

(本國籍)

專任助理

0

0

100%

人次

期刊論文

0

1

100%

研究報告/技術報告

1

1

100%

研討會論文

1

1

100%

論文著作

專書

0

0

100%

章/本

申請中件數

0

0

100%

專利

已獲得件數

0

0

100%

件數

0

0

100%

技術移轉

權利金

0

0

100%

千元

碩士生

0

0

100%

博士生

0

0

100%

博士後研究員

0

0

100%

國外

參與計畫人力

(外國籍)

專任助理

0

0

100%

人次

(25)

其他成果

(

無法以量化表達之成

果如辦理學術活動、獲

得獎項、重要國際合

作、研究成果國際影響

力及其他協助產業技

術發展之具體效益事

項等,請以文字敘述填

列。)

成果項目

量化

名稱或內容性質簡述

測驗工具(含質性與量性)

0

課程/模組

0

電腦及網路系統或工具

0

教材

0

舉辦之活動/競賽

0

研討會/工作坊

0

電子報、網站

0

目 計畫成果推廣之參與(閱聽)人數

0

(26)

國科會補助專題研究計畫成果報告自評表

請就研究內容與原計畫相符程度、達成預期目標情況、研究成果之學術或應用價

值(簡要敘述成果所代表之意義、價值、影響或進一步發展之可能性)

、是否適

合在學術期刊發表或申請專利、主要發現或其他有關價值等,作一綜合評估。

1. 請就研究內容與原計畫相符程度、達成預期目標情況作一綜合評估

■達成目標

□未達成目標(請說明,以 100 字為限)

□實驗失敗

□因故實驗中斷

□其他原因

說明:

2. 研究成果在學術期刊發表或申請專利等情形:

論文:□已發表 ■未發表之文稿 □撰寫中 □無

專利:□已獲得 □申請中 ■無

技轉:□已技轉 □洽談中 ■無

其他:(以 100 字為限)

3. 請依學術成就、技術創新、社會影響等方面,評估研究成果之學術或應用價

值(簡要敘述成果所代表之意義、價值、影響或進一步發展之可能性)(以

500 字為限)

1. 學術研究:對於疾病診斷的醫學研究,提供適當統計檢定方法。

2. 社會應用: 可應用於疾病診斷的醫學研究,或其他類似問題上,以獲得更有效率以及

具有優異診斷力的診斷方法。

數據

Table 1 The estimated best linear combination and the corresponding pAUC for the specificity range (0.9,1) in DMD and atherosclerotic coronary heart disease examples.
Figure 1: The top panel gives the FDR curves when the overdispersion does not exist, while the bottom gives the FDR curves when there is an overdispersion
Figure 2: The estimated FDR curves by PMLE, PoissonSeq and edgeR for two data sets: (up) the data set from Marione et al

參考文獻

相關文件

In order to assess and appreciate the results of all these studies, and to promote further research on the Suan Shu Shu, an international Symposium was held on August 23-25

(2007) demonstrated that the minimum β-aberration design tends to be Q B -optimal if there is more weight on linear effects and the prior information leads to a model of small size;

Based on the reformulation, a semi-smooth Levenberg–Marquardt method was developed, and the superlinear (quadratic) rate of convergence was established under the strict

To facilitate the Administrator to create student accounts, a set of procedures is prepared for the Administrator to extract the student accounts from WebSAMS. For detailed

For example, there are the procedures of “Mazu’s establishment of a monastery” and the standards of “Baizhang’s introduction of pure regulations.” These were established to

 Warrants are an instrument which gives investors the right – but not the obligation – to buy or sell the underlying assets at a pre- set price on or before a specified date.

The min-max and the max-min k-split problem are defined similarly except that the objectives are to minimize the maximum subgraph, and to maximize the minimum subgraph respectively..

We try to explore category and association rules of customer questions by applying customer analysis and the combination of data mining and rough set theory.. We use customer