
Gene Analysis Constructed from Outliers (由離群值建構的基因分析)


Academic year: 2021


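Gene Analysis Constructed from Outliers (由離群值建構的基因分析)
Research Results Report (Condensed Version)

Project type: Individual project
Project number: NSC 98-2118-M-009-001-
Project period: August 1, 2009 to July 31, 2010 (ROC years 98 to 99)
Executing institution: Institute of Statistics, National Chiao Tung University
Principal investigator: 陳鄰安 (Lin-An Chen)
Project participants:
Master's students, part-time assistants: 刁瀅潔, 林書維, 鄭秋煒, 陳怡頻, 林洋德
Doctoral student, part-time assistant: 魏裕中
Handling: This project report is open to public inquiry
Date: October 22, 2010 (ROC year 99)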


Report of NSC Project on "Nonparametric Test Based on Outlier Mean for Gene Expression Analysis"

by

Lin-An Chen

Institute of Statistics, National Chiao Tung University

Contents

1. Introduction
2. Research Purpose
3. Literature Review
4. Research Methods
5. Results and Discussions
6. Judgements for Research Results
7. References


Nonparametric Test Based on Outlier Mean for Gene Expression Analysis

1. Introduction

DNA microarray technology, which simultaneously probes thousands of gene expression profiles, has been successfully used in medical research for disease classification (Agrawal et al. (2002), Alizadeh et al. (2000), Ohki et al. (2005), Sorlie et al. (2003)). Among the existing techniques for detecting differential genes, common statistical methods for two-group comparisons, such as the t-test, are not appropriate because of the large number of gene expressions and the limited number of subjects available. Several statistical approaches have been proposed to identify genes for which only a subset of the samples has high expression. Among them, Tomlins et al. (2005) observed that samples of differential genes contain a small number of outliers, and introduced a method called cancer outlier profile analysis, which identifies outlier profiles by a statistic based on the median and the median absolute deviation of a gene expression profile. Following this observation, a sequence of approaches concentrated on detecting differential genes based on outlier samples; Tibshirani and Hastie (2007) and Wu (2007) suggested using an outlier sum, the sum of all the gene expression values in the disease group that are greater than a specified cutoff point. The common disadvantage of these techniques is that the distribution theory of the proposed statistics has not been established, so a distribution-based p value cannot be applied. Recently, Chen, Chen and Chan (2010) considered the outlier mean (the average of the observations entering the outlier sum) and developed a parametric study under the normal distribution. Although this establishes the framework of a test for gene expression analysis based on the outlier mean, little is understood about applying the outlier mean or the outlier sum in nonparametric situations, even though gene expression data are generally non-normal. Hence, in this project, we study nonparametric gene expression analysis.

2. Research Purpose

Our purpose in this research is to establish an outlier mean based nonparametric test that is appropriate for gene expression analysis. First, we show that the outlier mean of Chen, Chen and Chan (2010) is an efficient technique in terms of theoretical power performance. Second, a nonparametric statistical inference procedure may be very efficient in theory yet inefficient in practice when it involves inefficient parameter estimation. We see that the outlier mean based test involves unknown densities at tail quantiles, so its power may be remarkably reduced by inefficient estimation of extreme densities. Third, we propose an alternative design of the outlier mean test that avoids the difficulty of estimating unknown density points.

3. Literature Review

Several manuscripts deal with approaches closely related to outlier observations. Tomlins et al. (2005) observed that samples of differential genes contain a small number of outliers, and introduced a method called cancer outlier profile analysis, which identifies outlier profiles by a statistic based on the median and the median absolute deviation of a gene expression profile. Following this observation, a sequence of approaches concentrated on detecting differential genes based on outlier samples; Tibshirani and Hastie (2007) and Wu (2007) suggested using an outlier sum, the sum of all the gene expression values in the disease group that are greater than a specified cutoff point. Chen, Chen and Chan (2010) developed parametric inferences based on the outlier mean in gene expression, which allow the p value to be formulated from its asymptotic distribution. A nonparametric approach that allows the p value to be formulated is still not available.

4. Research Methods

The outlier mean proposed by Chen, Chen and Chan (2010) is

$$L_Y = \frac{\sum_{i=1}^{n_2} Y_i\, I\{Y_i \ge \hat\eta\}}{\sum_{i=1}^{n_2} I\{Y_i \ge \hat\eta\}},$$

which estimates the population outlier mean

$$\ell_Y = E(Y \mid Y \ge \eta),$$

where the $Y_i$'s are a sample from the disease group and the cutoff point $\hat\eta$ is computed from the normal group data. In this research, we prove that $\sqrt{n_2}\,(L_Y - \ell_Y)$ converges in distribution to a normal random variable with distribution $N(0, \sigma^2_{\ell_Y})$ for an unknown constant $\sigma^2_{\ell_Y}$. Then, under $H_0: F_x = F_y$, we have

$$P_{H_0}\Big\{\frac{\sqrt{n_2}\,(L_Y - \ell_X)}{\sigma_{\ell_Y}} \le z\Big\} \to \int_{-\infty}^{z} \phi(t)\,dt$$

for $z \in R$, where $\phi$ is the probability density function of $N(0,1)$. Here $\ell_X$ appears in the statistic because the sample outlier mean $L_Y$ estimates $\ell_Y$, which is to be compared with $\ell_X$. If $\hat\sigma_{\ell_Y}$ and $\hat\ell_X$ are nonparametric estimates of $\sigma_{\ell_Y}$ and $\ell_X$, respectively, we may define an outlier mean based test as rejecting $H_0$ if

$$\frac{n_2^{1/2}\,(L_Y - \hat\ell_X)}{\hat\sigma_{\ell_Y}} \ge z_\alpha.$$
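The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the project. It assumes that the cutoff is an empirical quantile of the normal group, that the test rejects for large values of the statistic, and that the standard error estimate `sigma_hat` is supplied by the caller, because estimating it nonparametrically is the hard part discussed in the report. The function names and the toy data are hypothetical.

```python
import math
from statistics import NormalDist


def empirical_quantile(sample, p):
    """Order-statistic quantile: the smallest value whose empirical CDF is >= p."""
    xs = sorted(sample)
    k = max(1, math.ceil(p * len(xs)))
    return xs[k - 1]


def outlier_mean(sample, cutoff):
    """Average of the observations at or above the cutoff (None if there are none)."""
    tail = [v for v in sample if v >= cutoff]
    return sum(tail) / len(tail) if tail else None


def outlier_mean_test(x, y, cutoff, sigma_hat, alpha=0.05):
    """Return (statistic, reject) for the one-sided outlier mean test.

    x: normal group sample, y: disease group sample.
    sigma_hat is an assumed, caller-supplied estimate of the asymptotic
    standard deviation; it is not estimated here.
    """
    l_y = outlier_mean(y, cutoff)  # sample outlier mean of the disease group
    l_x = outlier_mean(x, cutoff)  # plug-in estimate for the normal group
    if l_y is None or l_x is None:
        return None  # no observations above the cutoff in one of the groups
    stat = math.sqrt(len(y)) * (l_y - l_x) / sigma_hat
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # upper alpha quantile of N(0, 1)
    return stat, stat >= z_alpha


# Toy data, purely illustrative.
normal_group = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
disease_group = [1, 2, 3, 12, 14]
cutoff = empirical_quantile(normal_group, 0.75)
result = outlier_mean_test(normal_group, disease_group, cutoff, sigma_hat=2.0)
```

In the toy data the cutoff is the 75% empirical quantile of the normal group, so only the two large disease-group values enter the outlier mean.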

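Given this outlier mean based nonparametric test, it is desirable to verify its power when there is a distributional shift in the disease group distribution. An approximate power at significance level $\alpha$ may be derived as follows:

$$P_{F_Y}\Big\{\frac{\sqrt{n_2}\,(L_Y - \hat\ell_X)}{\hat\sigma_{\ell_Y}} \ge z_\alpha\Big\} = P_{F_Y}\Big\{\frac{\sqrt{n_2}\,(L_Y - \ell_Y)}{\sigma_{\ell_Y}} \ge z_\alpha \frac{\hat\sigma_{\ell_Y}}{\sigma_{\ell_Y}} + \frac{\sqrt{n_2}\,(\hat\ell_X - \ell_Y)}{\sigma_{\ell_Y}}\Big\} \approx P\Big\{Z \ge z_\alpha + \frac{\sqrt{n_2}\,(\ell_X - \ell_Y)}{\sigma_{\ell_Y}}\Big\},$$

where $Z$ denotes a standard normal random variable.

In this research, we consider two cutoff points, $\eta_1 = 2F_X^{-1}(1-\beta) - F_X^{-1}(\beta)$ and $\eta_2 = F_X^{-1}(\gamma)$, for studying the power performance of the outlier mean.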
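The normal approximation to the power can be evaluated numerically once the population outlier means and the asymptotic standard deviation are known or assumed. The sketch below is illustrative only: it assumes a one-sided test that rejects for large values of the statistic, and the argument names (`l_x`, `l_y`, `sigma`, `n2`) are hypothetical labels for the normal-group outlier mean, the disease-group outlier mean, the asymptotic standard deviation and the disease-group sample size.

```python
import math
from statistics import NormalDist


def approx_power(l_x, l_y, sigma, n2, alpha=0.05):
    """Approximate power P{Z >= z_alpha + sqrt(n2) * (l_x - l_y) / sigma}."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)   # upper alpha quantile of N(0, 1)
    shift = math.sqrt(n2) * (l_x - l_y) / sigma  # negative when l_y exceeds l_x
    return 1.0 - std_normal.cdf(z_alpha + shift)


# With no shift the approximation returns the significance level;
# a larger disease-group outlier mean raises the power.
power_null = approx_power(l_x=1.0, l_y=1.0, sigma=1.0, n2=25)
power_shift = approx_power(l_x=1.0, l_y=2.0, sigma=1.0, n2=25)
```

The formula makes the dependence on the sample size explicit: for a fixed gap between the two outlier means, the shift term grows like the square root of the disease-group sample size.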

5. Results and Discussions

We have derived the asymptotic variance $\sigma^2_{\ell_Y}$ for the cutoff point $\eta_1 = 2F_X^{-1}(1-\beta) - F_X^{-1}(\beta)$ as

$$\sigma^2_{\ell_Y} = (1-\beta)\big((1-\beta)b_1 - \beta b_2\big)^2 + \beta^2(1-2\beta)^3(b_1 + b_2)^2 + (1-\beta)\big(\beta b_1 - (1-\beta)b_2\big)^2 + \beta_Y^{-2}\,\mathrm{Var}\{(Y - \ell_Y)\, I(Y \ge \eta_1)\},$$

where

$$b_1 = \beta_Y^{-1}(\eta_1 - \ell_Y)\, f_Y(\eta_1)\, \theta^{1/2} f_X^{-1}(F_X^{-1}(\beta)),$$

$$b_2 = -2\beta_Y^{-1}(\eta_1 - \ell_Y)\, f_Y(\eta_1)\, \theta^{1/2} f_X^{-1}(F_X^{-1}(1-\beta)).$$

We have observed that the outlier mean may have satisfactory power performance when consistent estimators $\hat\ell_X$ and $\hat\sigma_{\ell_Y}$ are available to construct a test. However, $\hat\sigma_{\ell_Y}$ involves estimation of $f_Y(2F_X^{-1}(1-\beta) - F_X^{-1}(\beta))$, $f_X(F_X^{-1}(1-\beta))$ and $f_X(F_X^{-1}(\beta))$, and estimating a density function at a tail quantile is extremely difficult in practice. Without an alternative proposal that avoids this density estimation, the outlier mean based test will not be powerful in detecting influential genes, since the sample sizes in gene expression analysis are generally not allowed to be very large.

Hence, we propose an alternative cutoff point $\eta_2 = F_X^{-1}(\gamma)$. The asymptotic variance of the outlier mean with this cutoff point estimator is

$$\sigma^2_{\ell_Y} = \beta_Y^{-2}\big(F_X^{-1}(\gamma) - \ell_Y\big)^2\, \theta_{yx}\, \big(f_Y(F_X^{-1}(\gamma))\, f_X^{-1}(F_X^{-1}(\gamma))\big)^2\, \gamma(1-\gamma) + \beta_Y^{-2}\,\mathrm{Var}\{(Y - F_X^{-1}(\gamma))\, I(Y \ge F_X^{-1}(\gamma))\}.$$


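The power is evaluated under the following distributional settings:

Normal: $X \sim N(0,1)$, $Y \sim N(\mu, \sigma^2)$

Mixed normal I: $X \sim N(0,1)$, $Y \sim 0.9\,N(0,1) + 0.1\,N(\mu, \sigma^2)$

Mixed normal II: $X \sim N(0,1)$, $Y \sim 0.8\,N(0,1) + 0.2\,N(\mu, \sigma^2)$

Laplace distribution: $X \sim \mathrm{Laplace}(0,1)$ and $Y \sim \mathrm{Laplace}(\mu, 1)$

t-distribution: $X \sim t(5)$ and $Y \sim t(5) + \mu$

Case I: $X \sim N(0,1)$ and $Y \sim 0.9\,N(0,1) + 0.1\,(\chi^2(10) + \mu)$

Case II: $X \sim t(10)$ and $Y \sim 0.9\,t(10) + 0.1\,(\chi^2(10) + \mu)$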
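For readers who want to experiment with these settings, the following sketch draws observations from each of them using only the Python standard library. It is an illustration under stated assumptions rather than the simulation code behind the report: the weighted sums are read as mixture distributions (each observation comes from the contaminating component with probability 0.1 or 0.2), the default `sigma=1.0` is an arbitrary choice, and the setting names such as `"mixed_normal_1"` are hypothetical labels.

```python
import math
import random


def chi_square(rng, df):
    """Chi-square(df) drawn as Gamma(shape=df/2, scale=2)."""
    return rng.gammavariate(df / 2.0, 2.0)


def student_t(rng, df):
    """Student t(df) drawn as N(0, 1) / sqrt(chi-square(df) / df)."""
    return rng.gauss(0.0, 1.0) / math.sqrt(chi_square(rng, df) / df)


def laplace(rng, loc, scale=1.0):
    """Laplace(loc, scale): the difference of two Exp(1) draws is Laplace(0, 1)."""
    return loc + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def draw_x(rng, setting):
    """One observation from the normal (reference) group."""
    if setting in ("normal", "mixed_normal_1", "mixed_normal_2", "case_1"):
        return rng.gauss(0.0, 1.0)
    if setting == "laplace":
        return laplace(rng, 0.0)
    if setting == "t":
        return student_t(rng, 5)
    if setting == "case_2":
        return student_t(rng, 10)
    raise ValueError(f"unknown setting: {setting}")


def draw_y(rng, setting, mu, sigma=1.0):
    """One observation from the disease group with location parameter mu."""
    if setting == "normal":
        return rng.gauss(mu, sigma)
    if setting == "mixed_normal_1":
        return rng.gauss(mu, sigma) if rng.random() < 0.1 else rng.gauss(0.0, 1.0)
    if setting == "mixed_normal_2":
        return rng.gauss(mu, sigma) if rng.random() < 0.2 else rng.gauss(0.0, 1.0)
    if setting == "laplace":
        return laplace(rng, mu)
    if setting == "t":
        return student_t(rng, 5) + mu
    if setting == "case_1":
        return chi_square(rng, 10) + mu if rng.random() < 0.1 else rng.gauss(0.0, 1.0)
    if setting == "case_2":
        return chi_square(rng, 10) + mu if rng.random() < 0.1 else student_t(rng, 10)
    raise ValueError(f"unknown setting: {setting}")


rng = random.Random(2010)  # fixed seed so the draws are reproducible
x_sample = [draw_x(rng, "mixed_normal_1") for _ in range(50)]
y_sample = [draw_y(rng, "mixed_normal_1", mu=3.0) for _ in range(50)]
```

Samples drawn this way can be passed to any implementation of the outlier mean test to estimate its rejection rate by Monte Carlo for a chosen cutoff and location parameter.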

6. Judgements for Research Results

We have several comments on the computed results in the paper:

1. The power increases as the location parameter increases, indicating that the outlier mean is more efficient in detecting a distributional shift when the outliers are more widely spread.

2. For location shift models (Normal, Laplace and t distributions), the outlier mean with a cutoff point of larger percentage is more powerful. Hence, choosing a smaller cutoff point (larger percentage) is advisable in applications.

3. For a distributional shift affecting only a small proportion (Mixed normal), the outlier mean with a smaller percentage is more powerful. Hence, choosing a larger cutoff point (smaller percentage) is advisable.

7. References

Agrawal, D., Chen, T., Irby, R., et al. (2002). Osteopontin identified as lead marker of colon cancer progression, using pooled sample expression profiling. J. Natl. Cancer Inst., 94, 513-521.

Alizadeh, A. A., Eisen, M. B., Davis, R. E., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.

Chen, L.-A., Chen, Dung-Tsa and Chan, Wenyaw. (2010). The p value for the outlier sum in differential gene expression analysis. Biometrika, 97, 246-253.

Chen, L.-A. and Chiang, Y. C. (1996). Symmetric type quantile and trimmed means for location and linear regression model. Journal of Nonparametric Statistics, 7, 171-185.

Hoaglin, D. C., Mosteller, F. and Tukey, J. W. (1983). Understanding Robust and Exploratory Data Analysis. Wiley: New York.

Ohki, R., Yamamoto, K., Ueno, S., et al. (2005). Gene expression profiling of human atrial myocardium with atrial fibrillation by DNA microarray analysis. Int. J. Cardiol., 102, 233-238.

Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75, 828-838.

Sorlie, T., Tibshirani, R., Parker, J., et al. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci. U.S.A., 100, 8418-8423.

Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics, 8, 2-8.

Tomlins, S. A., Rhodes, D. R., Perner, S., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 310, 644-648.

Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics, 8, 566-575.


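Principal investigator: 陳鄰安 (Lin-An Chen)
Project number: 98-2118-M-009-001-
Project title: Gene Analysis Constructed from Outliers (由離群值建構的基因分析)

Quantitative results. Each entry lists: number actually achieved (accepted or published) / expected total (including the number actually achieved) / actual contribution of this project. The remarks column of the form (qualitative notes, such as results shared among several projects or results featured as a journal cover story) is empty.

Domestic
Publications: journal papers 0 / 0 / 100%; research or technical reports 0 / 0 / 100%; conference papers 0 / 0 / 100% (unit: papers); books 0 / 0 / 100%
Patents: applications pending 0 / 0 / 100%; patents granted 0 / 0 / 100% (unit: items)
Technology transfer: cases 0 / 0 / 100% (unit: items); royalties 0 / 0 / 100% (unit: thousand NT dollars)
Project personnel (domestic nationals): master's students 5 / 0 / 100%; doctoral students 1 / 0 / 100%; postdoctoral researchers 0 / 0 / 100%; full-time assistants 0 / 0 / 100% (unit: person-times)

International
Publications: journal papers 0 / 1 / 100%; research or technical reports 0 / 1 / 100%; conference papers 0 / 0 / 100% (unit: papers); books 0 / 0 / 100% (unit: chapters or volumes)
Patents: applications pending 0 / 0 / 100%; patents granted 0 / 0 / 100% (unit: items)
Technology transfer: cases 0 / 0 / 100% (unit: items); royalties 0 / 0 / 100% (unit: thousand NT dollars)
Project personnel (foreign nationals): master's students 0 / 0 / 100%; doctoral students 0 / 0 / 100%; postdoctoral researchers 0 / 0 / 100%; full-time assistants 0 / 0 / 100% (unit: person-times)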


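Other results (results that cannot be expressed quantitatively, such as organizing academic activities, receiving awards, important international cooperation, international impact of the research results, and other concrete benefits in assisting industrial technology development; to be described in text): none listed.

Result item: quantity (the column for name or brief description is empty)
Assessment tools (qualitative and quantitative): 0
Courses or modules: 0
Computer and network systems or tools: 0
Teaching materials: 0
Activities or competitions held: 0
Seminars or workshops: 0
E-newsletters and websites: 0
Number of participants (audience) in the promotion of project results: 0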


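Please give an overall assessment of the degree to which the research content matches the original plan, the achievement of the expected goals, the academic or applied value of the research results (briefly describe their significance, value, impact, or possibilities for further development), whether the results are suitable for publication in an academic journal or for a patent application, and the main findings or other relevant value.

1. Overall assessment of the degree to which the research content matches the original plan and of the achievement of the expected goals:
■ Goals achieved
□ Goals not achieved (please explain, within 100 characters)
□ Experiment failed
□ Experiment interrupted for some reason
□ Other reasons
Explanation:

2. Publication of the research results in academic journals, patent applications, and so on:
Papers: □ Published ■ Unpublished manuscript □ In preparation □ None
Patents: □ Granted □ Pending ■ None
Technology transfer: □ Transferred □ Under negotiation ■ None
Other: (within 100 characters)

3. Assessment of the academic or applied value of the research results in terms of academic achievement, technical innovation, and social impact (briefly describe their significance, value, impact, or possibilities for further development; within 500 characters):
This project proposes a nonparametric analysis for gene selection based on the outlier mean. Because gene data have been shown to be mostly non-normal, and because the associate editor stated, during the review of the principal investigator's previously published article, that analysis under non-normality is very important, this method will have considerable applied value.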
