Outlier Variance for Gene Expression Analysis


Academic year: 2021


Institute of Statistics

Master's Thesis

Outlier Variance for Gene Expression Analysis

Student: Ciou-Wei Jheng (鄭秌煒)

Advisor: Dr. Lin-An Chen (陳鄰安)

A Thesis Submitted to the Institute of Statistics, College of Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Statistics

June 2010

Hsinchu, Taiwan, Republic of China

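Abstract

In gene expression research, it is important to find outlier samples among the observations of disease-related genes. The outlier sum and the outlier mean can detect the central tendency of the outlier distribution, but they cannot detect other characteristics such as dispersion. We propose the outlier variance as a tool for gene analysis. We derive the large sample distribution of the outlier variance and build a test from it, and we compare the power of this test with that of the outlier mean test. In addition, we propose an outlier variance constructed from quantiles, which avoids the difficulty of estimating probability densities.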

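Acknowledgements

I thank the teachers of the Institute for their instruction over the past two years, and especially Professor Lin-An Chen, whose guidance made it possible to complete this thesis. From him I learned not only how to solve problems but also a passion for research and teaching. I will take him as my example and carry on with enthusiasm and a spirit of inquiry. I also thank the administrative staff of the Institute for their help when I was new to this environment, and my classmates, whose mutual learning and competition in coursework benefited me greatly. I am glad to have spent two years at the Institute of Statistics with my teachers and classmates.

Finally, I thank my parents for providing a stable life that let me concentrate on my studies without other worries. With my parents' support, my teachers' instruction, and my classmates' care, I was able to complete my studies smoothly.

Ciou-Wei Jheng
Institute of Statistics, National Chiao Tung University
June 2010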

Contents

Abstract (in Chinese)

Acknowledgements (in Chinese)

1. Introduction

2. Population Outlier Variances

3. Sample Outlier Variance Based Gene Expression Analysis

4. Power Performance Evaluation

5. Outlier Variance With Quantile Based Cutoff Point

6. Simulation study

7. Appendix

List of Tables

Table 1. Table of parameter θ making outlier means equal

Table 2. Ratio of population outlier variances

Table 3. Power performances of outlier mean and outlier variance tests

Table 4. Outlier mean ratio and outlier standard deviation ratio

Table 5. Power performances of outlier mean and outlier variance tests

Table 6. Power performance for outlier variance B test

Table 7. Power performances of new outlier mean and outlier variance tests

Table 8. Power performances of new outlier variance test

Table 9. Power performances of new outlier variance test with the implemented estimate

Table 10. Power performance comparison by simulation (Case I)

Outlier Variance for Gene Expression Analysis

SUMMARY

Discovering the existence of outliers in samples of influential genes is a new and important approach for gene expression analysis. The outlier sum or outlier mean technique can detect a shift in the central tendency of the outlier data, but not other characteristics such as spread. We propose the outlier variance, which measures the spread of the outlier data, as an alternative tool for gene expression analysis. Large sample theory for this outlier variance is developed, and a test based on it is compared with the outlier mean test in terms of power. Estimating the asymptotic variance of the sample outlier variance requires estimating densities at tail quantiles, which is inefficient. To avoid this, we further use the empirical quantile function as the sample cutoff point to propose an alternative outlier variance based test.

1. Introduction

DNA microarray technology, which simultaneously probes thousands of gene expression profiles, has been successfully used in medical research for disease classification (Agrawal et al. (2002), Alizadeh et al. (2000), Ohki et al. (2005), Sorlie et al. (2003)). Among the existing techniques for detecting differential genes, common statistical methods for two-group comparisons, such as the t-test, are not appropriate because of the large number of gene expressions and the limited number of subjects available. Several statistical approaches have been proposed to identify genes where only a subset of the samples has high expression. Among them, Tomlins et al. (2005) observed that there is a small number of outliers in samples of differential genes and introduced a method called cancer outlier profile analysis, which identifies outlier profiles by a statistic based on the median and the median absolute deviation of a gene expression profile. Following this observation, a sequence of approaches concentrated on detecting differential genes based on outlier samples. Tibshirani and Hastie (2007) and Wu (2007) suggested using an outlier sum, the sum of all the gene expression values in the disease group that are greater than a specified cutoff point. The common disadvantage of these techniques is that the distribution theory of the proposed methods has not been developed, so a distribution based p value cannot be applied. Recently Chen, Chen and Chan (2009) proposed a new version of the outlier sum and its corresponding outlier mean and developed its large sample theory, which allows the p value to be formulated from the asymptotic distribution. Specifically, they considered a parametric study under the normal distribution and performed simulation studies and data analysis for gene expression analysis.

According to Tomlins et al. (2005), gene expression analysis should verify whether the distributions of the disease group and the normal group are identical on the region exceeding a specified cutoff point. The outlier mean approach of Chen, Chen and Chan (2010) can detect whether the means on this excess region differ. Summarizing the outlier data by the outlier mean or outlier sum may be efficient when the central tendencies of the two outlier distributions on the excess region are strongly different. However, detecting only a shift in the distributional mean is not enough when a more general distributional shift exists, so characteristics other than the central tendency of the outlier data are needed as alternatives for detecting influential genes. In this paper, we consider the outlier variance, which detects a shift in distributional spread or dispersion, as such an alternative. Interestingly, this study shows that the outlier variance test can be much more efficient than the outlier mean test in detecting influential genes.

In Section 2, we introduce the population outlier variance as a characteristic for detecting distributional shift. In Section 3, we study the large sample properties of the sample outlier variance and, in Section 4, we compare the power of the tests based on the outlier mean and the outlier variance. In Section 5, we propose an alternative outlier variance based test that avoids estimating densities and extreme quantiles when computing the test statistic.


2. Population Outlier Variances

Let $X$ and $Y$ be the expression variables for the group of normal subjects and the group of disease subjects, respectively, with distribution functions $F_X$ and $F_Y$. In a study that consists of $n_1$ subjects in the normal control group and $n_2$ subjects in the disease group, suppose that there are $m$ genes to be investigated. Their gene expressions can be represented as $X_{ij}$, $i = 1, 2, \ldots, n_1$, $j = 1, \ldots, m$, for the normal control group and $Y_{ij}$, $i = 1, 2, \ldots, n_2$, $j = 1, 2, \ldots, m$, for the disease group.

An important observation by Tomlins et al. (2005), from a study of prostate cancer, is that outlier genes are over-expressed only in a small number of disease samples. With a cutoff point $\hat\eta$ determined from the data of the variable $X$, Tibshirani and Hastie (2007) and Wu (2007) considered the sum of the variables $Y_i$ that exceed the cutoff point $\hat\eta$, given by $\sum_{i=1}^{n_2} Y_i I(Y_i \ge \hat\eta)$, as a test statistic for detecting whether the disease group distribution differs from the normal group distribution. Later Chen, Chen and Chan (2010) developed the asymptotic distribution of its average, called the outlier mean,

$$\bar Y_{out} = \Big(\sum_{i=1}^{n_2} I(Y_i \ge \hat\eta)\Big)^{-1} \sum_{i=1}^{n_2} Y_i I(Y_i \ge \hat\eta),$$

for constructing a distribution based p value. Let $\eta$ be the population counterpart of the sample cutoff point $\hat\eta$. The idea in this series of studies is to verify whether the unknown population outlier means

$$\mu_{Xout} = E(X \mid X \ge \eta) \quad \text{and} \quad \mu_{Yout} = E(Y \mid Y \ge \eta) \tag{2.1}$$

are the same. From now on, we take the population cutoff point to be of the form $\eta = 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)$. To motivate the outlier variance approach, we show that testing equality of outlier means is not sufficient for verifying equality of two distributions on the excess region.
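The sample cutoff, outlier sum and outlier mean described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the toy data are hypothetical, the empirical quantile is taken to be the smallest order statistic whose empirical distribution value reaches the requested level, and an observation counts as an outlier when it is at least as large as the cutoff.

```python
import math


def empirical_quantile(sample, p):
    """Smallest order statistic x with empirical CDF(x) >= p (assumed convention)."""
    ordered = sorted(sample)
    # The small tolerance guards against floating-point products such as 0.1 * 30.
    k = max(1, math.ceil(len(ordered) * p - 1e-9))
    return ordered[k - 1]


def sample_cutoff(x, alpha):
    """Sample cutoff: 2 * Fhat^{-1}(1 - alpha) - Fhat^{-1}(alpha), computed from x."""
    return 2.0 * empirical_quantile(x, 1.0 - alpha) - empirical_quantile(x, alpha)


def outlier_sum_and_mean(x, y, alpha):
    """Outlier sum and outlier mean of y beyond the cutoff determined by x."""
    eta_hat = sample_cutoff(x, alpha)
    outliers = [v for v in y if v >= eta_hat]
    total = sum(outliers)
    mean = total / len(outliers) if outliers else float("nan")
    return eta_hat, total, mean


# Hypothetical toy data: 20 control values and 5 disease values.
x_control = list(range(1, 21))
y_disease = [10, 26, 30, 34, 3]
eta_hat, out_sum, out_mean = outlier_sum_and_mean(x_control, y_disease, 0.25)
```

With these toy values the quartiles of the control sample are 5 and 15, so the cutoff is 25 and only the three largest disease values enter the outlier sum.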

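Consider the following distribution setting:

$$X \sim N(0,1), \qquad Y \sim 0.9\,N(\mu,1) + 0.1\,N(\theta,\sigma^2). \tag{2.2}$$

For given $\alpha$ and $\mu$, the following table displays the value of $\theta$ that induces $\mu_{Xout} = \mu_{Yout}$.

Table 1. Values of the parameter $\theta$ making $\mu_{Xout} = \mu_{Yout}$ when $\sigma = 0.5$

| $\alpha$ | $\mu = -0.1$ | $\mu = -0.15$ | $\mu = -0.20$ | $\mu = -0.25$ | $\mu = -0.5$ |
|---|---|---|---|---|---|
| 0.05 | 3.952 | 3.952 | 3.952 | 3.952 | 3.952 |
| 0.1 | 3.182 | 3.182 | 3.182 | 3.182 | 3.182 |
| 0.2 | 2.278 | 2.280 | 2.280 | 2.281 | 2.280 |
| 0.3 | 1.674 | 1.682 | 1.688 | 1.692 | 1.693 |
| 0.4 | 1.227 | 1.254 | 1.275 | 1.291 | 1.321 |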

The existence of parameter values giving equal population outlier means shows that the outlier mean approach cannot handle this problem, even though the two distributions on the region exceeding $\eta$ are definitely unequal.

Since detecting a difference in outlier means is not enough, a natural alternative is to make inference on the population outlier variance

$$\sigma^2_{Yout} = \beta_Y^{-1} E\{(Y - \mu_{Yout})^2 I(Y \ge \eta)\} \tag{2.2}$$

for the variable $Y$, where $\beta_Y = P(Y \ge \eta)$. This measures the degree to which outlier observations are (or are not) clustered around the outlier mean $\mu_{Yout}$. As a truncated measure of dispersion, it does not take every observation into account. The idea behind this approach is to verify whether $\sigma^2_{Yout}$ differs from the population outlier variance for the variable $X$,

$$\sigma^2_{Xout} = \beta_X^{-1} E\{(X - \mu_{Xout})^2 I(X \ge \eta)\}, \tag{2.3}$$

where $\beta_X = P(X \ge \eta)$.

We design situations in which the population outlier means are identical, in order to compare the corresponding population outlier variances. For the following distribution setting:

$$X \sim N(0,1) \quad \text{and} \quad Y \sim 0.9\,N(-0.1,1) + 0.1\,N(\theta,\sigma^2), \tag{2.4}$$

we choose the parameter $\theta$, for each $\alpha$ and $\sigma^2$, so that the corresponding outlier means are identical, and then compute the ratios $\sigma^2_{Xout}/\sigma^2_{Yout}$. The resulting ratios of population outlier variances are listed in Table 2.
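The construction behind Table 2 can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' computation: it uses the closed-form first and second moments of a normal variable restricted to an upper tail, solves for $\theta$ by bisection, and assumes that the root lies between the cutoff $\eta$ and $\eta + 5$. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_sf(z):
    """Upper tail P(Z >= z); erfc keeps accuracy far in the tail."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tail_moments(components, cutoff):
    """(beta, outlier mean, outlier variance) of a normal mixture on [cutoff, inf).

    components is a list of (weight, mean, sd) triples.
    """
    beta = m1 = m2 = 0.0
    for w, m, s in components:
        z = (cutoff - m) / s
        p, d = norm_sf(z), norm_pdf(z)
        beta += w * p
        m1 += w * (m * p + s * d)  # E[Y I(Y >= cutoff)]
        m2 += w * (m * m * p + 2.0 * m * s * d + s * s * (z * d + p))  # E[Y^2 I(.)]
    mean = m1 / beta
    return beta, mean, m2 / beta - mean * mean


def match_outlier_means(alpha, sigma, mu=-0.1, weight=0.1):
    """Find theta in setting (2.4) with equal outlier means; return (theta, variance ratio)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    eta = 3.0 * z  # 2 F^{-1}(1 - alpha) - F^{-1}(alpha) for the standard normal
    _, mean_x, var_x = tail_moments([(1.0, 0.0, 1.0)], eta)

    def mixture(theta):
        return [(1.0 - weight, mu, 1.0), (weight, theta, sigma)]

    lo, hi = eta, eta + 5.0  # assumed bracket for the root
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if tail_moments(mixture(mid), eta)[1] < mean_x:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    var_y = tail_moments(mixture(theta), eta)[2]
    return theta, var_x / var_y


theta, ratio = match_outlier_means(alpha=0.05, sigma=0.1)
```

For $\alpha = 0.05$ and $\sigma^2 = 0.01$, Table 2 reports $\theta = 5.115$ and a ratio of 3.944, which gives a reference point for this sketch.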

Table 2. Ratio of population outlier variances $\sigma^2_{Xout}/\sigma^2_{Yout}$ (the matching value of $\theta$ is given in parentheses)

| $\alpha$ | $\sigma^2 = 0.01$ | $\sigma^2 = 0.05$ | $\sigma^2 = 0.15$ |
|---|---|---|---|
| 0.05 | 3.944 ($\theta = 5.115$) | 1.710 ($\theta = 4.961$) | 1.258 ($\theta = 4.468$) |
| 0.1 | 5.301 ($\theta = 4.074$) | 1.963 ($\theta = 3.971$) | 1.349 ($\theta = 3.591$) |
| 0.2 | 6.920 ($\theta = 2.845$) | 2.387 ($\theta = 2.798$) | 1.486 ($\theta = 2.557$) |
| 0.3 | 3.016 ($\theta = 2.007$) | 2.010 ($\theta = 1.991$) | 1.414 ($\theta = 1.853$) |
| 0.4 | 1.613 ($\theta = 1.375$) | 1.467 ($\theta = 1.373$) | 1.262 ($\theta = 1.320$) |

When the ratio is 1, the population outlier variances are also identical and there is no chance of detecting a distributional difference through the outlier variance approach. Interestingly, the computed ratios in Table 2, for settings in which the corresponding outlier means are identical, are all larger than 1, indicating that a test based on the outlier variance has an additional chance of detecting a distributional difference.

3. Sample Outlier Variance Based Gene Expression Analysis

Let $\hat F_X^{-1}$ be the empirical quantile function, which estimates the population quantile function $F_X^{-1}$, and estimate the cutoff point $\eta$ by $\hat\eta = 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)$ for some $0 < \alpha < 0.5$. We propose the sample outlier variance

$$S^2_{Yout} = \Big(\sum_{i=1}^{n_2} I\{Y_i \ge 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)\}\Big)^{-1} \sum_{i=1}^{n_2} (Y_i - \bar Y_{out})^2 I\{Y_i \ge 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)\}. \tag{3.1}$$

This statistic, which uses the observations from the disease group that exceed the sample cutoff point, provides a concise summary of the dispersion of the outlier data.
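As a small illustration of (3.1), the following sketch computes the sample outlier variance from two lists of expression values. The function names and data are hypothetical, the empirical quantile convention (smallest order statistic reaching the requested level) is an assumption, and the divisor is the number of outliers, as in (3.1).

```python
import math


def empirical_quantile(sample, p):
    """Smallest order statistic x with empirical CDF(x) >= p (assumed convention)."""
    ordered = sorted(sample)
    k = max(1, math.ceil(len(ordered) * p - 1e-9))
    return ordered[k - 1]


def sample_outlier_variance(x, y, alpha):
    """Sample outlier variance of y beyond the cutoff determined by the control sample x."""
    eta_hat = 2.0 * empirical_quantile(x, 1.0 - alpha) - empirical_quantile(x, alpha)
    outliers = [v for v in y if v >= eta_hat]
    if not outliers:
        raise ValueError("no observation in y reaches the cutoff")
    mean_out = sum(outliers) / len(outliers)  # sample outlier mean
    return sum((v - mean_out) ** 2 for v in outliers) / len(outliers)


# Hypothetical toy data: the cutoff is 2 * 15 - 5 = 25, so the outliers are 26, 30, 34.
x_control = list(range(1, 21))
y_disease = [10, 26, 30, 34, 3]
s2_yout = sample_outlier_variance(x_control, y_disease, 0.25)
```

Here the outlier mean is 30 and the squared deviations are 16, 0 and 16, so the statistic equals 32/3.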

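We now present the asymptotic properties of the outlier variance $S^2_{Yout}$. A Bahadur representation of $S^2_{Yout}$ and its asymptotic distribution are stated in the following theorem.

Theorem 3.1. Suppose that assumptions (A2) and (A3) in the Appendix are true.

(a) A Bahadur representation of the outlier variance is

$$
\begin{aligned}
n_2^{1/2}(S^2_{Yout} - \sigma^2_{Yout})
={}& \big[-(1-\alpha)\,a f_X^{-1}\{F_X^{-1}(\alpha)\} + 2\alpha\,a f_X^{-1}\{F_X^{-1}(1-\alpha)\}\big]\, n_1^{-1/2}\sum_{i=1}^{n_1} I\{X_i \le F_X^{-1}(\alpha)\} \\
&+ \big[\alpha\,a f_X^{-1}\{F_X^{-1}(\alpha)\} + 2\alpha\,a f_X^{-1}\{F_X^{-1}(1-\alpha)\}\big]\, n_1^{-1/2}\sum_{i=1}^{n_1} I\{F_X^{-1}(\alpha) \le X_i \le F_X^{-1}(1-\alpha)\} \\
&+ \big[\alpha\,a f_X^{-1}\{F_X^{-1}(\alpha)\} - 2(1-\alpha)\,a f_X^{-1}\{F_X^{-1}(1-\alpha)\}\big]\, n_1^{-1/2}\sum_{i=1}^{n_1} I\{X_i \ge F_X^{-1}(1-\alpha)\} \\
&+ \beta_Y^{-1} n_2^{-1/2}\sum_{i=1}^{n_2} \{(Y_i - \mu_{Yout})^2 - \sigma^2_{Yout}\}\, I(Y_i \ge \eta) + o_p(1),
\end{aligned}
$$

where $a = \beta_Y^{-1}\{(\eta - \mu_{Yout})^2 - \sigma^2_{Yout}\} f_Y(\eta)\,\lambda_{xy}^{1/2}$.

(b) $n_2^{1/2}(S^2_{Yout} - \sigma^2_{Yout})$ converges in distribution to $N(0, v_Y)$, where

$$
\begin{aligned}
v_Y ={}& \alpha\big[-(1-\alpha)\,a f_X^{-1}\{F_X^{-1}(\alpha)\} + 2\alpha\,a f_X^{-1}\{F_X^{-1}(1-\alpha)\}\big]^2 \\
&+ (1-2\alpha)\big[\alpha\,a f_X^{-1}\{F_X^{-1}(\alpha)\} + 2\alpha\,a f_X^{-1}\{F_X^{-1}(1-\alpha)\}\big]^2 \\
&+ \alpha\big[\alpha\,a f_X^{-1}\{F_X^{-1}(\alpha)\} - 2(1-\alpha)\,a f_X^{-1}\{F_X^{-1}(1-\alpha)\}\big]^2 \\
&+ \beta_Y^{-2} E\big[\{(Y - \mu_{Yout})^2 - \sigma^2_{Yout}\}^2 I(Y \ge \eta)\big].
\end{aligned}
$$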

By Theorem 3.1, the variable

$$n_2^{1/2} v_Y^{-1/2}(S^2_{Yout} - \sigma^2_{Yout})$$

converges to $N(0,1)$ in distribution. To test whether the distributions of $Y$ and $X$ differ through the outlier variance, we compare the two outlier variances $\sigma^2_{Yout}$ and $\sigma^2_{Xout}$. We therefore choose an estimate $\hat\sigma^2_{Xout}$ of $\sigma^2_{Xout}$ and an estimate $\hat v$ to form the test statistic

$$n_2^{1/2}\hat v^{-1/2}(S^2_{Yout} - \hat\sigma^2_{Xout}). \tag{3.2}$$

There are, however, two choices for $\hat v$: it can be an estimate of $v_Y$ or an estimate of $v_X$, with

$$
\begin{aligned}
v_X ={}& \alpha\big[-(1-\alpha)\,a f_X^{-1}\{F_X^{-1}(\alpha)\} + 2\alpha\,a f_X^{-1}\{F_X^{-1}(1-\alpha)\}\big]^2 \\
&+ (1-2\alpha)\big[\alpha\,a f_X^{-1}\{F_X^{-1}(\alpha)\} + 2\alpha\,a f_X^{-1}\{F_X^{-1}(1-\alpha)\}\big]^2 \\
&+ \alpha\big[\alpha\,a f_X^{-1}\{F_X^{-1}(\alpha)\} - 2(1-\alpha)\,a f_X^{-1}\{F_X^{-1}(1-\alpha)\}\big]^2 \\
&+ \beta_X^{-2} E\big[\{(X - \mu_{Xout})^2 - \sigma^2_{Xout}\}^2 I(X \ge \eta)\big].
\end{aligned}
$$

Hence there are two tests available based on the outlier variance: rejecting $H_0$ if

$$n_2^{1/2}\hat v_Y^{-1/2}(S^2_{Yout} - \hat\sigma^2_{Xout}) \ge z \tag{3.3}$$

and rejecting $H_0$ if

$$n_2^{1/2}\hat v_X^{-1/2}(S^2_{Yout} - \hat\sigma^2_{Xout}) \ge z, \tag{3.4}$$

where $\hat v_Y$ and $\hat v_X$ are, respectively, estimates of $v_Y$ and $v_X$, and $z$ is the upper critical point of the standard normal distribution at the chosen significance level.

But how good are these two tests? An important part of the evaluation is to verify their power when positive outliers exist in the data of the disease group.

4. Power Performance Evaluation

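We evaluate the power of test (3.3) for several distributional settings. An approximate power function for this test may be derived as follows:

$$
\begin{aligned}
p_{v_Y} &= P_{F_Y}\{n_2^{1/2}\hat v_Y^{-1/2}(S^2_{Yout} - \hat\sigma^2_{Xout}) \ge z\} \\
&= P_{F_Y}\{n_2^{1/2} v_Y^{-1/2}(S^2_{Yout} - \sigma^2_{Yout}) \ge v_Y^{-1/2}(z\hat v_Y^{1/2} + n_2^{1/2}(\hat\sigma^2_{Xout} - \sigma^2_{Yout}))\} \\
&\approx P\{Z \ge v_Y^{-1/2}(z\hat v_Y^{1/2} + n_2^{1/2}(\hat\sigma^2_{Xout} - \sigma^2_{Yout}))\} \\
&\approx P\{Z \ge z + n_2^{1/2} v_Y^{-1/2}(\sigma^2_{Xout} - \sigma^2_{Yout})\}.
\end{aligned}
\tag{4.1}
$$

Similarly, the test (3.4) has the approximate power

$$p_v \approx P\Big\{Z \ge z\Big(\frac{v_X}{v_Y}\Big)^{1/2} + n_2^{1/2} v_Y^{-1/2}(\sigma^2_{Xout} - \sigma^2_{Yout})\Big\}. \tag{4.2}$$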
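The approximations (4.1) and (4.2) are straightforward to evaluate once the population quantities are available. The sketch below is illustrative only: the function name and the numerical inputs are hypothetical, and $z$ is taken to be the upper critical point of the standard normal distribution at an assumed level of 0.05.

```python
import math
from statistics import NormalDist


def approx_power(sig2_x, sig2_y, v_y, n2, level=0.05, v_x=None):
    """Approximate power of the right-sided outlier variance test.

    With v_x=None this evaluates (4.1), the test standardised by v_Y.
    With v_x given it evaluates (4.2), the test standardised by v_X.
    """
    std = NormalDist()
    z = std.inv_cdf(1.0 - level)  # assumed meaning of z in (3.3) and (3.4)
    scale = 1.0 if v_x is None else math.sqrt(v_x / v_y)
    threshold = z * scale + math.sqrt(n2) * (sig2_x - sig2_y) / math.sqrt(v_y)
    return 1.0 - std.cdf(threshold)


# Hypothetical inputs: disease-group outlier variance twice the control value.
power_41 = approx_power(sig2_x=1.0, sig2_y=2.0, v_y=4.0, n2=30)
power_42 = approx_power(sig2_x=1.0, sig2_y=2.0, v_y=4.0, n2=30, v_x=1.0)
```

When the two outlier variances are equal and $v_X$ is not supplied, the function returns the nominal level, which is a quick sanity check on the sign conventions.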

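We are now ready to study the asymptotic power in comparison with the outlier mean. The following distributional settings are considered:

Normal: $X \sim N(0,1)$, $Y \sim N(\theta, \sigma^2)$

Mixed normal: $X \sim N(0,1)$, $Y \sim 0.9\,N(0,1) + 0.1\,N(\theta, \sigma^2)$

Mixed $\chi^2$: $X \sim N(0,1)$, $Y \sim 0.9\,N(0,1) + 0.1\,(\chi^2(10) + \theta)$

Here $p_m$ and $p_v$ represent, respectively, the powers of the outlier mean and outlier variance tests.

Table 3. Power performances of outlier mean and outlier variance tests

| Distribution | $\alpha$ | $p_m$, $\theta = 2$ | $p_m$, $\theta = 4$ | $p_m$, $\theta = 6$ | $p_v$, $\theta = 2$ | $p_v$, $\theta = 4$ | $p_v$, $\theta = 6$ |
|---|---|---|---|---|---|---|---|
| Normal | 0.1 | 0.700 | 0.968 | 1 | 0.044 | 0.161 | 0.517 |
| Normal | 0.2 | 0.926 | 0.999 | 1 | 0.918 | 0.987 | 0.998 |
| Normal | 0.3 | 0.984 | 1 | 1 | 0.980 | 0.998 | 0.999 |
| Normal | 0.4 | 0.996 | 1 | 1 | 0.992 | 0.999 | 0.999 |
| Mixed normal | 0.1 | 0.232 | 0.421 | 0.670 | 0.291 | 0.372 | 0.506 |
| Mixed normal | 0.2 | 0.282 | 0.492 | 0.726 | 0.681 | 0.780 | 0.852 |
| Mixed normal | 0.3 | 0.212 | 0.324 | 0.434 | 0.755 | 0.864 | 0.937 |
| Mixed normal | 0.4 | 0.177 | 0.253 | 0.316 | 0.762 | 0.860 | 0.925 |
| Mixed $\chi^2$ | 0.1 | 0.924 | 0.985 | 0.998 | 0.767 | 0.768 | 0.768 |
| Mixed $\chi^2$ | 0.2 | 0.929 | 0.975 | 0.991 | 0.838 | 0.831 | 0.818 |
| Mixed $\chi^2$ | 0.25 | 0.773 | 0.823 | 0.854 | 0.881 | 0.880 | 0.874 |
| Mixed $\chi^2$ | 0.4 | 0.383 | 0.397 | 0.407 | 0.945 | 0.965 | 0.978 |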

(a) The power increases as the location parameter $\theta$ increases, indicating that the more extreme the outliers are, the more efficient the outlier mean and the outlier variance are in detecting a distributional difference.

(b) Consider the location shift models (Normal, Laplace and t distributions). The outlier means and outlier variances with a cutoff point of larger percentage $\alpha$ are relatively more powerful. Hence, choosing a smaller cutoff point (larger $\alpha$) is advisable in applications when there is a difference in the location parameter. However, in these distributional settings, the outlier variance with small $\alpha$ (0.1) is not a powerful one.

(c) For a distributional difference affecting only a small proportion of sample points (Mixed normal), the outlier mean is inefficient, with small powers, for all percentages. However, the outlier variances are relatively more powerful, especially for larger $\alpha$'s.

(d) In an overall comparison, since no specific distribution is known in nonparametric hypothesis testing and only a small proportion of outliers is expected in the influential genes, the outlier variance with a cutoff point of $\alpha$ larger than 0.25 is recommended.

To verify the above conclusions, we consider the mixed normal distribution case with $\sigma = 3$ and compute the following ratios:

$$\rho_m = \mu_{Xout}^{-1}\,\mu_{Yout}, \qquad \rho_v = \sigma_{Xout}^{-1}\,\sigma_{Yout}.$$

Table 4. Outlier mean ratio and outlier standard deviation ratio

| $\alpha$ | $\rho_m$, $\theta = 2$ | $\rho_m$, $\theta = 4$ | $\rho_m$, $\theta = 6$ | $\rho_v$, $\theta = 2$ | $\rho_v$, $\theta = 4$ | $\rho_v$, $\theta = 6$ |
|---|---|---|---|---|---|---|
| 0.05 | 1.273 | 1.370 | 1.514 | 7.369 | 9.003 | 10.98 |
| 0.1 | 1.391 | 1.543 | 1.767 | 6.744 | 8.251 | 9.972 |
| 0.2 | 1.593 | 1.880 | 2.278 | 5.821 | 7.178 | 8.622 |
| 0.3 | 1.549 | 1.931 | 2.420 | 4.600 | 6.163 | 7.912 |
| 0.4 | 1.430 | 1.770 | 2.192 | 3.102 | 4.379 | 5.867 |

The ratios of population outlier standard deviations are much larger than the corresponding ratios of population outlier means, which helps explain the power gains obtained from the outlier variance test.

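We further consider the following distributional settings for comparison:

Model I: $X \sim \text{Laplace}(0,1)$, $Y \sim 0.9\,\text{Laplace}(0,1) + 0.1\,\text{Laplace}(\theta, 10)$

Model II: $X \sim t(10)$, $Y \sim 0.9\,t(10) + 0.1\,(\chi^2(10) + \theta)$

The results are listed in Table 5.

Table 5. Power performances of outlier mean and outlier variance tests

| Model | $\alpha$ | $p_m$, $\theta = 2$ | $p_m$, $\theta = 4$ | $p_m$, $\theta = 6$ | $p_v$, $\theta = 2$ | $p_v$, $\theta = 4$ | $p_v$, $\theta = 6$ |
|---|---|---|---|---|---|---|---|
| Model I | 0.1 | 0.2289 | 0.260 | 0.3009 | 0.6293 | 0.6418 | 0.6549 |
| Model I | 0.2 | 0.2172 | 0.2511 | 0.2939 | 0.6645 | 0.6799 | 0.6958 |
| Model I | 0.25 | 0.2088 | 0.2403 | 0.277 | 0.673 | 0.6899 | 0.7079 |
| Model I | 0.4 | 0.1971 | 0.2222 | 0.2485 | 0.6851 | 0.7044 | 0.7251 |
| Model II | 0.1 | 0.872 | 0.968 | 0.995 | 0.527 | 0.536 | 0.542 |
| Model II | 0.2 | 0.863 | 0.928 | 0.96 | 0.823 | 0.828 | 0.824 |
| Model II | 0.25 | 0.719 | 0.771 | 0.803 | 0.887 | 0.902 | 0.908 |
| Model II | 0.3 | 0.542 | 0.571 | 0.591 | 0.938 | 0.963 | 0.978 |
| Model II | 0.4 | 0.385 | 0.401 | 0.412 | 0.942 | 0.963 | 0.977 |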

We draw several comments from the results in the above table:

(a) On Model I, neither method is very powerful in detecting influential genes. However, the outlier variance performs much better.

(b) On Model II, both methods are more powerful. The outlier means perform better for smaller $\alpha$'s and the outlier variances perform better for larger $\alpha$'s. This provides guidance for users in choosing an appropriate outlier mean or outlier variance.

(c) In an overall evaluation, the outlier variance method appears quite robust, with powers larger than 0.5 in all the situations of Table 5. This indicates that the outlier variance approach is desirable in applications to gene expression analysis, where the underlying distribution is generally non-normal.

Suppose that an outlier mean test is conducted and results in acceptance of equal outlier means. In Table 2 the ratios $\sigma^2_{Xout}/\sigma^2_{Yout}$ exceed 1, that is, $\sigma^2_{Yout}$ is smaller than $\sigma^2_{Xout}$, so a left-handed one-sided outlier variance test is appropriate in this situation. We propose the following left-handed outlier variance test:

$$\text{rejecting } H_0 \text{ if } n_2^{1/2}\hat v_Y^{-1/2}(S^2_{Yout} - \hat\sigma^2_{Xout}) \le -z.$$

An approximate power function may be derived as

$$p_{vB} \approx P\{Z \le -z + n_2^{1/2} v_Y^{-1/2}(\sigma^2_{Xout} - \sigma^2_{Yout})\}.$$

We list the computed powers for the distributions of (2.4) with $\sigma = 1$ in Table 6.

Table 6. Power performance for the outlier variance B test when outlier means are identical

| | $\mu = -0.01$ | $\mu = -0.1$ | $\mu = -0.5$ | $\mu = -1.0$ | $\mu = -1.5$ |
|---|---|---|---|---|---|
| $p_m$ | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 |
| $p_{vB}$, $\alpha = 0.05$ | 0.352 | 0.352 | 0.352 | 0.352 | 0.352 |
| $p_{vB}$, $\alpha = 0.1$ | 0.970 | 0.971 | 0.973 | 0.974 | 0.974 |
| $p_{vB}$, $\alpha = 0.2$ | 0.820 | 0.917 | 1 | 1 | 1 |
| $p_{vB}$, $\alpha = 0.3$ | 0.233 | 0.284 | 0.686 | 0.999 | 1 |
| $p_{vB}$, $\alpha = 0.4$ | 0.135 | 0.160 | 0.302 | 0.668 | 0.997 |

This observation suggests that, when the null hypothesis of equal outlier means is accepted by the outlier mean based test, the outlier variance should be further tested by the left-handed one-sided test.

5. Outlier Variance With Quantile Based Cutoff Point

The test based on the outlier variance $S^2_{Yout}$ requires estimating the density values $f_X\{F_X^{-1}(\alpha)\}$ and $f_X\{F_X^{-1}(1-\alpha)\}$ (part (b) of Theorem 3.1). There is generally no satisfactory solution for this estimation unless the sample sizes are large enough. Here we consider an alternative design of the cutoff point for a new outlier variance. We let $\eta = F_X^{-1}(\gamma)$ and $\hat\eta = \hat F_X^{-1}(\gamma)$. In the following, we state the large sample theory for this outlier variance.

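Theorem 5.1. Suppose that assumptions (A2) and (A3) in the Appendix are true.

(a) A Bahadur representation of the outlier variance is

$$
\begin{aligned}
n_2^{1/2}(S^2_{Yout} - \sigma^2_{Yout})
={}& -\beta_Y^{-1}\{(F_X^{-1}(\gamma) - \mu_{Yout})^2 - \sigma^2_{Yout}\}\, f_Y(F_X^{-1}(\gamma))\, f_X^{-1}(F_X^{-1}(\gamma))\,(\lambda_{xy})^{1/2}\, n_1^{-1/2}\sum_{i=1}^{n_1} \big(\gamma - I(X_i \le F_X^{-1}(\gamma))\big) \\
&+ \beta_Y^{-1} n_2^{-1/2}\sum_{i=1}^{n_2} \big[(Y_i - \mu_{Yout})^2 - \sigma^2_{Yout}\big]\, I(Y_i \ge F_X^{-1}(\gamma)) + o_p(1).
\end{aligned}
$$

(b) $n_2^{1/2}(S^2_{Yout} - \sigma^2_{Yout})$ converges in distribution to $N(0, v_Y)$, where

$$
\begin{aligned}
v_Y ={}& \gamma(1-\gamma)\,\beta_Y^{-2}\{(F_X^{-1}(\gamma) - \mu_{Yout})^2 - \sigma^2_{Yout}\}^2 \big(f_Y(F_X^{-1}(\gamma))\, f_X^{-1}(F_X^{-1}(\gamma))\big)^2 \lambda_{xy} \\
&+ \beta_Y^{-2} E\big[\{(Y - \mu_{Yout})^2 - \sigma^2_{Yout}\}^2 I(Y \ge F_X^{-1}(\gamma))\big].
\end{aligned}
$$

Under the assumption that $Y$ and $X$ have the same distribution, the asymptotic variance $v_Y$ reduces to

$$v_X = \beta_X^{-2}\Big[\gamma(1-\gamma)\{(F_X^{-1}(\gamma) - \mu_{Xout})^2 - \sigma^2_{Xout}\}^2 \lambda_{xy} + E\big[\{(X - \mu_{Xout})^2 - \sigma^2_{Xout}\}^2 I(X \ge F_X^{-1}(\gamma))\big]\Big].$$

This variance saves the effort of estimating unknown density values. An outlier variance based test may be stated as

$$\text{rejecting } H_0 \text{ if } n_2^{1/2}\hat v_X^{-1/2}(S^2_{Yout} - \hat\sigma^2_{Xout}) \ge z. \tag{5.1}$$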

It is interesting to study the power of this outlier variance based nonparametric test for models in which only a small proportion of the data in the disease group is shifted. As observed by Tomlins et al. (2005), this happens in some regular cancer genes. We first consider the mixed normal distribution.

Table 7. Power performances of the new outlier mean and outlier variance tests for the mixed normal distribution

| | $\gamma$ | $p_m$, $\theta = 2$ | $p_m$, $\theta = 4$ | $p_m$, $\theta = 6$ | $p_v$, $\theta = 2$ | $p_v$, $\theta = 4$ | $p_v$, $\theta = 6$ |
|---|---|---|---|---|---|---|---|
| = 0.1 | 0.8 | 0.523 | 0.701 | 0.805 | 0.764 | 0.864 | 0.930 |
| = 0.1 | 0.85 | 0.541 | 0.709 | 0.809 | 0.764 | 0.866 | 0.936 |
| = 0.1 | 0.9 | 0.557 | 0.716 | 0.812 | 0.762 | 0.868 | 0.941 |
| = 0.1 | 0.95 | 0.563 | 0.711 | 0.810 | 0.754 | 0.861 | 0.934 |
| = 0.2 | 0.8 | 0.710 | 0.866 | 0.936 | 0.871 | 0.957 | 0.991 |
| = 0.2 | 0.85 | 0.710 | 0.863 | 0.935 | 0.867 | 0.955 | 0.990 |
| = 0.2 | 0.9 | 0.705 | 0.857 | 0.933 | 0.860 | 0.950 | 0.986 |
| = 0.2 | 0.95 | 0.679 | 0.837 | 0.924 | 0.840 | 0.932 | 0.971 |

Using a single quantile to construct the cutoff point still shows the advantage of the outlier variance approach. Also, comparing with the results in Table 3, constructing the cutoff point from a single quantile is competitive with constructing it from a combination of quantiles.

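We further consider the following distribution settings:

Case I: $X \sim N(0,1)$ and $Y \sim 0.9\,N(0,1) + 0.1\,(\chi^2(10) + \theta)$

Case II: $X \sim t(10)$ and $Y \sim 0.9\,t(10) + 0.1\,(\chi^2(10) + \theta)$

The results are displayed in Table 8.

Table 8. Power performances of the new outlier mean and outlier variance tests

| Case | $\gamma$ | $p_m$, $\theta = 2$ | $p_m$, $\theta = 4$ | $p_m$, $\theta = 6$ | $p_v$, $\theta = 2$ | $p_v$, $\theta = 4$ | $p_v$, $\theta = 6$ |
|---|---|---|---|---|---|---|---|
| Case I | 0.8 | 0.895 | 0.907 | 0.916 | 0.950 | 0.971 | 0.983 |
| Case I | 0.85 | 0.896 | 0.908 | 0.916 | 0.955 | 0.976 | 0.989 |
| Case I | 0.9 | 0.897 | 0.909 | 0.917 | 0.957 | 0.979 | 0.991 |
| Case I | 0.95 | 0.899 | 0.911 | 0.919 | 0.935 | 0.953 | 0.964 |
| Case II | 0.8 | 0.881 | 0.896 | 0.907 | 0.946 | 0.968 | 0.982 |
| Case II | 0.85 | 0.880 | 0.896 | 0.906 | 0.950 | 0.973 | 0.987 |
| Case II | 0.9 | 0.879 | 0.895 | 0.905 | 0.950 | 0.975 | 0.989 |
| Case II | 0.95 | 0.873 | 0.892 | 0.903 | 0.921 | 0.943 | 0.957 |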

The outlier mean and outlier variance tests perform quite well in Case I and Case II. This supports the observation by Tomlins et al. (2005) that, when outliers exist in influential genes, gene expression techniques should give more consideration to the outliers. The following table displays the results when $v_Y$ is used to construct the quantile based outlier variance test.

Table 9. Power performances of the new outlier variance test with the implemented estimate of $v_Y$ ( = 0.1)

| Case | $\gamma$ | $\theta = 2$ | $\theta = 4$ | $\theta = 6$ |
|---|---|---|---|---|
| Case I | 0.8 | 0.507 | 0.602 | 0.687 |
| Case I | 0.85 | 0.527 | 0.637 | 0.739 |
| Case I | 0.9 | 0.535 | 0.656 | 0.767 |
| Case I | 0.95 | 0.455 | 0.519 | 0.566 |
| Case II | 0.8 | 0.500 | 0.595 | 0.682 |
| Case II | 0.85 | 0.516 | 0.627 | 0.731 |
| Case II | 0.9 | 0.518 | 0.639 | 0.752 |
| Case II | 0.95 | 0.436 | 0.499 | 0.548 |

The results shown above are less powerful than those obtained with the estimate of $v_X$. Moreover, the version of test (5.1) based on $v_Y$ does not have the benefit of avoiding the estimation of density values.

6. Simulation study

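We consider a simulation study comparing the quantile based outlier variance with the outlier mean and the classical two-sample t test. Define the estimates

$$\hat\beta_X = \frac{1}{n_1}\sum_{i=1}^{n_1} I(X_i \ge \hat F_X^{-1}(\gamma)), \qquad \hat\mu_{Xout} = \frac{\sum_{i=1}^{n_1} X_i\, I(X_i \ge \hat F_X^{-1}(\gamma))}{\sum_{i=1}^{n_1} I(X_i \ge \hat F_X^{-1}(\gamma))},$$

$$\hat\sigma^2_{Xout} = \frac{\sum_{i=1}^{n_1} (X_i - \hat\mu_{Xout})^2\, I(X_i \ge \hat F_X^{-1}(\gamma))}{\sum_{i=1}^{n_1} I(X_i \ge \hat F_X^{-1}(\gamma))}.$$

Suppose that we have the test statistic $T = n_2^{1/2}\hat v_X^{-1/2}(S^2_{Yout} - \hat\sigma^2_{Xout})$ and that its observation at the $i$th of $m$ replications is $T_i$. We search for a constant $c$ such that

$$0.05 \approx \frac{1}{m}\sum_{i=1}^{m} I(T_i \ge c \mid H_0: F_Y = F_X) \tag{6.1}$$

and then apply this constant as the cutoff point to evaluate the power

$$\frac{1}{m}\sum_{i=1}^{m} I(T_i \ge c \mid H_1).$$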

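In the following tables, we list the simulated probability under $H_0$ at the simulated constant $c$ and the simulated powers under the distributions of Case I and Case II. Here $p_t$, $p_m$ and $p_v$ denote the two-sample t test, the outlier mean test and the outlier variance test, respectively.

Table 10. Power performance comparison by simulation (Case I, $n_1 = n_2$)

| $\gamma$ | Test | $c$ | $H_0$ | $\theta = 2$ | $\theta = 4$ | $\theta = 6$ |
|---|---|---|---|---|---|---|
| | $p_t$ | | 0.049 | 0.459 | 0.482 | 0.504 |
| 0.5 | $p_m$ | 2.16 | 0.0527 | 0.9109 | 0.9303 | 0.9419 |
| 0.5 | $p_v$ | 4.48 | 0.051 | 0.9569 | 0.9597 | 0.9597 |
| 0.55 | $p_m$ | 2.23 | 0.0501 | 0.9167 | 0.9332 | 0.9443 |
| 0.55 | $p_v$ | 4.98 | 0.0508 | 0.9588 | 0.959 | 0.9592 |
| 0.6 | $p_m$ | 2.28 | 0.0504 | 0.9192 | 0.9355 | 0.9443 |
| 0.6 | $p_v$ | 5.35 | 0.0506 | 0.9581 | 0.9596 | 0.9596 |
| 0.65 | $p_m$ | 2.37 | 0.0523 | 0.9227 | 0.9394 | 0.9474 |
| 0.65 | $p_v$ | 6.1 | 0.0508 | 0.9582 | 0.9602 | 0.9599 |
| 0.7 | $p_m$ | 2.48 | 0.0513 | 0.9227 | 0.9387 | 0.9469 |
| 0.7 | $p_v$ | 6.68 | 0.0503 | 0.9581 | 0.9599 | 0.9493 |
| 0.75 | $p_m$ | 2.74 | 0.0511 | 0.9225 | 0.9388 | 0.9493 |
| 0.75 | $p_v$ | 8.23 | 0.0492 | 0.9562 | 0.9589 | 0.9606 |
| 0.8 | $p_m$ | 2.96 | 0.0526 | 0.9243 | 0.9388 | 0.9486 |
| 0.8 | $p_v$ | 9.5 | 0.0498 | 0.9559 | 0.9576 | 0.9589 |
| 0.85 | $p_m$ | 3.8 | 0.0508 | 0.9169 | 0.9332 | 0.942 |
| 0.85 | $p_v$ | 13.9 | 0.0519 | 0.9496 | 0.952 | 0.9528 |
| 0.9 | $p_m$ | 4.81 | 0.051 | 0.9034 | 0.926 | 0.9368 |
| 0.9 | $p_v$ | 21.3 | 0.0507 | 0.9388 | 0.944 | 0.9444 |
| 0.95 | $p_m$ | 20.8 | 0.0502 | 0.6608 | 0.7208 | 0.7659 |
| 0.95 | $p_v$ | 200 | 0.2932 | 0.9296 | 0.9315 | 0.9293 |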

Table 11. Power performance comparison by simulation (Case II, $n_1 = n_2 = 30$)

| $\gamma$ | Test | $c$ | $H_0$ | $\theta = 2$ | $\theta = 4$ | $\theta = 6$ |
|---|---|---|---|---|---|---|
| | $p_t$ | | 0.049 | 0.448 | 0.473 | 0.492 |
| 0.5 | $p_m$ | 2.35 | 0.0497 | 0.8881 | 0.9119 | 0.9281 |
| 0.5 | $p_v$ | 6.35 | 0.05 | 0.9506 | 0.9562 | 0.9582 |
| 0.55 | $p_m$ | 2.42 | 0.0501 | 0.8932 | 0.9166 | 0.9304 |
| 0.55 | $p_v$ | 7.18 | 0.0501 | 0.9496 | 0.9565 | 0.9587 |
| 0.6 | $p_m$ | 2.47 | 0.0508 | 0.8918 | 0.9159 | 0.9336 |
| 0.6 | $p_v$ | 7.64 | 0.0509 | 0.9483 | 0.9552 | 0.9596 |
| 0.65 | $p_m$ | 2.65 | 0.0492 | 0.8925 | 0.9167 | 0.9316 |
| 0.65 | $p_v$ | 8.83 | 0.0507 | 0.9467 | 0.9545 | 0.9582 |
| 0.7 | $p_m$ | 2.75 | 0.051 | 0.8956 | 0.917 | 0.9344 |
| 0.7 | $p_v$ | 9.8 | 0.0497 | 0.9465 | 0.9541 | 0.9592 |
| 0.75 | $p_m$ | 3.05 | 0.05 | 0.8924 | 0.9168 | 0.9308 |
| 0.75 | $p_v$ | 11.87 | 0.0508 | 0.9429 | 0.9532 | 0.9572 |
| 0.8 | $p_m$ | 3.36 | 0.0494 | 0.8847 | 0.9109 | 0.9288 |
| 0.8 | $p_v$ | 13.67 | 0.0496 | 0.9378 | 0.95 | 0.9548 |
| 0.85 | $p_m$ | 4.25 | 0.0503 | 0.868 | 0.9001 | 0.9185 |
| 0.85 | $p_v$ | 20.5 | 0.0505 | 0.9221 | 0.9366 | 0.9439 |
| 0.9 | $p_m$ | 5.45 | 0.0505 | 0.8366 | 0.8775 | 0.9019 |
| 0.9 | $p_v$ | 31 | 0.0506 | 0.8935 | 0.9152 | 0.9258 |
| 0.95 | $p_m$ | 23 | 0.0502 | 0.5262 | 0.588 | 0.6364 |
| 0.95 | $p_v$ | 200 | 0.3004 | 0.9289 | 0.9281 | 0.927 |

We have several comments on these simulated results:

(a) The outlier mean and outlier variance techniques are both more powerful than the two-sample t test, showing that using all the data for inference is not appropriate here.

(b) More interestingly, the outlier variance is the most efficient method in this comparison.

7. Appendix

Three assumptions for the two-sample outlier variance test are as follows.

ASSUMPTION 1: The limit $\lambda_{xy} = \lim_{n_1, n_2 \to \infty} n_1^{-1} n_2$ exists.

ASSUMPTION 2: The probability density function $f_X$ of the distribution $F_X$ is bounded away from zero in neighborhoods of $F_X^{-1}(\alpha)$ for $\alpha \in (0,1)$ and of the population cutoff point $\eta$.

ASSUMPTION 3: The probability density function $f_Y$ is bounded away from zero in a neighborhood of the population cutoff point $\eta$.

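Proof of Theorem 3.1: With Assumption 2, a representation of $\hat F_X^{-1}(\alpha)$ such as

$$n_1^{1/2}\{\hat F_X^{-1}(\alpha) - F_X^{-1}(\alpha)\} = f_X^{-1}\{F_X^{-1}(\alpha)\}\, n_1^{-1/2}\sum_{i=1}^{n_1} \big[\alpha - I\{X_i \le F_X^{-1}(\alpha)\}\big] + o_p(1) \tag{7.1}$$

implies that $\hat\eta = 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)$ satisfies $T = n_1^{1/2}(\hat\eta - \eta) = O_p(1)$ (Ruppert & Carroll, 1980). First, we can rewrite the sample outlier variance as

$$S^2_{Yout} = \Big(\sum_{i=1}^{n_2} I(Y_i \ge \hat\eta)\Big)^{-1} \sum_{i=1}^{n_2} \big[(Y_i - \mu_{Yout})^2 - (\bar Y_{out} - \mu_{Yout})^2\big]\, I(Y_i \ge \hat\eta). \tag{7.2}$$

A Bahadur representation of $\bar Y_{out}$ in Chen, Chen and Chan (2009) indicates that $n_2^{1/2}(\bar Y_{out} - \mu_{Yout}) = O_p(1)$, which leads to the fact that $n_2^{1/2}(\bar Y_{out} - \mu_{Yout})^2 = o_p(1)$, and we may write the sample outlier variance in the following form:

$$
\begin{aligned}
n_2^{1/2}(S^2_{Yout} - \sigma^2_{Yout})
={}& n_2^{1/2}\Big(\sum_{i=1}^{n_2} I(Y_i \ge \hat\eta)\Big)^{-1} \sum_{i=1}^{n_2} \big[(Y_i - \mu_{Yout})^2 - \sigma^2_{Yout}\big]\big[I(Y_i \ge \eta + n_1^{-1/2}T) - I(Y_i \ge \eta)\big] \\
&+ n_2^{1/2}\Big(\sum_{i=1}^{n_2} I(Y_i \ge \hat\eta)\Big)^{-1} \sum_{i=1}^{n_2} \big[(Y_i - \mu_{Yout})^2 - \sigma^2_{Yout}\big]\, I(Y_i \ge \eta) + o_p(1).
\end{aligned}
\tag{7.3}
$$

With Assumptions 1 and 3, and techniques from Ruppert & Carroll (1980) and Chen & Chiang (1996), we may see that

$$n_2^{-1/2}\sum_{i=1}^{n_2} \big[(Y_i - \mu_{Yout})^2 - \sigma^2_{Yout}\big]\big[I(Y_i \ge \eta + n_1^{-1/2}T) - I(Y_i \ge \eta)\big] = -\{(\eta - \mu_{Yout})^2 - \sigma^2_{Yout}\}\, f_Y(\eta)\,\lambda_{xy}^{1/2}\, T + o_p(1). \tag{7.4}$$

The first factor on the right hand side of (7.2) may be formulated as

$$n_2^{-1}\sum_{i=1}^{n_2} I(Y_i \ge \hat\eta) = n_2^{-1}\sum_{i=1}^{n_2} I(Y_i \ge \eta) + o_p(1). \tag{7.5}$$

Plugging (7.1) into (7.4), the theorem follows from (7.3)-(7.5).

The proof of Theorem 5.1 is quite similar to the one above, so we skip it.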

References

Chen, L.-A., Chen, Dung-Tsa and Chan, Wenyaw. (2008). The p Value for the Outlier Sum in Differential Gene Expression Analysis. Biometrika, 97, 246-253.

Chen, L.-A. and Chiang, Y. C. (1996). Symmetric type quantile and trimmed means for location and linear regression model. Journal of Nonparametric Statistics, 7, 171-185.

Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of American Statistical Association, 75, 828-838.

Tibshirani, R. and Hastie, T. (2007). Outlier sums differential gene expression analysis. Biostatistics, 8, 2-8.

Tomlins, S. A., Rhodes, D. R., Perner, S., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 310, 644-648.

Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics, 8, 566-575.
