
Coverage Intervals for Gamma and Exponential Distributions

Table 3.

Powers for Normal distribution $N(\mu, \sigma^2)$ (one-sided)

                  n = 20    n = 30    n = 50    n = 100   n = 500
₁ = 1, ₂ = -1     0.00613   0.00540   0.00485   0.00446   0.00415
₁ = 1, ₂ = 1      0.28592   0.27725   0.27022   0.26489   0.26059
₁ = 1, ₂ = -2     0.00025   0.00020   0.00017   0.00015   0.00013
₁ = 1, ₂ = 2      0.65648   0.65076   0.64605   0.64244   0.63950
₁ = 2, ₂ = -1     0.03662   0.03579   0.03514   0.03466   0.03428
₁ = 2, ₂ = 1      0.57603   0.57415   0.57267   0.57156   0.57068
₁ = 2, ₂ = -2     0.00269   0.00258   0.00250   0.00244   0.00239
₁ = 2, ₂ = 2      0.81616   0.88124   0.88095   0.88073   0.88056

Table 4.

ARL for Normal distribution $N(\mu, \sigma^2)$ (one-sided)

                  n = 20    n = 30    n = 50    n = 100   n = 500
₁ = 1, ₂ = -1     162.97    185.09    206.09    224.20    240.40
₁ = 1, ₂ = 1      3.4974    3.6068    3.7006    3.7751    3.8374
₁ = 1, ₂ = -2     3913.6    4784.8    5676.4    6494.5    7264.3
₁ = 1, ₂ = 2      1.5233    1.5367    1.5479    1.5566    1.5637
₁ = 2, ₂ = -1     27.307    27.939    28.454    28.843    29.164
₁ = 2, ₂ = 1      1.7360    1.7417    1.7462    1.7496    1.7523
₁ = 2, ₂ = -2     371.39    386.786   399.583   409.474   417.573
₁ = 2, ₂ = 2      1.2252    1.1348    1.1351    1.1354    1.1356

4. Coverage Intervals for Gamma and Exponential Distributions

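Consider the Gamma distribution $\Gamma(k, \theta)$ with pdf of the form

$$f(x) = \frac{1}{\Gamma(k)\,\theta^{k}}\, x^{k-1} e^{-x/\theta}, \quad x > 0.$$

The $\gamma$th quantile of this distribution is $F^{-1}(\gamma) = \frac{\theta}{2}\chi^2_{2k}(\gamma)$. The one-sided $1-\alpha$ coverage interval is $C(1-\alpha) = \left(0, \frac{\theta}{2}\chi^2_{2k}(1-\alpha)\right)$. With mle $\hat\theta = \frac{1}{nk}\sum_{i=1}^{n} x_i$, a sample coverage interval is

$$\hat C(1-\alpha) = \left(0,\ \frac{\sum_{i=1}^{n} x_i}{2nk}\,\chi^2_{2k}(1-\alpha)\right).$$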
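The sample coverage interval above can be computed with a few lines of code. The following is a minimal sketch, not the authors' implementation: it assumes the shape $k$ is a known positive integer, so that the chi-square quantile with $2k$ degrees of freedom can be obtained from the Erlang distribution function by bisection using only the standard library. The function names are hypothetical.

```python
import math

def erlang_cdf(x: float, k: int) -> float:
    """CDF of a Gamma(shape=k, scale=1) variable for integer k (Erlang)."""
    if x <= 0.0:
        return 0.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= x / j          # x^j / j!
        total += term
    return 1.0 - math.exp(-x) * total

def chi2_quantile_2k(p: float, k: int) -> float:
    """p-th quantile (0 < p < 1) of chi-square with 2k degrees of freedom."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, k) < p:   # expand until the quantile is bracketed
        hi *= 2.0
    for _ in range(200):           # bisection on the Erlang CDF
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    # chi-square(2k) equals 2 * Gamma(k, 1), so return twice the midpoint
    return lo + hi

def gamma_coverage_upper(xs, k: int, alpha: float = 0.05) -> float:
    """Upper end of the one-sided sample coverage interval (0, upper)."""
    n = len(xs)
    return sum(xs) / (2 * n * k) * chi2_quantile_2k(1.0 - alpha, k)

if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0]       # illustrative data, not from the paper
    print(gamma_coverage_upper(sample, k=1))
```

For $k = 1$ the Gamma distribution reduces to the exponential distribution, and the upper limit becomes $-\bar{x}\ln\alpha$.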

Suppose that the true coverage interval is $C(1-\alpha) = \left(0, \frac{\theta_0}{2}\chi^2_{2k}(1-\alpha)\right)$. The power is then a function of the parameter $\theta$.

We list the power and ARL results for this Gamma distribution in Tables 5 and 6.

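Table 5.

Powers for Gamma distribution $\Gamma(k, \theta)$ (one-sided)

$\theta = 0.5$    $\theta = 1$    $\theta = 5$    $\theta = 20$

Table 6.

ARL for Gamma distribution $\Gamma(k, \theta)$ (one-sided)

$\theta = 0.5$    $\theta = 1$    $\theta = 5$    $\theta = 20$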

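For the two-sided coverage interval $\frac{\theta}{2}\left(\chi^2_{2k}(\frac{\alpha}{2}),\ \chi^2_{2k}(1-\frac{\alpha}{2})\right)$, its estimate is $\frac{\sum_{i=1}^{n} x_i}{2nk}\left(\chi^2_{2k}(\frac{\alpha}{2}),\ \chi^2_{2k}(1-\frac{\alpha}{2})\right)$. We then see that the power of this coverage interval estimate is

$$\pi(\theta) = 1 - P\left(\frac{\theta_0}{2k\theta}\,\chi^2_{2k}\!\left(\frac{\alpha}{2}\right) \le F(2k, 2nk) \le \frac{\theta_0}{2k\theta}\,\chi^2_{2k}\!\left(1-\frac{\alpha}{2}\right)\right).$$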
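One way to see where the $F(2k, 2nk)$ distribution comes from is sketched below. The setup is an assumption inferred from the text rather than stated in it: the sample $X_1, \ldots, X_n$ comes from the healthy population $\Gamma(k, \theta_0)$, and an independent new observation $X$ comes from $\Gamma(k, \theta)$ with the same known shape $k$. Write $\hat C$ for the estimated two-sided interval.

```latex
\begin{align*}
\frac{2X}{\theta} \sim \chi^2_{2k}, \qquad
\frac{2\sum_{i=1}^{n} X_i}{\theta_0} \sim \chi^2_{2nk}
&\;\Longrightarrow\;
F = \frac{X/(k\theta)}{\sum_{i=1}^{n} X_i/(nk\theta_0)}
  = \frac{\theta_0}{\theta}\cdot\frac{nX}{\sum_{i=1}^{n} X_i}
  \sim F(2k, 2nk), \\[4pt]
X \in \hat C
&\;\Longleftrightarrow\;
\frac{\chi^2_{2k}(\alpha/2)}{2k} \le \frac{nX}{\sum_{i=1}^{n} X_i}
  \le \frac{\chi^2_{2k}(1-\alpha/2)}{2k} \\[4pt]
&\;\Longleftrightarrow\;
\frac{\theta_0}{2k\theta}\,\chi^2_{2k}(\alpha/2) \le F
  \le \frac{\theta_0}{2k\theta}\,\chi^2_{2k}(1-\alpha/2).
\end{align*}
```

Under these assumptions the power is one minus the probability of the last event, so it depends on $\theta$ only through the ratio $\theta_0/\theta$.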

Some of the power and ARL results for this two sided consideration are listed in Tables 7 and 8.

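Table 7.

Powers for Gamma distribution $\Gamma(k, \theta)$ (two-sided)

$\theta = 0.5$    $\theta = 1$    $\theta = 5$    $\theta = 20$

Table 8.

ARL for Gamma distribution $\Gamma(k, \theta)$ (two-sided)

$\theta = 0.5$    $\theta = 1$    $\theta = 5$    $\theta = 20$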

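Let $X_1, \ldots, X_n$ be a random sample drawn from the exponential distribution $\mathrm{Exp}(\theta)$.

An appropriate estimate of $\theta$ is $\bar X$, and then a sample $100(1-\alpha)\%$ coverage interval is

$$\left(-\bar X \ln\!\left(1-\frac{\alpha}{2}\right),\ -\bar X \ln\!\left(\frac{\alpha}{2}\right)\right).$$

Suppose that the parameter for healthy people is $\theta_0$. The type I error probability is derived as follows:

The probability of type II error when the true parameter is $\theta$ is

$$\beta = P(\text{Type II error}) = P\left(-\frac{1}{\theta}\ln\!\left(1-\frac{\alpha}{2}\right) \le F(2, 2n) \le -\frac{1}{\theta}\ln\!\left(\frac{\alpha}{2}\right)\right).$$

We list the results in Tables 9 and 10.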
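The two-sided power and ARL can be evaluated without numerical integration. The sketch below rests on three assumptions that are not spelled out in the text: the row parameter in the tables acts as the ratio $\theta/\theta_0$, the level is $1-\alpha = 0.95$, and ARL is the reciprocal of the power (which agrees with the paired entries of Tables 9 and 10). It uses the fact that, for a new observation $X$ with mean $\theta$ and an independent sample mean $\bar X$ of $n$ exponentials with mean $\theta_0$, $P(X > c\bar X) = (1 + c\,\theta_0/(n\theta))^{-n}$. The function names are hypothetical.

```python
import math

def exceed_prob(c: float, n: int, ratio: float) -> float:
    """P(X > c * Xbar), where ratio = theta / theta0 (assumed meaning)."""
    return (1.0 + c / (n * ratio)) ** (-n)

def power_two_sided(n: int, ratio: float, alpha: float = 0.05) -> float:
    """Probability that a new observation falls outside the sample interval."""
    lower_c = -math.log(1.0 - alpha / 2.0)   # multiplier of Xbar, lower end
    upper_c = -math.log(alpha / 2.0)         # multiplier of Xbar, upper end
    below = 1.0 - exceed_prob(lower_c, n, ratio)
    above = exceed_prob(upper_c, n, ratio)
    return below + above

def arl(power: float) -> float:
    """Average run length, taken here as 1 / power (an assumption)."""
    return 1.0 / power

if __name__ == "__main__":
    for n in (5, 20, 30, 50):
        p = power_two_sided(n, ratio=1.0)
        print(n, round(p, 5), round(arl(p), 3))
```

With `ratio=1.0` and `n=5` this gives a power close to the 0.08803 entry of Table 9, which suggests the closed form matches the quantity tabulated there.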

Table 9.

Powers for Exponential distribution $\mathrm{Exp}(\theta)$ (two-sided) (Assume $\theta_0 =$ )

                  n = 5     n = 20    n = 30    n = 50
$\theta = 0.2$    0.11795   0.11855   0.11867   0.11876
$\theta = 0.5$    0.05988   0.05118   0.05069   0.05037
$\theta = 0.8$    0.06916   0.04690   0.04484   0.04328
$\theta = 1$      0.08803   0.05884   0.05582   0.05345
$\theta = 1.5$    0.15203   0.11506   0.11080   0.10738
$\theta = 2$      0.22060   0.18388   0.17954   0.17602
$\theta = 2.5$    0.28451   0.25090   0.24690   0.24366
$\theta = 3$      0.34147   0.31161   0.30806   0.30518

Table 10.

ARL for Exponential distribution $\mathrm{Exp}(\theta)$ (two-sided) (Assume $\theta_0 =$ )

                  n = 5     n = 20    n = 30    n = 50
$\theta = 0.2$    8.4777    8.4349    8.4267    8.4201
$\theta = 0.5$    16.697    19.536    19.724    19.850
$\theta = 0.8$    14.459    21.320    22.296    23.100
$\theta = 1$      11.358    16.993    17.913    18.706
$\theta = 1.5$    6.5775    8.6910    9.0245    9.3119
$\theta = 2$      4.5329    5.4382    5.5697    5.6809
$\theta = 2.5$    3.5147    3.9856    4.0501    4.1040
$\theta = 3$      2.9285    3.2091    3.2461    3.2767

Let us now consider the one-sided coverage interval $(0, -\theta\ln\alpha)$, which is estimated by $(0, -\bar X\ln\alpha)$. The probability of type II error is

$$\beta = P(\text{Type II error}) = P\left(0 < F(2, 2n) \le -\frac{1}{\theta}\ln\alpha\right).$$

Again with $1-\alpha = 0.95$, we list the power and ARL in Tables 11 and 12.
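The one-sided power can also be checked by simulation, which makes the underlying event explicit: draw a healthy sample, form the estimated interval, and see whether a new observation exceeds its upper limit. The sketch below is illustrative only. It assumes $\theta_0 = 1$ as a normalization, treats the row parameter as the ratio $\theta/\theta_0$, and compares the Monte Carlo estimate with the closed form $(1 - \ln\alpha\,/(n\,\theta/\theta_0))^{-n}$; the function names, seed and replication count are hypothetical choices.

```python
import math
import random

def power_one_sided_exact(n: int, ratio: float, alpha: float = 0.05) -> float:
    """Closed form for P(X > -Xbar * ln(alpha)), with ratio = theta / theta0."""
    c = -math.log(alpha)
    return (1.0 + c / (n * ratio)) ** (-n)

def power_one_sided_mc(n: int, ratio: float, alpha: float = 0.05,
                       reps: int = 40000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    theta0, theta = 1.0, ratio          # theta0 = 1 is a normalization
    c = -math.log(alpha)
    hits = 0
    for _ in range(reps):
        # healthy sample of size n with mean theta0
        xbar = sum(rng.expovariate(1.0 / theta0) for _ in range(n)) / n
        # new observation with mean theta
        x_new = rng.expovariate(1.0 / theta)
        if x_new > c * xbar:            # outside the interval (0, -Xbar ln alpha)
            hits += 1
    return hits / reps

if __name__ == "__main__":
    exact = power_one_sided_exact(5, 1.0)
    approx = power_one_sided_mc(5, 1.0)
    print(round(exact, 5), round(approx, 5), round(1.0 / exact, 3))
```

The simulated value fluctuates around the closed form with a standard error of roughly 0.0015 at this replication count, so agreement to two decimal places is what should be expected.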

Table 11.

Powers for Exponential distribution $\mathrm{Exp}(\theta)$ (one-sided) (Assume $\theta_0 =$ )

                  n = 5     n = 20    n = 30    n = 50
$\theta = 0.2$    0.00098   0.00001   0.00000   0.00000
$\theta = 0.5$    0.01947   0.00529   0.00424   0.00318
$\theta = 0.8$    0.06111   0.03230   0.02934   0.02702
$\theta = 1$      0.09562   0.06172   0.05753   0.05450
$\theta = 1.5$    0.18631   0.14902   0.14464   0.14109
$\theta = 2$      0.26977   0.23588   0.21848   0.22858
$\theta = 2.5$    0.34157   0.31230   0.30882   0.30600
$\theta = 3$      0.40235   0.37740   0.37444   0.37204

Table 12.

ARL for Exponential distribution $\mathrm{Exp}(\theta)$ (one-sided) (Assume $\theta_0 =$ )

                  n = 5     n = 20    n = 30    n = 50
$\theta = 0.2$    1018.5    71428     188679    500000
$\theta = 0.5$    51.336    188.80    235.69    286.80
$\theta = 0.8$    16.363    30.954    34.081    37.005
$\theta = 1$      10.457    16.201    17.381    18.346
$\theta = 1.5$    5.3673    6.7101    6.7136    7.0873
$\theta = 2$      3.7068    4.2394    4.5770    4.3748
$\theta = 2.5$    2.9276    3.2020    3.2381    3.2679
$\theta = 3$      2.4854    2.6497    2.6706    2.6878

Topic 2:

p Value of an Outlier Sum in Differential Gene Expression Analysis

Abstract

The outlier sum has been proposed in Tibshirani and Hastie (2007) and Wu (2007) for detection of differential genes in cancer studies where one or several disease groups show unusually high gene expression in a subset of their samples. A new outlier sum is proposed that allows us to develop its asymptotic distribution theory for formulating a p value. Since it is a function of some distributional parameters, this p value may be computed parametrically or nonparametrically. We further formulate this p value parametrically when a normal distribution for the gene variables is assumed. To investigate this p value, we perform a simulation and conduct a real data analysis, which indicates that this outlier sum not only allows us to compute p values for genes but is also flexible for treatment of various distributional structures for gene variables.

Key words: Gene expression analysis; outlier sum; p value.

5. Introduction

Microarray technology, by probing thousands of genes simultaneously, has been successfully used in medical research to classify different diseases (see, for example, Agrawal et al. (2002); Alizadeh et al. (2000); Ohki et al. (2005); Sorlie et al. (2003)). For example, two molecular subtypes of breast cancer (two distinct gene expression patterns), the luminal A and basal-like subtypes, have been reported to have different clinical outcomes (see Sorlie et al. (2003)). Another example is diffuse large B-cell lymphoma (DLBCL). Patients with one particular molecular pattern, germinal centre B-like DLBCL, had significantly better overall survival than those with another molecular pattern, activated B-like DLBCL (see Alizadeh et al. (2000)). Furthermore, microarray analysis has been advanced to identify outlier genes which are over-expressed only in a small number of disease samples (see Beer et al. (2002); Tibshirani and Hastie (2007); Tomlins et al. (2005)), such as recurrent chromosomal rearrangements (one type of chromosomal mutation), which are common in lymphoma and leukemia but rare in other cancers. Standard statistical methods for two-group comparisons (e.g., t-tests) are limited in their ability to identify such genes when distinguishing tumor from normal samples.

Several statistical approaches have been proposed to address this issue of finding those genes where only a subset of the samples has high expression. Among the proposals, Tomlins et al. (2005) introduced a method called cancer outlier profile analysis (COPA). Later, Tibshirani and Hastie (2007) introduced a sum of the values in the cancer group, called the outlier sum, and showed that the technique of outlier sums is noticeably better in simulation of p values than the technique of COPA. There is an alternative outlier-sum-like statistic proposed by Wu (2007). Basically, these methods pool, in various ways, an outlier score, which is a standardized score centered at the median and scaled by the median absolute deviation. A larger outlier score indicates an outlier gene. The outlier sum statistics are very promising in detecting genes where only a subset of their samples have high expression. Unfortunately, without a distribution theory for the outlier sum statistic, its power (see the simulations in Tibshirani and Hastie (2007)) in gene expression analysis relies on the number of genes with samples having high expression being known. However, this is usually not true in practice, and then there is no natural cutoff point to decide the number of influential genes.
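The outlier score described above can be sketched for a single gene as follows. This is a minimal illustration and not the statistic proposed in this paper: it assumes the form commonly attributed to Tibshirani and Hastie (2007), in which expression values are standardized by the pooled median and median absolute deviation (MAD), and a disease-group score counts as an outlier when it exceeds the third quartile plus the interquartile range of the pooled scores. The cutoff rule, the unscaled MAD, the quartile method, and the function name `outlier_sum` are all assumptions made for illustration.

```python
import statistics

def outlier_sum(normal, disease):
    """Sum of standardized disease-group values above q75 + IQR (one gene)."""
    pooled = list(normal) + list(disease)
    med = statistics.median(pooled)
    mad = statistics.median([abs(x - med) for x in pooled])
    if mad == 0:
        raise ValueError("MAD is zero; scores cannot be standardized")

    def score(x):
        # standardized score: centered at the median, scaled by the MAD
        return (x - med) / mad

    pooled_scores = [score(x) for x in pooled]
    q1, _, q3 = statistics.quantiles(pooled_scores, n=4, method="inclusive")
    cutoff = q3 + (q3 - q1)             # assumed outlier threshold
    return sum(s for s in map(score, disease) if s > cutoff)

if __name__ == "__main__":
    normal = [0, 1, 2, 3, 4]            # illustrative values, not real data
    disease = [1, 2, 3, 20, 30]         # two samples with high expression
    print(outlier_sum(normal, disease))
```

A gene whose disease group contains no unusually high values gets an outlier sum of zero, while a few very large values dominate the sum, which is why the statistic targets genes over-expressed in only a subset of samples.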

We propose non-standardized outlier sum statistics and develop a technique for computing p values for genes. One interesting result is that this technique will generally produce a cutoff point to classify the genes into outlier genes and non-outlier genes. So, this would not require that there is only one outlier gene. The studies of gene expression detection such as the t test, Tibshirani and Hastie (2007) and Wu (2007) all assume that the underlying distributions for all genes are normal. Hence, under this distribution, we further derive a simpler formula for p values and perform simulations to evaluate its ability to detect outlier genes. The formula developed in this paper makes the study of p values under other parametric distributions and with nonparametric techniques straightforward; however, we do not pursue this further here.
