



 )

(z)dz: (7.6)

The p value of (7.6) uses only $\bar{x}$ and $s_x$, which estimate $\mu_x$ and $\sigma_x$, to form the required plug-in estimates. Under the normality assumption, the computation of the p value is very simple. If $G_x$ and $G_y$ are known but not normal, the procedure for establishing the p value may be derived analogously.

8. Simulation and Data Analysis

We wish to evaluate the ability of the outlier sum to detect significant genes through the p values of genes. We restrict this evaluation to the case where the underlying distributions are normal, as is generally assumed in the approaches of Tibshirani and Hastie (2007) and Wu (2007). Under the normal assumption, the outlier sum statistic may be formulated as

$$\tilde{\ }_b = \sum_{i=1}^{n_2} Y_i \, I\left(Y_i > \bar{X} + 3k z_{0.75} S_x\right) \qquad (8.1)$$

where $\bar{X}$ and $S_x$ are, respectively, the sample mean and sample standard deviation based on the sample from the normal group, and $z_{0.75}$ is the 0.75 quantile of the standard normal distribution. With $k = 1$ this outlier sum is equivalent to the proposal of Wu (2007). It is therefore of interest to study, through simulation and data analysis, the choice of the constant $k$ for detecting significant genes.
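As a minimal sketch of (8.1), the statistic can be computed directly from the two samples. The function name `outlier_sum` and the toy data are illustrative assumptions, not part of the original analysis; only the threshold $\bar{X} + 3k z_{0.75} S_x$ is taken from the text.

```python
from statistics import NormalDist, mean, stdev

# 0.75 quantile of the standard normal distribution (about 0.6745).
Z075 = NormalDist().inv_cdf(0.75)


def outlier_sum(x, y, k=1.0):
    """Return (statistic, threshold) for the outlier sum of (8.1).

    x: expressions in the normal group; y: expressions in the second group.
    The function name and return shape are illustrative assumptions.
    """
    # Threshold uses the normal-group sample mean and sample standard deviation.
    threshold = mean(x) + 3.0 * k * Z075 * stdev(x)
    # Sum the second-group observations that exceed the threshold.
    statistic = sum(yi for yi in y if yi > threshold)
    return statistic, threshold


# Toy data (hypothetical): the normal group has mean 0 and sample sd 1.
x = [-1.0, 0.0, 1.0]
y = [0.5, 2.5, 5.0]
for k in (1, 2, 3):
    print(k, outlier_sum(x, y, k))
```

Raising $k$ moves the threshold further into the tail, so fewer observations of the second group contribute to the sum.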

We conduct two simulations. First, the classical t test has been criticized because it occasionally declares hundreds of genes influential when 10 thousand genes are investigated. Hence, we generate $n_1 = 20$ and $n_2 = 20$ observations from $N(0, 1)$ and conduct 1 million replications of this data generation, computing the p value of (7.6) in each replication. Setting the significance level $\alpha = 0.001, 0.01, 0.05$ and the constant $k = 1, 2, 3$, we count the p values smaller than the corresponding significance level. The results are displayed in Table 15.

Table 15. Numbers of p values smaller than α in 1 million replications

α        k = 1    k = 2    k = 3
0.05     57808    460      5
0.01     25231    86       2
0.001    9632     23       1

We draw two conclusions from the results in Table 15:

(a) Consider $k = 1$. If $\alpha = 0.05$, more than 50 thousand of the 1 million replications are declared influential. So, if there are 10 thousand genes in total, about 500 or more genes are identified as influential. Similarly, $\alpha = 0.01$ and $\alpha = 0.001$ lead to, respectively, 200 and 90 or more genes identified as influential. This shows that the outlier sum with $k = 1$, which is equivalent to Wu (2007), still suffers from declaring too many influential genes.

(b) Consider $k = 2$. The results show that when the number of genes is about 10 thousand, only a very small number of genes will be identified as influential. With $k = 3$, on the other hand, almost no gene will be identified as influential. Hence, based on this simulation, $k = 2$ or $3$ is an appropriate constant for constructing the outlier sum.
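The scaling used in (a) and (b) is simple proportion: a count out of 1 million null replications is rescaled to a study of 10 thousand genes. The sketch below reproduces that arithmetic from the Table 15 counts; the dictionary layout and function name are illustrative assumptions.

```python
REPLICATIONS = 1_000_000   # null replications in the simulation
GENES = 10_000             # size of a typical gene study, as in the text

# Counts from Table 15, keyed by significance level and then by k.
TABLE_15 = {
    0.05: {1: 57808, 2: 460, 3: 5},
    0.01: {1: 25231, 2: 86, 3: 2},
    0.001: {1: 9632, 2: 23, 3: 1},
}


def expected_false_positives(count, genes=GENES, replications=REPLICATIONS):
    """Rescale a count per `replications` null genes to `genes` null genes."""
    return count * genes / replications


for alpha, by_k in TABLE_15.items():
    scaled = {k: expected_false_positives(c) for k, c in by_k.items()}
    print(alpha, scaled)
```

For $k = 1$ this gives roughly 578, 252 and 96 genes at the three levels, in line with the figures quoted in (a); for $k = 2$ the expected counts are below 5 in every case.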

For the second simulation, we suppose that influential genes are present and evaluate how efficiently the p value approach for the outlier sum detects them. Let $(s, h)$ be a fixed index for gene data generation. We again generate $n_1 = 20$ and $n_2 = 20$ observations from $N(0, 1)$, but we add $h$ units to $s$ of the $n_2$ observations in the second group. We then compute the p value of (7.6).

This process is repeated 10 thousand times and we compute the average p value. We perform this simulation for several values of $s$ and $h$ and display the resulting average p values in Tables 16 and 17.
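The data-generation step of this simulation can be sketched as follows. The function names are illustrative assumptions, and the p value of (7.6) is not reproduced here: `average_over_replications` accepts any user-supplied function of the two samples in its place.

```python
import random


def generate_gene(s, h, n1=20, n2=20, rng=None):
    """Draw both groups from N(0, 1), then add h to s members of group 2."""
    if s > n2:
        raise ValueError("s cannot exceed n2")
    if rng is None:
        rng = random.Random()
    x = [rng.gauss(0.0, 1.0) for _ in range(n1)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n2)]
    # Which s observations are shifted is immaterial for i.i.d. draws;
    # the first s are used here for simplicity.
    for i in range(s):
        y[i] += h
    return x, y


def average_over_replications(fn, s, h, reps=10_000, seed=0):
    """Average fn(x, y) over independent replications of one (s, h) setting.

    fn is a placeholder for the p value of (7.6), which is defined elsewhere.
    """
    rng = random.Random(seed)
    total = sum(fn(*generate_gene(s, h, rng=rng)) for _ in range(reps))
    return total / reps


# Example with a stand-in function: the largest value in the second group.
print(average_over_replications(lambda x, y: max(y), s=3, h=2.0, reps=1000))
```

Setting $(s, h) = (0, 0)$ gives the null case, in which both groups come from the same distribution.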

Table 16. Average p values of outlier sum

(s, h)    k = 1    k = 2    k = 3

Table 17. Average p values of outlier sum

(s, h)    k = 4    k = 5    k = 6

We draw several conclusions from Tables 16 and 17:

(a) Consider the case $(s, h) = (0, 0)$. Reassuringly, for every value of $k$ the outlier sum has an average p value above 0.4, which indicates no statistical significance for genes that are in fact non-influential.

(b) Consider $k = 1$ and $(s, h) \neq (0, 0)$. Apart from a few cases, the average p values are small enough to classify these genes efficiently as influential genes. Is $k = 1$ then appropriate for constructing the outlier sum? Recall that $k = 1$ may occasionally generate too many influential genes, as we have seen in Table 15. So it is good at detecting influential genes but produces a non-negligible type I error.

(c) Consider $k = 2$. The simulation results for $(s, h) = (0, 0)$ in Table 16 show that it produces only a negligible type I error. For $(s, h) \neq (0, 0)$, the outlier sum performs very well when $h$ is far enough from 0. Balancing the two types of error, $k = 2$ seems to be an appropriate choice for the outlier sum.

(d) For $k > 2$, the table results suggest that the outlier sum is not efficient at detecting influential genes in any of the situations with $(s, h) \neq (0, 0)$.

We now consider an application of the p value of the outlier sum to real gene data. The breast cancer microarray data reported by Huang et al. (2003) contain the expression levels of 12625 genes from 37 (or 52) breast tumor samples. Each sample has a binary outcome describing the status of lymph node involvement in breast cancer (or breast cancer recurrence). Among them, 19 samples had no positive nodes (or 34 samples had no cancer recurrence and 18 samples had breast cancer recurrence). The gene expressions were obtained from the Affymetrix Human U95a chip. We pre-processed the data using RMA (Irizarry et al. (2003)).

We first compute the p values of (7.6) for various values of $k$. The following table displays the numbers no<0.001 of genes that are classified as significant because their p values are less than 0.001.

Table 18. Numbers of genes with p values smaller than 0.001

k        no<0.001
1        5583
1.5      2407
2        922
3        158
4        35
5        8
6        5

We have several comments drawn from the results in Table 18:

(a) We have seen that $\hat{H}_a$ is the proposal of Wu (2007) and that $\hat{H}_b$ with $k = 1$ is asymptotically equivalent to $\hat{H}_a$ when the underlying distribution is assumed to be normal. The number of significant genes for $\hat{H}_b$ with $k = 1$ is 5583. This huge number shows that this gene data set is definitely not appropriate for analysis by the outlier sum proposals introduced so far. For the other cases with $k \le 3$, the numbers of genes claimed to be significant are still too large for further investigation.

(b) When $k$ is as large as 4, the number of significant genes is down to 35, and it goes down further to 8 when $k = 5$. This shows that gene data may need an outlier sum with a more extreme threshold to narrow the potential group of genes for further study.

In the following table, we select the cases $k = 5$ and $6$ and list, for reference, the gene numbers with significant p values together with their outlier sum values.

Table 19. Gene numbers with their outlier sums (OS) associated with p value

k = 5                       k = 6
Gene number    OS           Gene number    OS
4029           27.88125     4029           27.88125
4028           31.40937     4028           31.40937
10210          16.62765     10210          16.62765
3758           7.615114     3758           7.615114
8972           6.014273     8972           6.014273
10987          5.93685
10019          10.82669
198            10.14491

Detecting significant genes through the p values of the outlier sum resolves a difficulty of the classical outlier sum technique, which is not able to detect significant genes when their number is unknown. But how should the constant $k$ in the outlier sum of (8.1) be chosen? We propose to list the numbers of significant genes for various values of $k$ and to select a $k$ that yields a moderately small group of significant genes.
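The proposed selection of $k$ can be sketched as a two-step procedure: count the significant genes for each $k$, then pick a $k$ whose count is moderately small. The text does not quantify "moderately small", so the `max_genes` budget below is a hypothetical parameter, and both function names are illustrative assumptions. The counts are those of Table 18.

```python
def count_significant(p_values_by_k, alpha=0.001):
    """Map each k to the number of genes whose p value is below alpha."""
    return {k: sum(p < alpha for p in pvals)
            for k, pvals in p_values_by_k.items()}


def smallest_k_within_budget(counts, max_genes):
    """Return the smallest k with between 1 and max_genes significant genes.

    max_genes is a hypothetical follow-up budget, not a rule from the text.
    Returns None when no k qualifies.
    """
    eligible = [k for k, n in sorted(counts.items()) if 0 < n <= max_genes]
    return eligible[0] if eligible else None


# Counts reported in Table 18 (numbers of genes with p value below 0.001).
TABLE_18 = {1: 5583, 1.5: 2407, 2: 922, 3: 158, 4: 35, 5: 8, 6: 5}

print(smallest_k_within_budget(TABLE_18, max_genes=50))
print(smallest_k_within_budget(TABLE_18, max_genes=10))
```

With a budget of 50 genes this rule picks $k = 4$, and with a budget of 10 genes it picks $k = 5$; other budgets, or a different rule altogether, may suit other studies.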

References

(2004). Guide to the Expression of Uncertainty in Measurement, Supplement 1: Numerical Methods for the Propagation of Distributions. Draft of JCGM document, p. 38.

Chen, L.-A., Huang, J.-Y. and Chen, H.-C. (2007). Parametric coverage interval. Metrologia, 44, L7-L9.

Agrawal, D., Chen, T., Irby, R., et al. (2002). Osteopontin identified as lead marker of colon cancer progression, using pooled sample expression profiling. J. Natl. Cancer Inst., 94, 513-521.

Alizadeh, A. A., Eisen, M. B., Davis, R. E., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.

Beer, D. G., Kardia, S. L., Huang, C. C., et al. (2002). Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med., 8, 816-824.

Chen, L.-A. and Chiang, Y.-C. (1996). Symmetric quantiles and trimmed means for location and linear regression model. Journal of Nonparametric Statistics, 7, 171-185.

Huang, E., Cheng, S. H., Dressman, H., et al. (2003). Gene expression predictors of breast cancer outcomes. Lancet, 361, 1590-1596.

Irizarry, R., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf, U. and Speed, T. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 2, 249-64.

Ohki, R., Yamamoto, K., Ueno, S., et al. (2005). Gene expression profiling of human atrial myocardium with atrial fibrillation by DNA microarray analysis. Int. J. Cardiol., 102, 233-238.

Sorlie, T., Tibshirani, R., Parker, J., et al. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci. U.S.A., 100, 8418-8423.

Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75, 828-838.

Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics, 8, 2-8.

Tomlins, S. A., Rhodes, D. R., Perner, S., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 310, 644-648.

Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics.
