point
We have observed that the outlier proportion may have satisfactory power when consistent estimators $\hat\beta_X$ and $\hat\sigma$ are available to construct the test in (3.1). However, $\hat\sigma$ involves estimating the densities $f_Y$ and $f_X$ at tail quantile points, which is extremely difficult in practice. Without an alternative that avoids this density estimation, the outlier proportion based test will not be practically powerful in detecting influential genes unless $n_1$ and $n_2$, the numbers of disease group subjects and normal group subjects, are very large.
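In this section, we choose the cutoff point $\hat\eta = \hat F_X^{-1}(\gamma)$ for some $\gamma > 0$. To avoid confusion, we denote the outlier proportion as
$$\hat\beta_Y = \frac{1}{n_2}\sum_{i=1}^{n_2} I\bigl(Y_i \ge \hat F_X^{-1}(\gamma)\bigr)$$
for estimating $\beta_Y = P(Y \ge F_X^{-1}(\gamma))$. We first study the differences between the two population outlier proportions under the following distribution setting:
$$X \sim \mathrm{Laplace}(0, 1), \qquad Y \sim 0.9\,\mathrm{Laplace}(0, 1) + 0.1\,\mathrm{Laplace}(\theta, \sigma).$$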
Table 4. Population outlier proportions
The differences between the two population proportions are substantial when the quantile percentage is 0.9 or 0.95. This shows that using a quantile as the cutoff point for detecting outliers is quite satisfactory.
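As a minimal sketch of the estimator defined above, the following Python code computes the sample outlier proportion using an empirical quantile of the X sample as the cutoff. The function names, the order-statistic definition of the empirical quantile, the sample sizes, and the choice of shift 3 with unit scale for the contaminating Laplace component are illustrative assumptions, not settings confirmed by Table 4.

```python
import math
import random


def empirical_quantile(sample, gamma):
    """Return the ceil(n * gamma)-th order statistic (one common empirical quantile)."""
    ordered = sorted(sample)
    k = max(1, math.ceil(len(ordered) * gamma))
    return ordered[k - 1]


def outlier_proportion(x, y, gamma):
    """Fraction of y at or above the empirical gamma-quantile of x."""
    cutoff = empirical_quantile(x, gamma)
    return sum(1 for value in y if value >= cutoff) / len(y)


def draw_laplace(rng, loc=0.0, scale=1.0):
    """The difference of two unit exponentials is Laplace(0, 1); shift and rescale it."""
    return loc + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


rng = random.Random(0)
n1 = n2 = 5000      # hypothetical sample sizes
shift = 3.0         # hypothetical location of the contaminating component
x = [draw_laplace(rng) for _ in range(n1)]
# Y is a 0.9 / 0.1 mixture of Laplace(0, 1) and a shifted Laplace.
y = [draw_laplace(rng, shift) if rng.random() < 0.1 else draw_laplace(rng)
     for _ in range(n2)]

for gamma in (0.9, 0.95):
    print(gamma, outlier_proportion(x, x, gamma), outlier_proportion(x, y, gamma))
```

Applied to the X sample itself, the proportion should sit close to one minus the quantile percentage, while the contaminated Y sample would be expected to give a larger value.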
A large sample theory for this quantile based outlier proportion is stated below.
Theorem 4.1.
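Suppose that assumptions (A2) and (A3) in the Appendix are true. Then $n_2^{1/2}(\hat\beta_Y - \beta_Y)$ converges in distribution to $N(0, \sigma_Y^2)$, where
$$\sigma_Y^2 = \gamma(1-\gamma)\,\lambda_{xy}\, f_Y^{2}(F_X^{-1}(\gamma))\, f_X^{-2}(F_X^{-1}(\gamma)) + \beta_Y(1-\beta_Y).$$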
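To make the variance formula concrete, the sketch below evaluates it for normal distributions using only the Python standard library. The function and argument names are hypothetical, and the settings (quantile percentage 0.9, equal sample sizes so that the limiting ratio is 1, and a unit shift under the alternative) are assumptions chosen for illustration.

```python
from statistics import NormalDist


def asymptotic_variance(gamma, ratio_xy, f_x_at_q, f_y_at_q, beta_y):
    """Quantile-estimation term plus binomial term, as in the theorem's variance."""
    quantile_part = gamma * (1.0 - gamma) * ratio_xy * (f_y_at_q / f_x_at_q) ** 2
    binomial_part = beta_y * (1.0 - beta_y)
    return quantile_part + binomial_part


gamma = 0.9
dist_x = NormalDist(0.0, 1.0)
q = dist_x.inv_cdf(gamma)            # population cutoff: gamma-quantile of X

# Null case: Y has the same distribution as X, so the density ratio is 1
# and the outlier proportion equals 1 - gamma.
var_null = asymptotic_variance(gamma, 1.0, dist_x.pdf(q), dist_x.pdf(q),
                               1.0 - dist_x.cdf(q))

# Shift alternative with a hypothetical shift of 1.
dist_y = NormalDist(1.0, 1.0)
var_alt = asymptotic_variance(gamma, 1.0, dist_x.pdf(q), dist_y.pdf(q),
                              1.0 - dist_y.cdf(q))

print(var_null, var_alt)
```

In the null case both terms equal 0.9 × 0.1, so the variance is approximately 0.18 and no density estimate is needed. Under the alternative the density ratio no longer cancels, which is where the estimation difficulty discussed in the text arises.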
To construct a test statistic based on the above theorem, we still face the problem of estimating $\sigma_Y^2$, which involves estimating the density values $f_Y(F_X^{-1}(\gamma))$ and $f_X(F_X^{-1}(\gamma))$; this is difficult unless the sample sizes are very large.
An approximate power for the outlier proportion based on this quantile cutoff point at significance level $\alpha$ may be derived as follows: $P_{F_Y}\{\sqrt{n_2}(\hat\beta_Y - \hat\beta_X$
It is of interest to compare the powers of the outlier mean and the outlier proportion when both use the quantile cutoff point. First, we consider the following two location shift models:
Case 1: $X \sim N(0, 1)$ and $Y \sim N(\theta, 1)$
Case 2: $X \sim \mathrm{Laplace}(0, 1)$ and $Y \sim \mathrm{Laplace}(\theta, 1)$
The resulting powers are displayed in the following table.
Table 5. Approximate powers of outlier mean and outlier proportion
Columns: Power, $\theta = 1$, $\theta = 2$, $\theta = 4$
In these location shift models, the outlier proportion again performs better than the outlier mean. This further supports applying the outlier proportion in gene expression analysis.
Motivated by the observation of Tomlins et al. (2005), it is of interest to further compare the powers when only a small percentage of outliers is present in the distribution of Y. We evaluate the approximate power for the following two mixed distributions:
Case A: $X \sim \mathrm{Laplace}(0, 1)$ and $Y \sim 0.7\,\mathrm{Laplace}(0, 1) + 0.3\,N(\theta, 1)$
Case B: $X \sim t(5)$ and $Y \sim 0.7\,t(5) + 0.3\,\mathrm{Laplace}(\theta, 1)$
The results are listed in Table 6.
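Table 6. Approximate powers of outlier mean and outlier proportion

| Case | $\gamma$ | Power | $\theta = 2$ | $\theta = 3$ | $\theta = 4$ |
|---|---|---|---|---|---|
| A | 0.85 | $p_m$ | 0.107 | 0.553 | 0.986 |
| A | 0.85 | $p_p$ | 0.634 | 0.809 | 0.839 |
| A | 0.9 | $p_m$ | 0.086 | 0.252 | 0.504 |
| A | 0.9 | $p_p$ | 0.565 | 0.815 | 0.878 |
| A | 0.95 | $p_m$ | 0.125 | 0.156 | 0.237 |
| A | 0.95 | $p_p$ | 0.424 | 0.690 | 0.881 |
| B | 0.85 | $p_m$ | 0.335 | 0.926 | 0.999 |
| B | 0.85 | $p_p$ | 0.637 | 0.774 | 0.818 |
| B | 0.9 | $p_m$ | 0.185 | 0.640 | 0.987 |
| B | 0.9 | $p_p$ | 0.623 | 0.805 | 0.858 |
| B | 0.95 | $p_m$ | 0.177 | 0.205 | 0.458 |
| B | 0.95 | $p_p$ | 0.499 | 0.779 | 0.880 |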
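The approximate powers shown in Table 6 indicate that the outlier proportion remains a good choice in these distributional settings, especially for small shifts and for the 0.95 quantile, although the outlier mean attains higher power at some larger shifts ($\theta = 4$ with $\gamma = 0.85$ in Case A; $\theta = 3, 4$ with $\gamma = 0.85$ and $\theta = 4$ with $\gamma = 0.9$ in Case B). Let us consider one more distributional setting for comparison:
Mixed t: $X \sim t(10)$ and $Y \sim 0.9\,t(10) + 0.1\,(\chi^2(10) + \theta)$
The results are displayed in Table 7.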
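Table 7. Approximate powers of outlier mean and outlier proportion for some mixed distributions

| $\gamma$ | Power | $\theta = 2$ | $\theta = 4$ | $\theta = 6$ |
|---|---|---|---|---|
| 0.9 | $p_m$ | 0.879 | 0.895 | 0.905 |
| 0.9 | $p_p$ | 0.873 | 0.953 | 0.960 |
| 0.95 | $p_m$ | 0.873 | 0.892 | 0.903 |
| 0.95 | $p_p$ | 0.900 | 0.957 | 0.970 |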
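Both methods have high power in this distributional setting; however, the outlier proportion based test is still the better one, except at the smallest shift with the 0.9 quantile, where the two powers are nearly equal (0.873 against 0.879).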
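For readers who wish to experiment with the mixed t setting, the following is a minimal sampling sketch using only the Python standard library. The function names are hypothetical, and the construction of the t and chi-square variates from normal and gamma draws is an implementation choice made here, not something specified in the text.

```python
import math
import random


def draw_chi2(rng, df):
    """Chi-square(df) is a gamma variate with shape df / 2 and scale 2."""
    return rng.gammavariate(df / 2.0, 2.0)


def draw_student_t(rng, df):
    """t(df) as a standard normal divided by sqrt(chi-square(df) / df)."""
    return rng.gauss(0.0, 1.0) / math.sqrt(draw_chi2(rng, df) / df)


def sample_mixed_t(rng, n, shift, weight=0.1, df=10):
    """Draw n values from (1 - weight) * t(df) + weight * (chi-square(df) + shift)."""
    return [draw_chi2(rng, df) + shift if rng.random() < weight
            else draw_student_t(rng, df)
            for _ in range(n)]


rng = random.Random(42)
x = [draw_student_t(rng, 10) for _ in range(1000)]   # reference sample
y = sample_mixed_t(rng, 1000, shift=2.0)             # contaminated sample
print(max(x), max(y))
```

Because the contaminating component is a shifted chi-square variate, it is nonnegative before the shift, so every contaminated draw lies at or above the shift value.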
5. Simulation Study
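Suppose now that we have estimates $\hat\beta_X$ and $\hat\sigma_X$ of $\beta_X$ and $\sigma_X$, respectively. A test based on the quantile-based outlier proportion is stated in (4.1). Let
$$\hat\beta_X = \frac{1}{n_1}\sum_{i=1}^{n_1} I\bigl(X_i \ge \hat F_X^{-1}(\gamma)\bigr), \qquad \hat\lambda_{xy} = \frac{n_2}{n_1}, \qquad \hat\sigma_X = \gamma(1-\gamma)\,\hat\lambda_{xy} + \hat\beta_Y(1-\hat\beta_Y).$$
The question is whether this is, in practice, a level $\alpha$ test.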
Theoretically, the critical point $z_\alpha$ is 1.645 when the significance level is 0.05. We conduct $m = 100000$ replications to compute the simulated probability
$$p_p = \frac{1}{m}\sum_{j=1}^{m} I\left(\frac{n_2^{1/2}(\hat\beta_Y - \hat\beta_X)}{\hat\sigma_X} \ge \ell\right). \qquad (5.1)$$
When we set $\ell = 1.645$, (5.1) represents the probability of type I error. The results for several distributions and various sample sizes are displayed in the following table.
Table 8. Simulated probability of type I error when $z_\alpha = 1.645$

| Sample size | $N(0, 1)$ | $t(10)$ | $\mathrm{Laplace}(0, 1)$ |
|---|---|---|---|
| n = 30 | 0.1156 | 0.1178 | 0.1174 |
| n = 50 | 0.1328 | 0.1327 | 0.1341 |
| n = 100 | 0.1133 | 0.1125 | 0.1134 |
| n = 200 | 0.1258 | 0.1238 | 0.1243 |
| n = 500 | 0.1197 | 0.1211 | 0.1198 |
| n = 1000 | 0.1285 | 0.1273 | 0.1264 |
| n = 10000 | 0.1203 | 0.1213 | 0.1205 |
| n = 100000 | 0.1199 | 0.1201 | 0.1198 |
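The simulated probability in (5.1) can be sketched as a short Monte Carlo loop. This is only an illustration under stated assumptions: the quantile percentage is set to 0.9, both samples have the same size, the scale in the denominator is taken to be the square root of the plug-in expression, and far fewer replications are used than the 100000 reported. All function names are hypothetical, and the output should not be expected to reproduce Table 8.

```python
import math
import random


def empirical_quantile(sample, gamma):
    """Return the ceil(n * gamma)-th order statistic (one common empirical quantile)."""
    ordered = sorted(sample)
    k = max(1, math.ceil(len(ordered) * gamma))
    return ordered[k - 1]


def standardized_statistic(x, y, gamma):
    """sqrt(n2) * (proportion in y - proportion in x) divided by a plug-in scale."""
    cutoff = empirical_quantile(x, gamma)
    beta_x = sum(v >= cutoff for v in x) / len(x)
    beta_y = sum(v >= cutoff for v in y) / len(y)
    ratio_xy = len(y) / len(x)
    plug_in = gamma * (1.0 - gamma) * ratio_xy + beta_y * (1.0 - beta_y)
    # Assumption: the square root of the plug-in expression is used as the scale.
    return math.sqrt(len(y)) * (beta_y - beta_x) / math.sqrt(plug_in)


def rejection_rate(draw, n1, n2, gamma, ell, m, seed=0):
    """Fraction of m replications in which the statistic is at least ell."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(m):
        x = [draw(rng) for _ in range(n1)]
        y = [draw(rng) for _ in range(n2)]
        if standardized_statistic(x, y, gamma) >= ell:
            rejections += 1
    return rejections / m


def draw_standard_normal(rng):
    return rng.gauss(0.0, 1.0)


# Both samples come from N(0, 1), so every rejection is a type I error.
rate = rejection_rate(draw_standard_normal, n1=30, n2=30,
                      gamma=0.9, ell=1.645, m=2000)
print(f"simulated rejection rate under H0: {rate:.4f}")
```

Swapping `draw_standard_normal` for another sampler gives the corresponding column of such a table; comparing the printed rate with 0.05 is the check the text describes.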
Unfortunately, (4.1) is not, in practice, a level 0.05 test. For each distribution, we therefore choose a constant $\ell$ such that (5.1) is approximately equal to 0.05 under the null hypothesis, and then simulate the power of (5.1) under the following Case I and Case II distributions:
Case I: $X \sim N(0, 1)$ and $Y \sim 0.9\,N(0, 1) + 0.1\,(\chi^2(10) + \theta)$
Case II: $X \sim t(10)$ and $Y \sim 0.9\,t(10) + 0.1\,(\chi^2(10) + \theta)$
The results are displayed in Table 9 and Table 10.
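The two-step procedure described above (calibrate the critical constant under the null, then estimate power under the alternative) might be sketched as follows for Case I. The calibration rule, the quantile percentage 0.9, the sample size, the number of replications, and the square-root scaling of the plug-in variance are all assumptions made for this illustration; the function names are hypothetical and the numbers produced are not those of Table 9.

```python
import math
import random


def empirical_quantile(sample, gamma):
    ordered = sorted(sample)
    k = max(1, math.ceil(len(ordered) * gamma))
    return ordered[k - 1]


def standardized_statistic(x, y, gamma):
    cutoff = empirical_quantile(x, gamma)
    beta_x = sum(v >= cutoff for v in x) / len(x)
    beta_y = sum(v >= cutoff for v in y) / len(y)
    plug_in = gamma * (1.0 - gamma) * len(y) / len(x) + beta_y * (1.0 - beta_y)
    # Assumption: the square root of the plug-in expression is used as the scale.
    return math.sqrt(len(y)) * (beta_y - beta_x) / math.sqrt(plug_in)


def calibrated_cutoff(null_stats, level):
    """Smallest observed value such that, absent ties, at most a fraction `level`
    of the simulated null statistics are at or above it."""
    ordered = sorted(null_stats)
    allowed = int(len(ordered) * level)   # number of null rejections permitted
    if allowed == 0:
        return math.inf
    return ordered[len(ordered) - allowed]


def draw_normal(rng):
    return rng.gauss(0.0, 1.0)


def draw_case_one(rng, shift):
    """0.9 * N(0, 1) + 0.1 * (chi-square(10) + shift)."""
    if rng.random() < 0.1:
        return rng.gammavariate(5.0, 2.0) + shift   # chi-square(10) = gamma(5, scale 2)
    return rng.gauss(0.0, 1.0)


def simulate_stats(draw_x, draw_y, n, gamma, m, rng):
    stats = []
    for _ in range(m):
        x = [draw_x(rng) for _ in range(n)]
        y = [draw_y(rng) for _ in range(n)]
        stats.append(standardized_statistic(x, y, gamma))
    return stats


rng = random.Random(1)
n, gamma, m, level = 50, 0.9, 1000, 0.05
null_stats = simulate_stats(draw_normal, draw_normal, n, gamma, m, rng)
ell = calibrated_cutoff(null_stats, level)

for shift in (2.0, 4.0, 6.0):
    alt_stats = simulate_stats(draw_normal, lambda r: draw_case_one(r, shift),
                               n, gamma, m, rng)
    power = sum(s >= ell for s in alt_stats) / m
    print(shift, round(ell, 3), power)
```

Because the statistic takes only finitely many values for a given sample size, ties among the simulated null statistics can move the realized level away from the 0.05 target, which is consistent with the text's wording that (5.1) is made only approximately equal to 0.05.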
Table 9. Power performance comparison by simulation (Case I)
Columns: $H_0$, $\theta = 2$, $\theta = 4$, $\theta = 6$
Table 10. Power performance comparison by simulation (Case II)
Columns: $H_0$, $\theta = 2$, $\theta = 4$, $\theta = 6$
The outlier mean and outlier proportion techniques are both powerful in these distributional settings. More interestingly, the outlier proportion is the more efficient method in this comparison.
6. Appendix
Three assumptions for the asymptotic representation of the sample outlier proportion test are as follows.
1. The limit $\lambda_{xy} = \lim_{n_1, n_2 \to \infty} n_2 / n_1$ exists.
2. The probability density function $f_X$ of distribution $F_X$ is bounded away from zero in neighborhoods of $F_X^{-1}(\gamma)$ for $\gamma \in (0, 1)$ and of the population cutoff point $\eta$.
3. The probability density function $f_Y$ is bounded away from zero in a neighborhood of the population cutoff point $\eta$.
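Proof of Theorem 3.1.
From the expression of $\hat\beta_Y$ in (3.1), we have
$$n_2^{1/2}(\hat\beta_Y - \beta_Y) = -n_2^{-1/2}\sum_{i=1}^{n_2}\bigl[I(Y_i \le \eta + n_1^{-1/2}T_n) - I(Y_i \le \eta)\bigr] + n_2^{-1/2}\sum_{i=1}^{n_2}\bigl(I(Y_i \ge \eta) - \beta_Y\bigr), \qquad (6.1)$$
where
$$T_n = n_1^{1/2}(\hat\eta - \eta) = n_1^{1/2}\bigl[\bigl(2\hat F_X^{-1}(1-\gamma) - \hat F_X^{-1}(\gamma)\bigr) - \bigl(2F_X^{-1}(1-\gamma) - F_X^{-1}(\gamma)\bigr)\bigr].$$
With assumption (3), the key step in this proof is that
$$n_2^{-1/2}\sum_{i=1}^{n_2}\bigl[I(Y_i \le \eta + n_1^{-1/2}T_n) - I(Y_i \le \eta)\bigr] = \lambda_{xy}^{1/2} f_Y(\eta)\,T_n + o_p(1), \qquad (6.2)$$
which may be seen in Ruppert and Carroll (1980) and Chen and Chiang (1996).
With the following representation of the empirical quantile,
$$\sqrt{n_1}\bigl(\hat F_X^{-1}(\gamma) - F_X^{-1}(\gamma)\bigr) = f_X^{-1}(F_X^{-1}(\gamma))\, n_1^{-1/2}\sum_{i=1}^{n_1}\bigl[\gamma - I(X_i \le F_X^{-1}(\gamma))\bigr] + o_p(1) \qquad (6.3)$$
(see, for example, Ruppert and Carroll (1980)), a Bahadur representation of the outlier proportion follows from (6.1)-(6.3) as
$$n_2^{1/2}(\hat\beta_Y - \beta_Y) = n_1^{-1/2}\sum_{i=1}^{n_1}\bigl[(\gamma b_1 - (1-\gamma) b_2)\, I(X_i \le F_X^{-1}(\gamma)) + \gamma (b_1 + b_2)\, I(F_X^{-1}(\gamma) \le X_i \le F_X^{-1}(1-\gamma)) + (-(1-\gamma) b_1 + \gamma b_2)\, I(X_i \ge F_X^{-1}(1-\gamma))\bigr] + n_2^{-1/2}\sum_{i=1}^{n_2}\bigl[I(Y_i \ge \eta) - \beta_Y\bigr] + o_p(1).$$
The asymptotic distribution in Theorem 3.1 then follows from the Central Limit Theorem.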
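As a worked sketch of how the variance in Theorem 4.1 would follow from the same argument, assume that the X and Y samples are independent and that the cutoff is the single quantile $\eta = F_X^{-1}(\gamma)$ of Section 4, so that $T_n = n_1^{1/2}(\hat F_X^{-1}(\gamma) - F_X^{-1}(\gamma))$. The symbols $\gamma$ (quantile percentage), $\eta$ (cutoff), $\beta_Y$ (outlier proportion) and $\lambda_{xy}$ (limit of $n_2/n_1$) are notation assumed here for illustration, and the variances of the first two lines are asymptotic.

```latex
\begin{align*}
n_1^{1/2}\bigl(\hat F_X^{-1}(\gamma) - F_X^{-1}(\gamma)\bigr)
  &\xrightarrow{d} N\!\left(0,\ \frac{\gamma(1-\gamma)}{f_X^{2}(\eta)}\right)
  && \text{by (6.3) and the Central Limit Theorem,} \\
\operatorname{Var}\!\left(\lambda_{xy}^{1/2} f_Y(\eta)\, T_n\right)
  &\to \gamma(1-\gamma)\,\lambda_{xy}\, f_Y^{2}(\eta)\, f_X^{-2}(\eta)
  && \text{by (6.2),} \\
\operatorname{Var}\!\left(n_2^{-1/2}\sum_{i=1}^{n_2}\bigl(I(Y_i \ge \eta) - \beta_Y\bigr)\right)
  &= \beta_Y(1-\beta_Y)
  && \text{(Bernoulli indicators),} \\
\sigma_Y^{2}
  &= \gamma(1-\gamma)\,\lambda_{xy}\, f_Y^{2}(\eta)\, f_X^{-2}(\eta) + \beta_Y(1-\beta_Y)
  && \text{(independent terms add).}
\end{align*}
```

The first term depends only on the X sample and the second only on the Y sample, which is why no covariance term appears under the independence assumption.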
References
Chen, L.-A. and Chiang, Y. C. (1996). Symmetric type quantile and trimmed means for location and linear regression model. Journal of Nonparametric Statistics, 7, 171-185.
Chen, L.-A., Chen, D.-T. and Chan, W. (2010). The p Value for the Outlier Sum in Differential Gene Expression Analysis. Biometrika, 97, 246-253.
Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75, 828-838.
Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics, 8, 2-8.
Tomlins, S. A., Rhodes, D. R., Perner, S., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 310, 644-648.
Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics,