An Outlier Proportion Test With Empirical Quantile Cutoff Point

We have observed that the outlier proportion may have satisfactory power when consistent estimators $\hat{\pi}_X$ and $\hat{\sigma}$ are available to construct the test in (3.1). However, $\hat{\sigma}$ involves estimating the densities $f_Y$ and $f_X$ at the cutoff point, and estimating a density function at tail quantile points is extremely difficult in practice. Without an alternative proposal that avoids this density estimation, the outlier proportion based test will not be practically powerful in detecting influential genes unless $n_1$ and $n_2$, the number of disease group subjects and the number of normal group subjects, are very large.

In this section, we choose the cutoff point $\hat{\eta} = \hat{F}_X^{-1}(\beta)$ for some $\beta > 0$. To avoid confusion, we denote the outlier proportion as

$$\hat{\pi}_Y = \frac{1}{n_2}\sum_{i=1}^{n_2} I\bigl(Y_i \ge \hat{F}_X^{-1}(\beta)\bigr)$$

for estimating $\pi_Y = P\bigl(Y \ge F_X^{-1}(\beta)\bigr)$. We first study the differences of two population outlier proportions under the following distribution setting:

$$X \sim \mathrm{Laplace}(0, 1), \qquad Y \sim 0.9\,\mathrm{Laplace}(0, 1) + 0.1\,\mathrm{Laplace}(\mu, \sigma).$$

Table 4. Population outlier proportions

(Table body not recoverable from the source.)

It is seen that the differences between the two population proportions are quite significant when the quantile percentage $\beta$ is 0.9 or 0.95. This shows that using a quantile as the cutoff point for detecting outliers is quite satisfactory.
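The sample version of this quantile based outlier proportion can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the empirical quantile is taken as the inverse of the empirical distribution function, and an observation is counted as an outlier when it is at or above the cutoff.

```python
import math


def empirical_quantile(sample, beta):
    """Smallest order statistic x_(k) with k/n >= beta (inverse empirical CDF)."""
    ordered = sorted(sample)
    # The small offset guards against floating-point overshoot in beta * n.
    k = max(1, math.ceil(beta * len(ordered) - 1e-9))
    return ordered[k - 1]


def outlier_proportion(x, y, beta):
    """Fraction of the y sample at or above the empirical beta-quantile of x."""
    cutoff = empirical_quantile(x, beta)
    return sum(1 for yi in y if yi >= cutoff) / len(y)


# Toy data (hypothetical): the 0.9 empirical quantile of 1..10 is 9,
# and two of the four y values are at or above it.
x = list(range(1, 11))
y = [0, 5, 9, 12]
print(outlier_proportion(x, y, 0.9))
```

Other quantile conventions (for example, interpolated quantiles) would give slightly different cutoffs in small samples; the choice here is only one reasonable option.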

A large sample theory for this quantile based outlier proportion is stated below.

Theorem 4.1.

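Suppose that assumptions (A2) and (A3) in the Appendix are true. Then $n_2^{1/2}(\hat{\pi}_Y - \pi_Y)$ converges in distribution to $N(0, \sigma_Y^2)$, where

$$\sigma_Y^2 = \beta(1-\beta)\,\lambda_{xy}\, f_Y^2\bigl(F_X^{-1}(\beta)\bigr)\, f_X^{-2}\bigl(F_X^{-1}(\beta)\bigr) + \pi_Y(1-\pi_Y).$$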
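To make the variance formula concrete, the sketch below evaluates it numerically. The setting is an assumption chosen only for illustration (normal populations with a unit location shift, equal sample sizes so that the ratio $n_2/n_1$ is 1), and the function name is hypothetical.

```python
from statistics import NormalDist


def asymptotic_variance(beta, lam, f_x, f_y, pi_y):
    """beta(1 - beta) * lam * (f_y / f_x)^2 + pi_y(1 - pi_y).

    f_x and f_y are the two densities evaluated at the cutoff F_X^{-1}(beta),
    lam is the limiting ratio n2/n1, and pi_y is P(Y >= cutoff).
    """
    return beta * (1.0 - beta) * lam * (f_y / f_x) ** 2 + pi_y * (1.0 - pi_y)


# Illustrative setting (an assumption): X ~ N(0, 1), Y ~ N(1, 1), n1 = n2.
beta, lam = 0.9, 1.0
dist_x, dist_y = NormalDist(0.0, 1.0), NormalDist(1.0, 1.0)
cutoff = dist_x.inv_cdf(beta)        # population cutoff F_X^{-1}(beta)
pi_y = 1.0 - dist_y.cdf(cutoff)      # population outlier proportion of Y
sigma2 = asymptotic_variance(beta, lam, dist_x.pdf(cutoff), dist_y.pdf(cutoff), pi_y)
print(f"cutoff={cutoff:.4f}, pi_Y={pi_y:.4f}, sigma_Y^2={sigma2:.4f}")
```

When the two distributions coincide, the density ratio is 1 and $\pi_Y = 1 - \beta$, so the expression reduces to $\beta(1-\beta)(\lambda_{xy} + 1)$, which involves no density at all.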

To construct a test statistic based on the above theorem, we still face the problem of estimating $\sigma_Y^2$, which involves estimating the density values $f_Y(F_X^{-1}(\beta))$ and $f_X(F_X^{-1}(\beta))$; this is difficult unless there is a huge [...]

An approximate power for the outlier proportion based on this quantile cutoff point at significance level $\alpha$ may be derived as follows:

$P_{F_Y}\{\sqrt{n_2}\,(\hat{\pi}_Y - \hat{\pi}_X$ [...]

It is of interest to compare the powers of the outlier mean and the outlier proportion when both use the quantile cutoff point. First, we consider the following two location shift models:

Case 1: $X \sim N(0, 1)$ and $Y \sim N(\mu, 1)$

Case 2: $X \sim \mathrm{Laplace}(0, 1)$ and $Y \sim \mathrm{Laplace}(\mu, 1)$

We display the resulting powers in the following table.

Table 5. Approximate powers of outlier mean and outlier proportion

Columns: Power, $\mu = 1$, $\mu = 2$, $\mu = 4$. (Table body not recoverable from the source.)

In these location shift models, the outlier proportion is again better than the outlier mean. This further indicates the appropriateness of applying the outlier proportion in gene expression analysis.

Following the observation of Tomlins et al. (2005), it is of interest to further compare powers when only a small percentage of the distribution of $Y$ consists of outliers. We evaluate the approximate power for the following two mixed distributions:

Case A: $X \sim \mathrm{Laplace}(0, 1)$, $Y \sim 0.7\,\mathrm{Laplace}(0, 1) + 0.3\,N(\mu, 1)$

Case B: $X \sim t(5)$, $Y \sim 0.7\,t(5) + 0.3\,\mathrm{Laplace}(\mu, 1)$

The results are listed in Table 6.

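Table 6. Approximate powers of outlier mean ($p_m$) and outlier proportion ($p_p$)

| Case | $\beta$ | Power | $\mu = 2$ | $\mu = 3$ | $\mu = 4$ |
| --- | --- | --- | --- | --- | --- |
| A | 0.85 | $p_m$ | 0.107 | 0.553 | 0.986 |
| A | 0.85 | $p_p$ | 0.634 | 0.809 | 0.839 |
| A | 0.9 | $p_m$ | 0.086 | 0.252 | 0.504 |
| A | 0.9 | $p_p$ | 0.565 | 0.815 | 0.878 |
| A | 0.95 | $p_m$ | 0.125 | 0.156 | 0.237 |
| A | 0.95 | $p_p$ | 0.424 | 0.690 | 0.881 |
| B | 0.85 | $p_m$ | 0.335 | 0.926 | 0.999 |
| B | 0.85 | $p_p$ | 0.637 | 0.774 | 0.818 |
| B | 0.9 | $p_m$ | 0.185 | 0.640 | 0.987 |
| B | 0.9 | $p_p$ | 0.623 | 0.805 | 0.858 |
| B | 0.95 | $p_m$ | 0.177 | 0.205 | 0.458 |
| B | 0.95 | $p_p$ | 0.499 | 0.779 | 0.880 |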

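The approximate powers shown in Table 6 indicate that the outlier proportion is still a sound choice in these distributional settings. It is the more powerful method at the smallest shift in every configuration and throughout at the 0.95 quantile level, although the outlier mean attains higher power at the larger shifts when the quantile level is 0.85 and, in Case B, at the largest shift when it is 0.9. Let us further consider one more distributional setting,

Mixed t: $X \sim t(10)$, $Y \sim 0.9\,t(10) + 0.1\,(\chi^2(10) + \mu)$,

for comparison. The results are displayed in Table 7.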

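Table 7. Approximate powers of outlier mean ($p_m$) and outlier proportion ($p_p$) for some mixed distributions

| $\beta$ | Power | $\mu = 2$ | $\mu = 4$ | $\mu = 6$ |
| --- | --- | --- | --- | --- |
| 0.9 | $p_m$ | 0.879 | 0.895 | 0.905 |
| 0.9 | $p_p$ | 0.873 | 0.953 | 0.960 |
| 0.95 | $p_m$ | 0.873 | 0.892 | 0.903 |
| 0.95 | $p_p$ | 0.900 | 0.957 | 0.970 |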

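Both methods have high power in this distributional setting; however, the outlier proportion based test is still the better one in all but one configuration (quantile level 0.9 with the smallest shift, where the two powers are 0.879 and 0.873).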

5. Simulation Study

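Suppose that we now have estimates $\hat{\pi}_X$ and $\hat{\sigma}_X$ for $\pi_X$ and $\sigma_X$, respectively. A test based on the quantile based outlier proportion is stated in (4.1). Let

$$\hat{\pi}_X = \frac{1}{n_1}\sum_{i=1}^{n_1} I\bigl(X_i \ge \hat{F}_X^{-1}(\beta)\bigr), \qquad \hat{\lambda}_{xy} = \frac{n_2}{n_1}, \qquad \hat{\sigma}_X^2 = \beta(1-\beta)\,\hat{\lambda}_{xy} + \hat{\pi}_Y(1-\hat{\pi}_Y).$$

The question is whether this is, in practice, a level $\alpha$ test.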

Theoretically, the critical point $z_\alpha$ is 1.645 when the significance level is 0.05. We conduct $m = 100000$ replications to compute the simulated probability

$$p_p = \frac{1}{m}\sum_{j=1}^{m} I\!\left(\frac{n_2^{1/2}(\hat{\pi}_Y - \hat{\pi}_X)}{\hat{\sigma}_X} \ge \ell\right) \qquad (5.1)$$

When we set $\ell = 1.645$, (5.1) represents the probability of type I error. The results for several distributions and various sample sizes are displayed in the following table.

Table 8. Simulated probability of type I error when $z_\alpha = 1.645$

| Sample size | $N(0, 1)$ | $t(10)$ | $\mathrm{Laplace}(0, 1)$ |
| --- | --- | --- | --- |
| n = 30 | 0.1156 | 0.1178 | 0.1174 |
| n = 50 | 0.1328 | 0.1327 | 0.1341 |
| n = 100 | 0.1133 | 0.1125 | 0.1134 |
| n = 200 | 0.1258 | 0.1238 | 0.1243 |
| n = 500 | 0.1197 | 0.1211 | 0.1198 |
| n = 1000 | 0.1285 | 0.1273 | 0.1264 |
| n = 10000 | 0.1203 | 0.1213 | 0.1205 |
| n = 100000 | 0.1199 | 0.1201 | 0.1198 |
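A Monte Carlo estimate of the rejection probability in (5.1) under equal distributions can be sketched as follows. Several details are assumptions, since the text does not fix them: the quantile level is set to 0.9, both samples have the same size, the empirical quantile is the inverse empirical distribution function, and the function names are hypothetical. The number of replications is kept small for speed, so the output need not agree with Table 8.

```python
import math
import random


def empirical_quantile(sample, beta):
    """Smallest order statistic x_(k) with k/n >= beta (inverse empirical CDF)."""
    ordered = sorted(sample)
    # The small offset guards against floating-point overshoot in beta * n.
    k = max(1, math.ceil(beta * len(ordered) - 1e-9))
    return ordered[k - 1]


def proportion_statistic(x, y, beta):
    """Statistic inside (5.1): sqrt(n2) * (pi_hat_Y - pi_hat_X) / sigma_hat_X."""
    n1, n2 = len(x), len(y)
    cutoff = empirical_quantile(x, beta)
    pi_x = sum(xi >= cutoff for xi in x) / n1
    pi_y = sum(yi >= cutoff for yi in y) / n2
    # Variance estimate: beta(1 - beta) * n2/n1 + pi_hat_Y(1 - pi_hat_Y).
    var = beta * (1.0 - beta) * (n2 / n1) + pi_y * (1.0 - pi_y)
    return math.sqrt(n2) * (pi_y - pi_x) / math.sqrt(var)


def rejection_rate(draw, n, beta, crit, m, seed=0):
    """Fraction of m replications, drawn with F_Y = F_X, whose statistic reaches crit."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(m):
        x = [draw(rng) for _ in range(n)]
        y = [draw(rng) for _ in range(n)]
        if proportion_statistic(x, y, beta) >= crit:
            hits += 1
    return hits / m


def draw_normal(rng):
    """One N(0, 1) observation."""
    return rng.gauss(0.0, 1.0)


rate = rejection_rate(draw_normal, n=50, beta=0.9, crit=1.645, m=2000)
print(f"simulated type I error: {rate:.4f}")
```

Swapping `draw_normal` for another sampler (for example a $t(10)$ or Laplace generator) gives the corresponding column of such a table.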

Unfortunately, (4.1) is not practically a level 0.05 test. We now, for each distribution, choose a constant $\ell$ such that (5.1) is approximately equal to 0.05, and then simulate the power of (5.1) under the following Case I and Case II distributions:

Case I: $X \sim N(0, 1)$ and $Y \sim 0.9\,N(0, 1) + 0.1\,(\chi^2(10) + \mu)$

Case II: $X \sim t(10)$ and $Y \sim 0.9\,t(10) + 0.1\,(\chi^2(10) + \mu)$

The results are displayed in Table 9 and Table 10.

Table 9. Power performance comparison by simulation (Case I)

Columns: $H_0$, $\mu = 2$, $\mu = 4$, $\mu = 6$. (Table body not recoverable from the source.)

Table 10. Power performance comparison by simulation (Case II)

Columns: $H_0$, $\mu = 2$, $\mu = 4$, $\mu = 6$. (Table body not recoverable from the source.)

The outlier mean and outlier proportion techniques are both powerful in these distributional settings. More interestingly, the outlier proportion is the more efficient method in this comparison.
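The calibration step used in this section, choosing a constant so that the simulated rejection probability is about 0.05, can be sketched as an empirical quantile of statistics simulated under the null hypothesis. The function name is hypothetical, and the null statistics below are stand-in normal draws rather than values of the actual statistic in (5.1). Because a proportion based statistic is discrete, the sketch counts ties explicitly so that the rejection fraction never exceeds the target level.

```python
import bisect
import math
import random


def calibrated_critical_value(null_stats, level):
    """Smallest observed value l with #{s >= l} / m <= level (math.inf if none)."""
    ordered = sorted(null_stats)
    m = len(ordered)
    allowed = math.floor(level * m + 1e-9)  # exceedances permitted at this level
    best = math.inf                         # returned when ties rule out every value
    for value in sorted(set(ordered), reverse=True):
        exceed = m - bisect.bisect_left(ordered, value)  # number of stats >= value
        if exceed > allowed:
            break
        best = value
    return best


# Stand-in null statistics (hypothetical): in practice these would be values of
# the statistic in (5.1) simulated with F_Y = F_X.
rng = random.Random(0)
null_stats = [rng.gauss(0.0, 1.3) for _ in range(10000)]
ell = calibrated_critical_value(null_stats, 0.05)
print(f"calibrated critical value: {ell:.3f}")
```

Power under an alternative would then be estimated as the fraction of statistics, simulated under that alternative, that are at least `ell`.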

6. Appendix

Three assumptions for the asymptotic representation of the sample outlier proportion test are as follows.

1. The limit $\lambda_{xy} = \lim_{n_1, n_2 \to \infty} n_2/n_1$ exists.

2. The probability density function $f_X$ of distribution $F_X$ is bounded away from zero in neighborhoods of $F_X^{-1}(\beta)$ for $\beta \in (0, 1)$ and of the population cutoff point $\eta$.

3. The probability density function $f_Y$ is bounded away from zero in a neighborhood of the population cutoff point $\eta$.

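Proof of Theorem 3.1.

From the expression of $\hat{\pi}_Y$ in (3.1), we have

$$n_2^{1/2}(\hat{\pi}_Y - \pi_Y) = -n_2^{-1/2}\sum_{i=1}^{n_2}\bigl[I(Y_i \le \eta + n_1^{-1/2}T_n) - I(Y_i \le \eta)\bigr] + n_2^{-1/2}\sum_{i=1}^{n_2}\bigl(I(Y_i \ge \eta) - \pi_Y\bigr), \qquad (6.1)$$

where

$$T_n = n_1^{1/2}(\hat{\eta} - \eta) = n_1^{1/2}\bigl[\bigl(2\hat{F}_X^{-1}(1-\beta) - \hat{F}_X^{-1}(\beta)\bigr) - \bigl(2F_X^{-1}(1-\beta) - F_X^{-1}(\beta)\bigr)\bigr].$$

With assumption (3), the key in this proof is that the first term on the right-hand side of (6.1) satisfies

$$-n_2^{-1/2}\sum_{i=1}^{n_2}\bigl[I(Y_i \le \eta + n_1^{-1/2}T_n) - I(Y_i \le \eta)\bigr] = -\lambda_{xy}^{1/2} f_Y(\eta)\,T_n + o_p(1), \qquad (6.2)$$

which may be seen in Ruppert and Carroll (1980) and Chen and Chiang (1996).

With the following representation of the empirical quantile,

$$\sqrt{n_1}\,\bigl(\hat{F}_X^{-1}(\beta) - F_X^{-1}(\beta)\bigr) = f_X^{-1}\bigl(F_X^{-1}(\beta)\bigr)\, n_1^{-1/2}\sum_{i=1}^{n_1}\bigl[\beta - I\bigl(X_i \le F_X^{-1}(\beta)\bigr)\bigr] + o_p(1) \qquad (6.3)$$

(see, for example, Ruppert and Carroll (1980)), a Bahadur representation of the outlier proportion follows from (6.1)-(6.3) as

$$n_2^{1/2}(\hat{\pi}_Y - \pi_Y) = n_1^{-1/2}\sum_{i=1}^{n_1}\Bigl[\bigl(\beta b_1 - (1-\beta) b_2\bigr)\, I\bigl(X_i \le F_X^{-1}(\beta)\bigr) + \beta\,(b_1 + b_2)\, I\bigl(F_X^{-1}(\beta) \le X_i \le F_X^{-1}(1-\beta)\bigr) + \bigl(-(1-\beta) b_1 + \beta b_2\bigr)\, I\bigl(X_i \ge F_X^{-1}(1-\beta)\bigr)\Bigr] + n_2^{-1/2}\sum_{i=1}^{n_2}\bigl[I(Y_i \ge \eta) - \pi_Y\bigr] + o_p(1).$$

The asymptotic distribution in Theorem 3.1 follows from the Central Limit Theorem.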
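The same argument, specialized to the single-quantile cutoff of Theorem 4.1, suggests where that theorem's variance comes from. The lines below are a sketch under stated assumptions rather than the authors' proof: the symbols $\eta = F_X^{-1}(\beta)$, $\pi_Y = P(Y \ge \eta)$ and $\lambda_{xy} = \lim n_2/n_1$ are notation assumed here, and the two samples are taken to be independent so that the variances of the two sums add.

```latex
% Sketch only: eta = F_X^{-1}(beta), pi_Y = P(Y >= eta), lambda_xy = lim n_2/n_1.
\begin{align*}
\sqrt{n_2}\,(\hat{\pi}_Y - \pi_Y)
  &= -\lambda_{xy}^{1/2} f_Y(\eta)\,\sqrt{n_1}\,(\hat{\eta} - \eta)
     + n_2^{-1/2}\sum_{i=1}^{n_2}\bigl[I(Y_i \ge \eta) - \pi_Y\bigr] + o_p(1), \\
\sqrt{n_1}\,(\hat{\eta} - \eta)
  &= f_X^{-1}(\eta)\, n_1^{-1/2}\sum_{i=1}^{n_1}\bigl[\beta - I(X_i \le \eta)\bigr] + o_p(1), \\
\operatorname{Var}\bigl(\beta - I(X_i \le \eta)\bigr) &= \beta(1-\beta), \qquad
\operatorname{Var}\bigl(I(Y_i \ge \eta)\bigr) = \pi_Y(1-\pi_Y), \\
\sigma_Y^2 &= \beta(1-\beta)\,\lambda_{xy}\, f_Y^2(\eta)\, f_X^{-2}(\eta) + \pi_Y(1-\pi_Y).
\end{align*}
```

The first term carries the sampling error of the estimated cutoff and the second the binomial error of the proportion itself, which is why density values enter only through the first term.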

References

Chen, L.-A. and Chiang, Y. C. (1996). Symmetric type quantile and trimmed means for location and linear regression model. Journal of Nonparametric Statistics, 7, 171-185.

Chen, L.-A., Chen, Dung-Tsa and Chan, Wenyaw. (2010). The p Value for the Outlier Sum in Differential Gene Expression Analysis. Biometrika, 97, 246-253.

Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75, 828-838.

Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics, 8, 2-8.

Tomlins, S. A., Rhodes, D. R., Perner, S., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 310, 644-648.

Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics, 8, 566-575.
