Statistical Inference for Categorized Variables and Its Application to a Trimmed Mean

By Dr. Lin-An Chen and Tzu-Yun Huang
Institute of Statistics, National Chiao Tung University, Hsinchu, Taiwan.

Academic year: 2021

Abstract

Considerable effort has been devoted to arguing that categorizing continuous variables can produce undesirable statistical inference properties. These arguments have not reduced its popularity in association research in epidemiology, where it is valued for the convenience it brings to presenting and interpreting results. To correct the untrustworthy statistical methods in common use, we initiate a theoretical study of the statistical effect of categorization, using parametric and nonparametric estimation of the unknown means of categorized variables. We show that the parametric sample mean is highly efficient, which explains the undesirable statistical properties of the classical methods. In nonparametric estimation of the population mean of a (noncategorized) variable, we prove that categorization creates auxiliary information that improves the efficiency of parameter estimation. This shows that the statistical community is far from understanding the statistical properties of categorization, and that the supplementary population information carried by the extra variable created by categorization deserves more attention in the literature.

Key words: Auxiliary information; auxiliary variable; categorization of continuous variable; estimation; trimmed mean.

1. Introduction

It is very common for researchers to be interested in assessing the relationship between a continuous outcome and explanatory variables (covariates). In contemporary epidemiologic practice, epidemiologists often find it appealing to convert continuous variables into categorical variables to facilitate data presentation, for example as low, medium and high risk groups. Pocock et al. (2004) reported that 84% of epidemiological papers from leading journals categorized continuous variables. Categorization of continuous variables is also widespread in other areas, for example in psychology (MacCallum et al. (2002)) and marketing (Irwin and McClelland (2003)). Investigators then often use the categorized samples to fit a regression model, or to analyze by a multiple comparison method whether successively higher categories are associated with increased risk of an outcome.

Categorization of continuous variables has been widely criticized for undesirable statistical properties, such as bias in estimation and power loss in hypothesis testing caused by loss of information (see, for example, Royston and Sauerbrei (2008), Taylor and Yu (2002), Walraven and Hart (2008) and Zhao and Kolonel (1992)). As observed by Han (2008), the assumptions of normality, independence and constant variance behind multiple comparison do not hold for these categorized variables (see also Bennette and Vickers (2012)). In our opinion, these undesirable properties cannot be avoided as long as theoretically untrustworthy inference methods are applied.

Because categorization still plays an important role, theoretically trustworthy statistical methods are needed to deal with categorized samples. We initiate this study by developing distributional theory for parametric and nonparametric categorized sample means. With a novel parametrization, the parametric estimator outperforms the classical sample means with very high efficiency. We also observe a surprising new result: the much-debated categorization creates desirable auxiliary information.

In statistical inference for unknown parameters of a variable's distribution, any extra variable measured in association with this variable and used to increase the accuracy of the inference is called an auxiliary variable. We show that a categorized trimmed mean for estimating the population mean of an uncategorized variable has an asymptotic variance that is not only smaller than that of the classical trimmed mean but also, more interestingly, smaller than the Cramér-Rao lower bound for this population mean. This is evidence that categorization creates auxiliary information. Little is known in the literature about how much categorization contributes to the accuracy of statistical parameter inference, especially when the auxiliary information created by categorization is used. This approach takes a first step toward a theory of categorization, but much remains for further investigation.

2. Nonparametric Categorized Sample Means

Let $Y$ and $X$ be continuous response and explanatory variables with joint probability density function (pdf) $f_{XY}(x, y)$. Consider cutoffs $-\infty < a_1 < a_2 < \cdots < a_{k-1}$ such that the intervals $A_1 = (-\infty, a_1]$, $A_2 = (a_1, a_2]$, ..., $A_k = (a_{k-1}, \infty)$ form a partition of the space of the variable $X$. The cutoffs $a_j$ may be known constants or, as is common in practice, unknown quantiles. Suppose that we have a random sample $(Y_1, X_1)', (Y_2, X_2)', \ldots, (Y_n, X_n)'$ from this underlying distribution. Epidemiologists often split the sample of $Y$ into the following categorized samples:

\[ \{Y_i : X_i \in A_1,\ i = 1, \ldots, n\},\ \ldots,\ \{Y_i : X_i \in A_k,\ i = 1, \ldots, n\}. \tag{2.1} \]

Classical statistical methods such as the t-test and the F-test, based on these categorized samples, are then applied for inference on the unknown population means $\theta_1 = E[Y \mid X \in A_1], \ldots, \theta_k = E[Y \mid X \in A_k]$, called the categorized means, in order to verify the relationship between the categorized variables. As observed by Han (2008), the categorized variables in the $k$ groups are no longer normal, independent and of constant variance, so these classical tests are theoretically incorrect.

Theoretically trustworthy inference methods may be developed from the distributional theory of parametric and nonparametric estimators. We consider nonparametric estimation in this section. For constant cutoffs, the averages

\[ \hat\theta_{cj} = \frac{\sum_{i=1}^{n} Y_i I(X_i \in A_j)}{\sum_{i=1}^{n} I(X_i \in A_j)}, \quad j = 1, \ldots, k, \tag{2.2} \]

called the categorized sample means, are applied in the classical ANOVA approach without a correct distributional theory. Here $c$ stands for constant cutoff. Let $\hat\theta_C = (\hat\theta_{c1}, \ldots, \hat\theta_{ck})'$ denote this nonparametric estimator of the vector of categorized group means $\theta = (\theta_1, \ldots, \theta_k)'$. The following theorem states its distributional theory.

Theorem 2.1. $n^{1/2}(\hat\theta_C - \theta)$ converges in distribution to the $k$-dimensional multivariate normal distribution $N_k(0_k, \Sigma_C)$, where

\[ \Sigma_C = \mathrm{Cov}\begin{pmatrix} p_1^{-1}(Y - \theta_1) I(X \le a_1) \\ p_2^{-1}(Y - \theta_2) I(X \in (a_1, a_2]) \\ \vdots \\ p_k^{-1}(Y - \theta_k) I(X > a_{k-1}) \end{pmatrix} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2), \]

with $p_j = P(X \in A_j)$, and where

\[ \sigma_j^2 = \frac{1}{P(X \in A_j)} \mathrm{Var}(Y \mid X \in A_j) \]

is the categorized variance.

If a choice is available in nonparametric estimation, we recommend constant cutoffs, since the estimator of the asymptotic covariance matrix is then simpler to establish. We define the estimator of the categorized variance $\sigma_j^2$ by

\[ S_j^2 = \frac{1}{\sum_{i=1}^{n} I(X_i \in A_j)} \sum_{i=1}^{n} (Y_i - \hat\theta_{cj})^2 I(X_i \in A_j), \quad j = 1, \ldots, k, \]

and call these the categorized sample variances. The sample matrix

\[ \hat\Sigma_C = \mathrm{diag}(S_1^2, S_2^2, \ldots, S_k^2) \]

then constitutes a consistent estimator of the unknown covariance matrix $\Sigma_C$. Its efficiency will be verified later.
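The estimators (2.2) and $S_j^2$ can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `categorized_means`, the demo parameters and the choice to raise an error on an empty group are all assumptions made here.

```python
import bisect
import random


def categorized_means(y, x, cutoffs):
    """Categorized sample means, variances and group sizes for constant cutoffs.

    Groups follow the text: A_1 = (-inf, a_1], A_j = (a_{j-1}, a_j],
    A_k = (a_{k-1}, inf). `cutoffs` must be sorted in increasing order.
    """
    groups = [[] for _ in range(len(cutoffs) + 1)]
    for yi, xi in zip(y, x):
        # bisect_left counts the cutoffs strictly below xi, so a value equal
        # to a_j falls in A_j (intervals are closed on the right).
        groups[bisect.bisect_left(cutoffs, xi)].append(yi)
    means, variances, counts = [], [], []
    for g in groups:
        if not g:
            raise ValueError("a group is empty; choose different cutoffs")
        m = sum(g) / len(g)
        means.append(m)
        variances.append(sum((v - m) ** 2 for v in g) / len(g))
        counts.append(len(g))
    return means, variances, counts


# Hypothetical demo: correlated standard normal pair, three groups.
rng = random.Random(1)
x_demo = [rng.gauss(0.0, 1.0) for _ in range(500)]
y_demo = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x_demo]
print(categorized_means(y_demo, x_demo, [-0.5, 0.5]))
```

The returned group sizes divided by $n$ give plug-in estimates of $P(X \in A_j)$, which is the quantity that enters the categorized variance in Theorem 2.1.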

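In many applications (Shankar et al. (2007), Luo et al. (2007) and Letenneur et al. (2007)), categorization is done on a quantile partition

\[ A_1 = (-\infty, F_X^{-1}(\alpha_1)],\ A_2 = (F_X^{-1}(\alpha_1), F_X^{-1}(\alpha_2)],\ \ldots,\ A_k = (F_X^{-1}(\alpha_{k-1}), \infty), \]

where $F_X^{-1}(\alpha)$ represents the $\alpha$-th population quantile of the random variable $X$. Frequently the quantiles $F_X^{-1}(\alpha_j)$ are unknown and are estimated by the empirical quantiles $\hat F_X^{-1}(\alpha_j)$, $j = 1, \ldots, k-1$, of the random sample $X_1, \ldots, X_n$. This leads to the sample partition

\[ \hat A_1 = (-\infty, \hat F_X^{-1}(\alpha_1)],\ \hat A_2 = (\hat F_X^{-1}(\alpha_1), \hat F_X^{-1}(\alpha_2)],\ \ldots,\ \hat A_k = (\hat F_X^{-1}(\alpha_{k-1}), \infty) \]

and the quantile-based categorized sample means

\[ \hat\theta_{qj} = \frac{\sum_{i=1}^{n} Y_i I(X_i \in \hat A_j)}{\sum_{i=1}^{n} I(X_i \in \hat A_j)}, \quad j = 1, \ldots, k, \tag{2.3} \]

where $q$ stands for quantile cutoff. We denote $\hat\theta_q = (\hat\theta_{q1}, \ldots, \hat\theta_{qk})'$ and $\lambda_j = E(Y - \mu_y \mid X = F_X^{-1}(\alpha_j))$. A representation and the asymptotic distribution of this categorized sample mean vector $\hat\theta_q$ are given below.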

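Theorem 2.2. (a) The quantile-cutoff-based categorized sample mean has the following Bahadur representation:

\[ \sqrt{n}(\hat\theta_q - \theta) = n^{-1/2} \sum_{i=1}^{n} \begin{pmatrix} \alpha_1^{-1}(\psi_1(Y_i, X_i) - E(\psi_1(Y, X))) \\ (\alpha_2 - \alpha_1)^{-1}(\psi_2(Y_i, X_i) - E(\psi_2(Y, X))) \\ \vdots \\ (1 - \alpha_{k-1})^{-1}(\psi_k(Y_i, X_i) - E(\psi_k(Y, X))) \end{pmatrix} + o_p(1), \]

where

\[ \psi_1(Y, X) = \begin{cases} Y - \mu_y & \text{if } X \le F_X^{-1}(\alpha_1) \\ \lambda_1 & \text{if } X > F_X^{-1}(\alpha_1) \end{cases}, \qquad \psi_k(Y, X) = \begin{cases} \lambda_{k-1} & \text{if } X \le F_X^{-1}(\alpha_{k-1}) \\ Y - \mu_y & \text{if } X > F_X^{-1}(\alpha_{k-1}) \end{cases} \]

and, for $j = 2, \ldots, k-1$,

\[ \psi_j(Y, X) = \begin{cases} \lambda_{j-1} & \text{if } X \le F_X^{-1}(\alpha_{j-1}) \\ Y - \mu_y & \text{if } F_X^{-1}(\alpha_{j-1}) < X < F_X^{-1}(\alpha_j) \\ \lambda_j & \text{if } X \ge F_X^{-1}(\alpha_j) \end{cases}. \]

(b) $\sqrt{n}(\hat\theta_q - \theta)$ is asymptotically normal with distribution $N_k(0_k, \Sigma_q)$, where the $k \times k$ matrix $\Sigma_q = (\sigma_{jm})$, $j, m = 1, \ldots, k$, has entries

\[ \sigma_{11} = \frac{1}{\alpha_1^2} \{ M_1 + (1 - \alpha_1)\lambda_1^2 - (m_1 + (1 - \alpha_1)\lambda_1)^2 \}, \]

\[ \sigma_{1j} = \frac{1}{\alpha_1(\alpha_j - \alpha_{j-1})} \{ \lambda_1\lambda_{j-1}(\alpha_{j-1} - \alpha_1) + \lambda_1 m_j + \lambda_1\lambda_j(1 - \alpha_j) + \lambda_{j-1} m_1 - (m_1 + (1 - \alpha_1)\lambda_1)(\alpha_{j-1}\lambda_{j-1} + (1 - \alpha_j)\lambda_j + m_j) \}, \]

\[ \sigma_{1k} = \frac{1}{\alpha_1(1 - \alpha_{k-1})} \{ \lambda_1\lambda_{k-1}(\alpha_{k-1} - \alpha_1) + \lambda_1 m_k + \lambda_{k-1} m_1 - (m_1 + (1 - \alpha_1)\lambda_1)(\alpha_{k-1}\lambda_{k-1} + m_k) \}, \]

and, for $j = 2, \ldots, k-1$,

\[ \sigma_{jj} = \frac{1}{(\alpha_j - \alpha_{j-1})^2} \{ \alpha_{j-1}\lambda_{j-1}^2 + (1 - \alpha_j)\lambda_j^2 + M_j - (\alpha_{j-1}\lambda_{j-1} + (1 - \alpha_j)\lambda_j + m_j)^2 \}, \]

\[ \sigma_{j,j+1} = \frac{1}{(\alpha_{j+1} - \alpha_j)(\alpha_j - \alpha_{j-1})} \{ \lambda_j(\lambda_{j-1}\alpha_{j-1} + m_j + m_{j+1} + \lambda_{j+1}(1 - \alpha_{j+1})) - (\alpha_{j-1}\lambda_{j-1} + (1 - \alpha_j)\lambda_j + m_j)(\alpha_j\lambda_j + (1 - \alpha_{j+1})\lambda_{j+1} + m_{j+1}) \}, \]

\[ \sigma_{jm} = \frac{1}{(\alpha_j - \alpha_{j-1})(\alpha_m - \alpha_{m-1})} \{ \lambda_{m-1}(\lambda_{j-1}\alpha_{j-1} + m_j + \lambda_j(\alpha_{m-1} - \alpha_j)) + \lambda_j(m_m + \lambda_m(1 - \alpha_m)) - (\alpha_{j-1}\lambda_{j-1} + (1 - \alpha_j)\lambda_j + m_j)(\alpha_{m-1}\lambda_{m-1} + (1 - \alpha_m)\lambda_m + m_m) \}, \quad m = j+2, \ldots, k-1, \]

\[ \sigma_{jk} = \frac{1}{(\alpha_j - \alpha_{j-1})(1 - \alpha_{k-1})} \{ \lambda_{k-1}(\lambda_{j-1}\alpha_{j-1} + m_j + \lambda_j(\alpha_{k-1} - \alpha_j)) + \lambda_j m_k - (\alpha_{j-1}\lambda_{j-1} + (1 - \alpha_j)\lambda_j + m_j)(\alpha_{k-1}\lambda_{k-1} + m_k) \}, \]

and

\[ \sigma_{kk} = \frac{1}{(1 - \alpha_{k-1})^2} \{ \alpha_{k-1}\lambda_{k-1}^2 + M_k - (\alpha_{k-1}\lambda_{k-1} + m_k)^2 \}, \]

where $m_j = E[(Y - \mu_y) I(X \in A_j)]$ and $M_j = E[(Y - \mu_y)^2 I(X \in A_j)]$, $j = 1, \ldots, k$, denote the first and second central group moments of the categorized variable $Y$ in the $j$th group.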

This theorem generalizes the theory of the univariate robust trimmed mean and the outlier mean of Chen, Chen and Chan (2010) to the vector case.

Theorems 2.1 and 2.2 provide a basis for theoretically correct nonparametric inference on the unknown parameters $\theta$.
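As a minimal sketch of the quantile-based categorized sample means (2.3), the Python below uses the inverse empirical distribution function as the empirical quantile. That quantile convention, and the names `empirical_quantile` and `quantile_categorized_means`, are assumptions made for illustration; the paper does not fix a convention.

```python
import math


def empirical_quantile(sample, alpha):
    """Inverse empirical CDF: the ceil(n * alpha)-th order statistic."""
    s = sorted(sample)
    return s[max(math.ceil(len(s) * alpha) - 1, 0)]


def quantile_categorized_means(y, x, alphas):
    """Group means of y over the sample partition built from quantiles of x.

    `alphas` is the increasing list (alpha_1, ..., alpha_{k-1}).
    Assumes every group ends up non-empty.
    """
    cutoffs = [empirical_quantile(x, a) for a in alphas]
    k = len(alphas) + 1
    sums, counts = [0.0] * k, [0] * k
    for yi, xi in zip(y, x):
        j = sum(xi > c for c in cutoffs)  # number of cutoffs strictly below xi
        sums[j] += yi
        counts[j] += 1
    return [s / c for s, c in zip(sums, counts)], counts


# Hypothetical usage with quartile cutoffs on a tiny data set.
x_small = [1, 2, 3, 4, 5, 6, 7, 8]
y_small = [10 * v for v in x_small]
print(quantile_categorized_means(y_small, x_small, [0.25, 0.5, 0.75]))
```

With quartile cutoffs each group holds roughly a quarter of the sample, which is the design used in the four-group simulations later in the paper.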

3. Parametric Categorized Sample Means

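We consider the parametric approach under the normality assumption

\[ \begin{pmatrix} Y \\ X \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} \sigma_y^2 & \sigma_{xy} \\ \sigma_{yx} & \sigma_x^2 \end{pmatrix} \right). \tag{3.1} \]

In this normal setting, we fix an ordering of the distributional parameters as $\Lambda = (\mu_x, \mu_y, \sigma_y^2, \sigma_x^2, \sigma_{yx})$ and, for simplicity of presentation, consider only quantile cutoffs. Given $0 < \alpha_1 < \alpha_2 < \cdots < \alpha_{k-1} < 1$, the unknown quantiles under the normality assumption are $a_j = F_X^{-1}(\alpha_j) = \mu_x + z_{\alpha_j}\sigma_x$, $j = 1, \ldots, k-1$. The following theorem is a simplified result from Han (2008).

Theorem 3.1. Consider quantile cutoffs and the normal assumption (3.1). The population categorized means form a vector $\theta_p = (\theta_{1p}, \ldots, \theta_{kp})$ with

\[ \theta_{1p} = \mu_y - \frac{\sigma_{yx}}{\alpha_1\sigma_x}\phi(z_{\alpha_1}), \]

\[ \theta_{jp} = \mu_y - \frac{\sigma_{yx}}{\sigma_x(\alpha_j - \alpha_{j-1})}(\phi(z_{\alpha_j}) - \phi(z_{\alpha_{j-1}})), \quad j = 2, \ldots, k-1, \]

\[ \theta_{kp} = \mu_y + \frac{\sigma_{yx}}{\sigma_x(1 - \alpha_{k-1})}\phi(z_{\alpha_{k-1}}), \]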

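where $\phi$ is the probability density function (pdf) of the standard normal distribution $N(0, 1)$.

Suppose that we have a random sample $(Y_1, X_1)', (Y_2, X_2)', \ldots, (Y_n, X_n)'$. Denoting $\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$, $\bar Y = \frac{1}{n}\sum_{i=1}^{n} Y_i$, $S_Y^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar Y)^2$, $S_X^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^2$ and $S_{YX} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)$, the maximum likelihood estimator (mle) of the parameter vector $\Lambda$ is

\[ \hat\Lambda_{mle} = (\bar X, \bar Y, S_Y^2, S_X^2, S_{YX}) \tag{3.2} \]

and the mle's of the categorized means are

\[ \hat\theta_{1p} = \bar Y - \frac{s_{yx}}{\alpha_1 s_x}\phi(z_{\alpha_1}), \]

\[ \hat\theta_{jp} = \bar Y - \frac{s_{yx}}{s_x(\alpha_j - \alpha_{j-1})}(\phi(z_{\alpha_j}) - \phi(z_{\alpha_{j-1}})), \quad j = 2, \ldots, k-1, \]

\[ \hat\theta_{kp} = \bar Y + \frac{s_{yx}}{s_x(1 - \alpha_{k-1})}\phi(z_{\alpha_{k-1}}), \]

which form the vector $\hat\theta_p = (\hat\theta_{1p}, \ldots, \hat\theta_{kp})'$.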
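The plug-in estimators above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `parametric_categorized_means` is hypothetical, and the three formulas are written as one expression by setting $\alpha_0 = 0$, $\alpha_k = 1$ and $\phi(z_{\alpha_0}) = \phi(z_{\alpha_k}) = 0$, which reproduces the first and last cases term by term.

```python
import math
from statistics import NormalDist


def parametric_categorized_means(y, x, alphas):
    """Plug-in (mle) estimates of the categorized means under bivariate normality.

    `alphas` is the increasing list (alpha_1, ..., alpha_{k-1}) of quantile levels.
    """
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x) / n                      # S_X^2
    s_yx = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n  # S_YX
    slope = s_yx / math.sqrt(s_xx)                                    # s_yx / s_x
    std = NormalDist()
    probs = [0.0] + list(alphas) + [1.0]
    # phi(z_alpha) vanishes at the infinite end points alpha = 0 and alpha = 1.
    dens = [0.0] + [std.pdf(std.inv_cdf(a)) for a in alphas] + [0.0]
    return [ybar - slope * (dens[j] - dens[j - 1]) / (probs[j] - probs[j - 1])
            for j in range(1, len(probs))]


# Hypothetical usage: two groups split at the median of X.
print(parametric_categorized_means([-1.0, 2.0, 5.0], [-1.0, 0.0, 1.0], [0.5]))
```

A quick sanity check on any output is that the group estimates, weighted by the group probabilities $\alpha_j - \alpha_{j-1}$, average back to $\bar Y$, because the density differences telescope to zero.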

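Theorem 3.2. (a) $n^{1/2}(\hat\theta_p - \theta_p)$ converges in distribution to a $k$-dimensional normal distribution with mean zero and covariance matrix $\Sigma_p = \Gamma(\Lambda) V_p(\Lambda) \Gamma(\Lambda)'$, where $\Gamma(\Lambda) = \partial\theta_p(\Lambda)/\partial\Lambda$ is the partial derivative of $\theta_p(\Lambda)$ with respect to $\Lambda$ and

\[ V_p(\Lambda) = -\left[ E\, \frac{\partial^2 \ln \phi_N(X, Y)}{\partial\Lambda\, \partial\Lambda'} \right]^{-1} \]

is the Cramér-Rao lower bound for $\Lambda$, with $\phi_N(X, Y)$ the pdf of the normal distribution in (3.1).

(b) For quantile-based cutoffs, the partial derivative matrix under the normal distribution is $\Gamma(\Lambda) = (\gamma_{ij})_{i=1,\ldots,k,\ j=1,\ldots,5}$ with

\[ \gamma_{11} = -\frac{\sigma_{xy}}{\sigma_x^2} \frac{\phi(z_{\alpha_1})}{\alpha_1} \left( \frac{\phi(z_{\alpha_1})}{\alpha_1} + z_{\alpha_1} \right), \quad \gamma_{12} = 1, \]

\[ \gamma_{13} = -\frac{\sigma_{xy}}{2\sigma_x^3} \frac{\phi(z_{\alpha_1})}{\alpha_1} \left( \frac{z_{\alpha_1}\phi(z_{\alpha_1})}{\alpha_1} + z_{\alpha_1}^2 - 1 \right), \quad \gamma_{14} = 0, \quad \gamma_{15} = -\frac{1}{\sigma_x} \frac{\phi(z_{\alpha_1})}{\alpha_1}; \]

for $j = 2, \ldots, k-1$,

\[ \gamma_{j1} = -\frac{\sigma_{xy}}{\sigma_x^2} \frac{1}{\alpha_j - \alpha_{j-1}} \left( \frac{(\phi(z_{\alpha_j}) - \phi(z_{\alpha_{j-1}}))^2}{\alpha_j - \alpha_{j-1}} + (z_{\alpha_j}\phi(z_{\alpha_j}) - z_{\alpha_{j-1}}\phi(z_{\alpha_{j-1}})) \right), \quad \gamma_{j2} = 1, \]

\[ \gamma_{j3} = -\frac{\sigma_{xy}}{2\sigma_x^3} \frac{1}{\alpha_j - \alpha_{j-1}} \left[ \left( \frac{z_{\alpha_j}\phi(z_{\alpha_j}) - z_{\alpha_{j-1}}\phi(z_{\alpha_{j-1}})}{\alpha_j - \alpha_{j-1}} - 1 \right) (\phi(z_{\alpha_j}) - \phi(z_{\alpha_{j-1}})) + (z_{\alpha_j}^2\phi(z_{\alpha_j}) - z_{\alpha_{j-1}}^2\phi(z_{\alpha_{j-1}})) \right], \]

\[ \gamma_{j4} = 0, \quad \gamma_{j5} = -\frac{1}{\sigma_x} \frac{\phi(z_{\alpha_j}) - \phi(z_{\alpha_{j-1}})}{\alpha_j - \alpha_{j-1}}; \]

and

\[ \gamma_{k1} = \frac{\sigma_{xy}}{\sigma_x^2} \frac{\phi(z_{\alpha_{k-1}})}{1 - \alpha_{k-1}} \left( z_{\alpha_{k-1}} - \frac{\phi(z_{\alpha_{k-1}})}{1 - \alpha_{k-1}} \right), \quad \gamma_{k2} = 1, \]

\[ \gamma_{k3} = \frac{\sigma_{xy}}{2\sigma_x^3} \frac{\phi(z_{\alpha_{k-1}})}{1 - \alpha_{k-1}} \left( z_{\alpha_{k-1}}^2 - \frac{z_{\alpha_{k-1}}\phi(z_{\alpha_{k-1}})}{1 - \alpha_{k-1}} - 1 \right), \quad \gamma_{k4} = 0, \quad \gamma_{k5} = \frac{1}{\sigma_x} \frac{\phi(z_{\alpha_{k-1}})}{1 - \alpha_{k-1}}. \]

(c) Defining the $2 \times 2$ matrix

\[ A = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix} \]

and the $3 \times 3$ matrix

\[ B = \begin{pmatrix} 2\sigma_x^4 & 2\sigma_{xy}^2 & 2\sigma_x^2\sigma_{xy} \\ 2\sigma_{xy}^2 & 2\sigma_y^4 & 2\sigma_y^2\sigma_{xy} \\ 2\sigma_x^2\sigma_{xy} & 2\sigma_y^2\sigma_{xy} & \sigma_x^2\sigma_y^2 + \sigma_{xy}^2 \end{pmatrix}, \]

the Cramér-Rao lower bound for the bivariate normal parameter vector $\Lambda$ is

\[ V_p(\Lambda) = -\left[ E\, \frac{\partial^2 \ln \phi_N(X, Y)}{\partial\Lambda\, \partial\Lambda'} \right]^{-1} = \begin{pmatrix} A & 0_{2 \times 3} \\ 0_{3 \times 2} & B \end{pmatrix}. \]

Further parametric statistical inference for the categorized means $\theta_p$ can be constructed with the mle of the asymptotic covariance matrix, $\hat\Sigma_p = \Gamma(\hat\Lambda_{mle}) V_p(\hat\Lambda_{mle}) \Gamma(\hat\Lambda_{mle})'$.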

We do not investigate here the accuracy of the theoretically correct inference methods constructed from the parametric and nonparametric categorized sample means. Instead, we compare the accuracy of the two estimation methods. We first compare their asymptotic covariance matrices by evaluating the traces of the covariance matrices $\Gamma_{pq}(\Lambda) V_p(\Lambda) \Gamma_{pq}'(\Lambda)$ and $\Sigma_c$, and compute the relative efficiency of the nonparametric estimator of the categorized group means as

\[ \mathrm{eff}_N = \frac{\min\{ \mathrm{tr}(\Gamma_{pq}(\Lambda) V_p(\Lambda) \Gamma_{pq}'(\Lambda)),\ \mathrm{tr}(\Sigma_c) \}}{\mathrm{tr}(\Sigma_c)}. \]

Considering

\[ \begin{pmatrix} Y \\ X \end{pmatrix} \sim N_2\left( \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} \sigma_y^2 & \sigma_{yx} \\ \sigma_{yx} & \sigma_x^2 \end{pmatrix} \right), \]

the efficiency values $\mathrm{eff}_N$ under several parameter values are displayed in Table 1.

Table 1. Efficiencies of nonparametric estimator of categorized group means

                            σyx = 0.2    0.3      0.5      0.7      0.8
(σ²y, σ²x) = (1, 1)         0.556        0.599    0.672    0.701    0.675
(σ²y, σ²x) = (2, 1)         0.530        0.561    0.621    0.671    0.689
(σ²y, σ²x) = (1, 2)         0.530        0.561    0.621    0.670    0.689

The low values of $\mathrm{eff}_N$ support parametric estimation of the unknown categorized means when the underlying distribution is known. Since the nonparametric categorized sample mean is the method used in classical ANOVA analysis, parametric estimation based on the new parametrization of the unknown categorized means in Theorem 3.1 deserves attention, both in application and in the construction of a new ANOVA approach to multiple comparison of categorized means.

Setting the normal distribution

\[ N_2\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} \sigma_y^2 & \sigma_{yx} \\ \sigma_{yx} & \sigma_x^2 \end{pmatrix} \right), \]

we compare $\hat\Sigma_p(\Lambda)$ and $\hat\Sigma_C$ through simulation for their efficiency in estimating the common matrix $\Sigma$. Let the categorized sample variance matrix at the $j$th replication be denoted by $S^j = (s^j_{i\ell})_{i,\ell=1,\ldots,k}$ and the true covariance matrix by $\Sigma = (\sigma_{i\ell})_{i,\ell=1,\ldots,k}$. We define the mean squared error (MSE) by

\[ MSE = \frac{1}{mk^2} \sum_{j=1}^{m} \sum_{\ell=1}^{k} \sum_{i=1}^{k} (s^j_{i\ell} - \sigma_{i\ell})^2. \]

We denote the MSE's for the nonparametric and parametric estimators by $MSE_{np}$ and $MSE_p$, respectively. With $m = 10{,}000$ replications, sample sizes $n = 50$ and $100$, and several values of the variances $\sigma_y^2$ and $\sigma_x^2$, the two MSE's are displayed in Table 2.

Table 2. MSE's for parametric estimator of asymptotic covariance matrix ΓVpΓ' (4 groups)

Sample size                   σyx = 0.2    0.3      0.5      0.7      0.8
(σ²x, σ²y) = (1, 1)
  n = 50     MSEnp            4.441        4.422    3.647    2.194    1.338
             MSEp             0.079        0.074    0.049    0.023    0.012
  n = 100    MSEnp            2.208        2.322    2.148    1.323    0.800
             MSEp             0.040        0.036    0.024    0.011    0.006
(σ²x, σ²y) = (2, 1)
  n = 50     MSEnp            4.725        4.559    4.528    3.673    3.572
             MSEp             0.083        0.081    0.067    0.050    0.041
  n = 100    MSEnp            2.153        2.240    2.376    2.350    2.293
             MSEp             0.042        0.040    0.034    0.025    0.020
(σ²x, σ²y) = (1, 2)
  n = 50     MSEnp            18.61        18.41    17.41    14.86    13.18
             MSEp             0.339        0.312    0.268    0.201    0.164
  n = 100    MSEnp            8.649        8.990    9.321    8.627    7.800
             MSEp             0.165        0.157    0.134    0.103    0.081

The simulated results show that the estimator of the asymptotic covariance matrix of the parametric categorized sample means is much more efficient than its nonparametric counterpart.

We next consider a simulation study to verify the finite sample efficiencies of the parametric and nonparametric estimators of the parameter vector $\theta_p$ when $Y$ and $X$ have a joint normal distribution. Denoting by $\hat\theta_N^j$ and $\hat\theta_p^j$, respectively, the nonparametric and parametric estimates of $\theta$ at the $j$th replication, we compute the following MSE's:

\[ MSE_N = \frac{1}{m} \sum_{j=1}^{m} (\hat\theta_N^j - \theta_p)'(\hat\theta_N^j - \theta_p), \qquad MSE_p = \frac{1}{m} \sum_{j=1}^{m} (\hat\theta_p^j - \theta_p)'(\hat\theta_p^j - \theta_p). \]

The simulated results are displayed in Table 3, where the number of categories is 4.

Table 3. MSE's for parametric and nonparametric estimations

Sample size           σyx = 0.2    0.3      0.5      0.7      0.8
n = 30     MSEN       0.145        0.137    0.118    0.087    0.067
           MSEp       0.063        0.059    0.051    0.036    0.027
n = 50     MSEN       0.082        0.079    0.066    0.049    0.038
           MSEp       0.036        0.034    0.029    0.021    0.016
n = 100    MSEN       0.039        0.038    0.032    0.023    0.018
           MSEp       0.018        0.017    0.014    0.010    0.007

We see that the $MSE_p$'s are all smaller than the corresponding $MSE_N$'s, which supports our previous observation that parametric estimation is superior for the population categorized means.
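A small-scale version of this comparison can be sketched in Python. Everything specific here is an assumption for illustration: unit variances with means equal to 1, quartile cutoffs, 300 replications instead of 10,000, the inverse-empirical-CDF convention for sample quantiles, and the function names. The sketch is not expected to reproduce the values in Table 3.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()


def true_means(mu_y, s_yx, s_x, alphas):
    """Categorized means of Theorem 3.1, with phi = 0 at the end points."""
    probs = [0.0] + list(alphas) + [1.0]
    dens = [0.0] + [STD.pdf(STD.inv_cdf(a)) for a in alphas] + [0.0]
    return [mu_y - (s_yx / s_x) * (dens[j] - dens[j - 1]) / (probs[j] - probs[j - 1])
            for j in range(1, len(probs))]


def nonparametric(y, x, alphas):
    """Quantile-based categorized sample means (2.3), computed by sorting on x."""
    ys = [yi for _, yi in sorted(zip(x, y))]
    n = len(ys)
    cuts = [0] + [math.ceil(n * a) for a in alphas] + [n]
    return [sum(ys[cuts[j - 1]:cuts[j]]) / (cuts[j] - cuts[j - 1])
            for j in range(1, len(cuts))]


def parametric(y, x, alphas):
    """Plug-in estimates: sample moments substituted into Theorem 3.1."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x) / n
    s_yx = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
    return true_means(ybar, s_yx, math.sqrt(s_xx), alphas)


def simulate(m, n, rho, alphas, seed=0):
    """Return (MSE_N, MSE_p) over m replications of size n."""
    rng = random.Random(seed)
    target = true_means(1.0, rho, 1.0, alphas)
    noise_sd = math.sqrt(1.0 - rho ** 2)
    se_n = se_p = 0.0
    for _ in range(m):
        x = [rng.gauss(1.0, 1.0) for _ in range(n)]
        y = [1.0 + rho * (xi - 1.0) + noise_sd * rng.gauss(0.0, 1.0) for xi in x]
        se_n += sum((a - b) ** 2 for a, b in zip(nonparametric(y, x, alphas), target))
        se_p += sum((a - b) ** 2 for a, b in zip(parametric(y, x, alphas), target))
    return se_n / m, se_p / m


mse_n, mse_p = simulate(m=300, n=50, rho=0.5, alphas=[0.25, 0.5, 0.75])
print(f"MSE_N = {mse_n:.3f}, MSE_p = {mse_p:.3f}")
```

Because the data are generated from the bivariate normal model, the parametric estimator is correctly specified here; the comparison says nothing about its behaviour under misspecification.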

Let $\hat\theta$ be an estimator of the categorized group mean $\theta$ such that $n^{1/2}(\hat\theta - \theta)$ converges in distribution to the normal vector $N_k(0_k, \Sigma)$. Suppose that a consistent estimator $\hat\Sigma$ of $\Sigma$ is available. Then the quantity

\[ T = n(\hat\theta - \theta)'\hat\Sigma^{-1}(\hat\theta - \theta) \tag{4.1} \]

converges in distribution to the chi-square distribution $\chi^2(k)$ with $k$ degrees of freedom as the sample size $n$ goes to infinity. Theoretically correct inference methods, such as a confidence band and a test of the general linear hypothesis $H_0: A\theta = 0$ versus $H_1: A\theta \ne 0$, may be constructed from a chi-square quantity.
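For the nonparametric estimator with constant cutoffs, the quantity (4.1) can be evaluated by hand, which the sketch below illustrates for a simple null hypothesis $\theta = \theta_0$. It rests on assumptions that are not in the paper: the diagonal form $\sigma_j^2 = \mathrm{Var}(Y \mid X \in A_j)/P(X \in A_j)$ from Theorem 2.1 is estimated by plugging in the within-group sample variance $v_j$ and the group proportion $n_j/n$, so that $T = \sum_j n_j(\hat\theta_j - \theta_{0j})^2 / v_j$. The function names and the small table of 95% chi-square critical values are likewise illustrative.

```python
# Upper 5% critical values of the chi-square distribution, by degrees of freedom.
CHI2_CRIT_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_statistic(means, variances, counts, theta0):
    """T of (4.1) with a diagonal plug-in covariance estimate.

    n * (m_j - t_j)^2 / (v_j / (n_j / n)) simplifies to n_j * (m_j - t_j)^2 / v_j.
    """
    return sum(c * (m - t) ** 2 / v
               for m, v, c, t in zip(means, variances, counts, theta0))


def reject_at_5_percent(means, variances, counts, theta0):
    """Compare T with the chi-square(k) critical value, k = number of groups."""
    t_stat = chi_square_statistic(means, variances, counts, theta0)
    return t_stat > CHI2_CRIT_95[len(means)]


# Hypothetical group summaries for k = 2 groups.
print(chi_square_statistic([1.0, 2.0], [4.0, 1.0], [16, 25], [0.5, 2.0]))
print(reject_at_5_percent([1.0, 2.0], [4.0, 1.0], [16, 25], [0.5, 2.0]))
```

Testing a general linear hypothesis $A\theta = 0$ would need the matrix form of the statistic and is not covered by this sketch.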

One way to investigate these inference methods is to compare $T$-like quantities across the different approaches. Let the areas of the estimated region $(\hat\theta - \theta)'\hat\Sigma^{-1}(\hat\theta - \theta)$ for the parametric and nonparametric versions be denoted by $\hat A_p$ and $\hat A_N$. We denote their MSE's by

\[ MSE_p = \frac{1}{m} \sum_{j=1}^{m} (\hat A_p^{(j)} - A_p)^2, \qquad MSE_N = \frac{1}{m} \sum_{j=1}^{m} (\hat A_N^{(j)} - A_N)^2, \]

where $(j)$ refers to the $j$th replication. With $m = 10{,}000$ and 2 categories, we display the simulated results in Table 4.

Table 4.

MSE                                      σyx = 0.2    0.3      0.5      0.7      0.8
(σ²y, σ²x) = (1, 1)
  n = 50     MSEN                        10.35        7.016    3.462    2.181    2.096
             MSEp                        1.798        1.643    1.185    0.628    0.387
  n = 100    MSEN                        9.213        6.103    2.793    1.778    1.746
             MSEp                        0.876        0.822    0.582    0.314    0.198
(σ²y, σ²x) = (1, 2)
  n = 50     MSEN                        12.97        9.929    5.654    3.295    2.562
             MSEp                        1.836        1.757    1.539    1.162    0.976
  n = 100    MSEN                        11.70        8.800    2.560    2.675    2.052
             MSEp                        0.923        0.851    0.993    0.578    0.493
(σ²y, σ²x) = (2, 1)
  n = 50     MSEN                        51.96        39.67    22.69    14.13    11.34
             MSEp                        7.375        7.084    5.977    4.724    4.048
  n = 100    MSEN                        46.82        35.00    19.49    11.31    9.184
             MSEp                        3.795        3.577    2.998    2.378    2.010

Accuracy in estimating the unknown parameters gives the parametric categorized sample means the advantage of a smaller area of interest. This is another desirable property of parametric estimation of the unknown categorized means.

In summary, the attractive properties of parametric estimation shown above stem from the new parametrization in Theorem 3.1.

5. Categorization Creating Auxiliary Information

In this section, we show that categorization is linked to a theory that is very important in efficient estimation. Statisticians have long been interested in inference methods that may improve accuracy. Let $Y_1, \ldots, Y_n$ be a random sample from a density function $f(y, \theta_y)$, with $\theta_y$ the parameter of interest. We know that, under regularity conditions, the Cramér-Rao theory leaves no room for improving on a uniformly minimum variance unbiased estimator. Researchers then turned to looking for an estimator sequence $\{\hat\theta_y\}$ that is asymptotically normal,

\[ n^{1/2}(\hat\theta_y - \theta_y) \to N(0, v_{\theta_y}) \tag{5.1} \]

in distribution, and that has a superefficient point $\theta_y$ in the parameter space, in the sense that

\[ v_{\theta_y} < I(\theta_y)^{-1}, \tag{5.2} \]

where $I(\theta_y)$ is the Fisher information at $\theta_y$. In 1951, Hodges produced an estimator (Bickel et al. (1998)) with one superefficient point. Later, Le Cam (1953) showed that for any sequence of estimators satisfying (5.1), the set of superefficient points has Lebesgue measure zero. This tells us that superefficient estimators are of theoretical interest only and are not useful in practice.

Of interest in both theory and practice, then, is the search for a statistic that contains auxiliary information and so improves the accuracy of inference. Verifying the existence of auxiliary information has received some attention in the literature; see, for example, Kuk and Mak (1989), Rao, Kovar and Mantel (1990) and Martinez-Miranda, Rueda and Arcos (2007) for quantile estimation, and Srivastava (1971) for mean estimation. We prove that categorization contributes to this improvement. In robust estimation of the population mean $\mu_y$, the classical trimmed mean based on the random sample $Y_1, \ldots, Y_n$ is defined as

\[ \hat\mu_t(\alpha_1, \alpha_2) = \frac{\sum_{i=1}^{n} Y_i I(\hat F_Y^{-1}(\alpha_1) \le Y_i \le \hat F_Y^{-1}(\alpha_2))}{\sum_{i=1}^{n} I(\hat F_Y^{-1}(\alpha_1) \le Y_i \le \hat F_Y^{-1}(\alpha_2))}. \tag{5.3} \]

Now suppose that, as in our design for categorization, we also have an extra random sample $X_1, \ldots, X_n$ with $Y_i$ and $X_i$ correlated. For $0 < \alpha_1 < \alpha_2 < 1$, we call the categorized sample mean

\[ \hat\mu_{y,cat}(\alpha_1, \alpha_2) = \frac{\sum_{i=1}^{n} Y_i I(\hat F_X^{-1}(\alpha_1) \le X_i \le \hat F_X^{-1}(\alpha_2))}{\sum_{i=1}^{n} I(\hat F_X^{-1}(\alpha_1) \le X_i \le \hat F_X^{-1}(\alpha_2))} \tag{5.4} \]

the categorized trimmed mean. This is the first example of estimating a distributional parameter of an uncategorized variable with an estimator based on a categorized sample. We prove that categorization creates auxiliary information for robust estimation.
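The two estimators (5.3) and (5.4) differ only in which variable decides the trimming, as the following sketch shows. The empirical quantile is taken as the inverse empirical distribution function, which is an assumption made here (the paper does not fix a convention), and the function names are hypothetical.

```python
import math


def empirical_quantile(sample, alpha):
    """Inverse empirical CDF: the ceil(n * alpha)-th order statistic."""
    s = sorted(sample)
    return s[max(math.ceil(len(s) * alpha) - 1, 0)]


def trimmed_mean(y, alpha1, alpha2):
    """Classical trimmed mean (5.3): keep Y_i between the quantiles of Y."""
    lo, hi = empirical_quantile(y, alpha1), empirical_quantile(y, alpha2)
    kept = [v for v in y if lo <= v <= hi]
    return sum(kept) / len(kept)


def categorized_trimmed_mean(y, x, alpha1, alpha2):
    """Categorized trimmed mean (5.4): keep Y_i whose X_i lies between the quantiles of X."""
    lo, hi = empirical_quantile(x, alpha1), empirical_quantile(x, alpha2)
    kept = [yi for yi, xi in zip(y, x) if lo <= xi <= hi]
    return sum(kept) / len(kept)


# Hypothetical data: X is a decreasing function of Y.
y_demo = [1, 2, 3, 4, 5, 6, 7, 8]
x_demo = [9 - v for v in y_demo]
print(trimmed_mean(y_demo, 0.25, 0.75))
print(categorized_trimmed_mean(y_demo, x_demo, 0.25, 0.75))
```

When $X$ is an increasing function of $Y$ the two estimators coincide, so any gain from (5.4) has to come from information in $X$ that the ordering of $Y$ alone does not carry.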

The following theorem with α1 = 1 − α2 = α is a direct result from


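Theorem 5.1. Suppose that the joint distribution of $Y$ and $X$ is spherically symmetric.

(a) The Bahadur representation for the categorized trimmed mean is

\[ \sqrt{n}(\hat\mu_{y,cat}(\alpha, 1 - \alpha) - \mu_y) = \frac{1}{1 - 2\alpha}\, n^{-1/2} \sum_{i=1}^{n} \psi_0(Y_i, X_i) + o_p(1), \]

where

\[ \psi_0(Y, X) = \begin{cases} -\lambda_{1-\alpha} & \text{if } X \le F_X^{-1}(\alpha) \\ Y - \mu_y & \text{if } F_X^{-1}(\alpha) < X < F_X^{-1}(1 - \alpha) \\ \lambda_{1-\alpha} & \text{if } X \ge F_X^{-1}(1 - \alpha) \end{cases}. \]

(b) $\sqrt{n}(\hat\mu_{y,cat}(\alpha, 1 - \alpha) - \mu_y)$ is asymptotically normal with distribution $N(0, \sigma_{cat}^2)$, where

\[ \sigma_{cat}^2 = \frac{1}{(1 - 2\alpha)^2} [2\alpha\lambda_{1-\alpha}^2 + M_\alpha] \]

and $M_\alpha = E[(Y - \mu_y)^2 I(F_X^{-1}(\alpha) < X < F_X^{-1}(1 - \alpha))]$.

We also denote the asymptotic variance of the classical trimmed mean of (5.3) by $\sigma_t^2$ (Ruppert and Carroll (1980)).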
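To see the formula for $\sigma_{cat}^2$ at work, the sketch below evaluates it in closed form for a standardized bivariate normal pair with correlation $\rho$. This setting is an assumption chosen here for illustration; it is symmetric about its mean, so $E\psi_0 = 0$, but it is not one of the mixed distributions studied in the paper. Writing $z = z_{1-\alpha}$, one has $\lambda_{1-\alpha} = \rho z$ and, from $E[X^2 I(|X| < z)] = (1 - 2\alpha) - 2z\phi(z)$, $M_\alpha = (1 - 2\alpha) - 2\rho^2 z\phi(z)$.

```python
from statistics import NormalDist


def sigma2_cat_normal(rho, alpha):
    """Asymptotic variance of Theorem 5.1(b) for a standardized bivariate normal.

    Assumes Var(Y) = Var(X) = 1 and Corr(Y, X) = rho, with 0 < alpha < 0.5.
    """
    std = NormalDist()
    z = std.inv_cdf(1.0 - alpha)            # z_{1-alpha}
    lam = rho * z                            # lambda_{1-alpha} = E(Y - mu_y | X = z)
    m_alpha = (1.0 - 2.0 * alpha) - 2.0 * rho ** 2 * z * std.pdf(z)
    return (2.0 * alpha * lam ** 2 + m_alpha) / (1.0 - 2.0 * alpha) ** 2


for rho in (0.0, 0.5, 0.9):
    print(rho, round(sigma2_cat_normal(rho, 0.1), 4))
```

With $\rho = 0$ the value reduces to $1/(1 - 2\alpha)$, the variance of a mean taken over the retained fraction $1 - 2\alpha$ of an independent sample. On the grid of values tried here the result stays above $\mathrm{Var}(Y) = 1$, so in this purely normal setting trimming on $X$ brings no gain over the untrimmed sample mean.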

To verify our assertion, we design the following mixed distribution:

\[ \begin{pmatrix} Y \\ X \end{pmatrix} \sim (1 - \delta) N_2\left( \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} 1 & \sigma_{12} \\ \sigma_{12} & 1 \end{pmatrix} \right) + \delta N_2\left( \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} \sigma_y^2 & \sigma_{12}^* \\ \sigma_{12}^* & 1 \end{pmatrix} \right), \]

where the parameter of interest is $\mu_y = E(Y)$. For each setting of the distributional parameters, we compute the minimum $\sigma_t^2$ and $\sigma_{cat}^2$ and display them in Tables 5 and 6. Note that the values in parentheses in Table 5 are the trimming proportions achieving the smallest asymptotic variances, and that $I(\mu_y)^{-1}$ represents the inverse of the Fisher information based on the $Y$ variable alone. For theoretical interest, we list $\sigma_{cat}^2$ only when $\sigma_{cat}^2 < I(\mu_y)^{-1}$ holds.

Table 5. Comparison of asymptotic variances (δ = 0.1)

                          σ²cat (σ12 = −0.9)   σ²cat (−0.8)    σ²t            I(µy)⁻¹
σ²y = 2    σ*12 = 1       1.042 (0.06)         1.067 (0.05)    1.107 (0.05)   1.093
           σ*12 = 1.4     0.951 (0.13)         1.003 (0.08)
σ²y = 5    σ*12 = 2       1.131 (0.15)         1.202 (0.11)    1.230 (0.06)   1.220
           σ*12 = 2.2     1.019 (0.2)          1.118 (0.14)
σ²y = 9    σ*12 = 2.9     1.150 (0.23)                         1.296 (0.09)   1.256
           σ*12 = 2.99    1.053 (0.26)         1.215 (0.2)

Table 6. Comparison of asymptotic variances (δ = 0.2)

                          σ12 = 0.5    0.7      0.9      0.99     I(µy)⁻¹
σ²y = 2    σ*12 = −1      1.179        1.128    1.042    0.974    1.185
           σ*12 = −1.4    1.084        0.985    0.788    0.571
σ²y = 5    σ*12 = −2.2    1.416        1.228    0.892    0.576    1.448
           σ*12 = −2.22   1.401        1.205    0.853    0.494
σ²y = 9    σ*12 = −2.9                          1.205    0.896    1.532
           σ*12 = −2.99                1.460    0.949    0.460

We have several comments on the results in Tables 5 and 6:

(a) Without extra information from other variables, the classical trimmed mean never outperforms the lower bound $I(\mu_y)^{-1}$, whereas, for the designed distributions in terms of variances and covariances, the categorized trimmed means outperform the corresponding lower bounds.

(b) We see the power of auxiliary information: $\sigma_{cat}^2$ can be as small as 0.46 when the lower bound is 1.532. Auxiliary information greatly reduces the asymptotic variance of trimming estimation.

(c) The fact that the set $\{\mu_y : \sigma_{cat}^2 < I(\mu_y)^{-1},\ \mu_y \in R\}$ has Lebesgue measure greater than zero supports the use of auxiliary information to modify statistical inference methods.

(d) Auxiliary information exists in this robust estimation when the extra variable has a relatively smaller variance and is highly correlated with the response variable. This agrees with the general understanding in the literature.

(e) Not every extra variable $X$ provides auxiliary information. Suppose that we have the following normality assumption:

\[ \begin{pmatrix} Y \\ X \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} \sigma_y^2 & \sigma_{12} \\ \sigma_{12} & \sigma_y^2 \end{pmatrix} \right). \]

Let $I_{yx}(\mu_y)^{-1}$ be the inverse of the Fisher information for $\mu_y$ derived from the above bivariate distribution. We may see that $I_{yx}^{-1}(\mu_y) = I^{-1}(\mu_y) = \sigma_y^2$, indicating that auxiliary information does not exist when the bivariate normal distribution holds.

6. Concluding Remarks

For estimating the unknown population means of categorized variables, we have derived distributional theory for parametric and nonparametric estimators that allows us to construct "theoretically correct" and "advanced" inference methods. Both approaches are shown to be valuable in statistical theory and application. The novel parametrization makes the parametric estimators much more efficient than the sample means used in classical ANOVA. On the other hand, the nonparametric categorized sample mean is found to involve auxiliary information that greatly improves the efficiency of nonparametric robust estimation. Ignoring the theory of categorization means not only using untrustworthy inference methods blindly but also forfeiting the chance to discover useful information created by categorization for improving inference. Categorization has long been criticized; we have taken a substantial step toward understanding it, but it deserves more attention from the statistical community.

We have several further remarks on this research:

(a) The idea of parametrization provides efficient parametric inference techniques for the unknown distributional parameters of categorized variables. Extending this parametrized parametric approach to non-normal distributions is desirable.

(b) The accuracy of the confidence interval and hypothesis testing methods formulated from quantity (4.1) requires further investigation. Testing the usual hypothesis of equal means can be done by setting up the general linear hypothesis with

\[ A = \begin{pmatrix} 1 & -1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 1 & -1 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & -1 \end{pmatrix}. \]

(c) Intuitively, we can expect that other robust estimation methods, such as the Winsorized mean in L-estimation and Huber's M-estimation, can also gain efficiency when the auxiliary information created by categorization is applied.

(d) The asymptotic variance $\sigma_{cat}^2$ is expected to be reduced even further when multiple sources of auxiliary information are used. A theory in the spirit of the Rao-Blackwell theorem, describing how much improvement can be reached, is an interesting theoretical problem.

7. Appendix

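We give the proof of Theorem 2.2 only. Theorem 3.1 follows from Han (2008), and the proofs of Theorems 2.1 and 3.2, being straightforward, are skipped.

Proof of Theorem 2.2. With quantile cutoffs, the sample group means may be represented as

\[ \hat\theta_{qj} - \mu_y = \frac{\sum_{i=1}^{n} (Y_i - \mu_y) I(\hat F_X^{-1}(\alpha_{j-1}) \le X_i \le \hat F_X^{-1}(\alpha_j))}{\sum_{i=1}^{n} I(\hat F_X^{-1}(\alpha_{j-1}) \le X_i \le \hat F_X^{-1}(\alpha_j))}. \tag{7.1} \]

Following the approaches of Ruppert and Carroll (1980) and Chen and Chiang (1996), we may see that

\[ n^{-1/2} \sum_{i=1}^{n} (Y_i - \mu_y) [I(X_i \le F_X^{-1}(\alpha) + n^{-1/2} T_n) - I(X_i \le F_X^{-1}(\alpha))] = E(Y - \mu_y \mid X = F_X^{-1}(\alpha)) f_X(F_X^{-1}(\alpha)) T_n + o_p(1). \tag{7.2} \]

Writing $\hat F_X^{-1}(\alpha) = F_X^{-1}(\alpha) + n^{-1/2} T_X$ with $T_X = \sqrt{n}(\hat F_X^{-1}(\alpha) - F_X^{-1}(\alpha))$, (7.2) gives

\[ n^{-1/2} \sum_{i=1}^{n} (Y_i - \mu_y) I(\hat F_X^{-1}(\alpha_{j-1}) \le X_i \le \hat F_X^{-1}(\alpha_j)) \tag{7.3} \]
\[ = [E(Y - \mu_y \mid X = F_X^{-1}(\alpha_j)) f_X(F_X^{-1}(\alpha_j))\, n^{1/2}(\hat F_X^{-1}(\alpha_j) - F_X^{-1}(\alpha_j)) \]
\[ \quad - E(Y - \mu_y \mid X = F_X^{-1}(\alpha_{j-1})) f_X(F_X^{-1}(\alpha_{j-1}))\, n^{1/2}(\hat F_X^{-1}(\alpha_{j-1}) - F_X^{-1}(\alpha_{j-1})) \]
\[ \quad + n^{-1/2} \sum_{i=1}^{n} (Y_i - \mu_y) I(F_X^{-1}(\alpha_{j-1}) \le X_i \le F_X^{-1}(\alpha_j))] + o_p(1). \]

A representation for the empirical quantile $\hat F_X^{-1}(\alpha)$,

\[ \sqrt{n}(\hat F_X^{-1}(\alpha) - F_X^{-1}(\alpha)) = [f_X(F_X^{-1}(\alpha))]^{-1}\, n^{-1/2} \sum_{i=1}^{n} (\alpha - I(X_i \le F_X^{-1}(\alpha))) + o_p(1), \tag{7.4} \]

may be seen in Ruppert and Carroll (1980). Moreover, we also have

\[ n^{-1} \sum_{i=1}^{n} I(\hat F_X^{-1}(\alpha_{j-1}) \le X_i \le \hat F_X^{-1}(\alpha_j)) = \alpha_j - \alpha_{j-1} + o_p(1). \tag{7.5} \]

By plugging (7.4) into (7.3) and carefully rearranging, the theorem follows from (7.1)-(7.3) and (7.5).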

Acknowledgement. We would like to acknowledge the help of Professor Stephen Portnoy for his comments and suggestions concerning auxiliary information.

References

Bennette, C. and Vickers, A. (2012). Against quantiles: categorization of continuous variables in epidemiologic research, and its discontents. BMC Medical Research Methodology, 12: 21.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer: New York.

Chen, L.-A. and Chiang, Y. C. (1996). Symmetric type quantile and trimmed means for location and linear regression model. Journal of Nonparametric Statistics, 7, 171-185.

Chen, L.-A., Chen, Dung-Tsa and Chan, Wenyaw. (2010). The p Value for the Outlier Sum in Differential Gene Expression Analysis. Biometrika, 97, 246-253.

Han, Y. (2008). Mathematical and empirical examinations of some epidemiological procedures. Ph.D. Dissertation, School of Public Health, University of Texas-Health Science Center at Houston.

Irwin, J. R. and McClelland, G. H. (2003). Negative consequences of dichotomizing continuous predictor variables. Journal of Marketing Research, 40, 366-371.

Kuk, A. and Mak, T. K. (1989). Median estimation in the presence of auxiliary information. Journal of the Royal Statistical Society B, 1, 261-269.

Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. University of California Publications in Statistics, 1, 277-330.

Letenneur L., Proust-Lima, C. et al. (2007). Flavonoid intake and cognitive decline over a 10-year period. American Journal of Epidemiology, 165: 1364-1371.

Luo J., Margolis K. L. et al. (2007). Body size, weight cycling, and risk of renal cell carcinoma among postmenopausal women: the women's health initiative (United States). American Journal of Epidemiology, 166: 752-759.

MacCallum, R. C., Zhang, S., Preacher, K. J. and Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19-40.

Martinez-Miranda, M. D., Rueda, M. and Arcos, A. (2007). Looking for optimal auxiliary variables in sample survey quantile estimation. Statistics, 41, 241-252.

Morgan, T. M. and Elashoff, R. M. (1986). Effect of categorizing a continuous covariate on the comparison of survival time.

Pocock, S. J., Collier, T. J. et al. (2004). Issues in reporting of epidemiological studies: a survey of recent practice. British Medical Journal, 329: 883.

Rao, J. N. K., Kovar, J. G. and Mantel, H. J. (1990). On estimating distribution function and quantile from survey data using auxiliary information. Biometrika, 77, 365-375.

Royston, P. and Sauerbrei, W. (2008). Multivariate Model-Building: A parametric approach to regression analysis based on fractional polynomials for modeling continuous variables. Wiley.

Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of American Statistical Association, 75, 828-838.

Shankar A., Klein R. et al. (2007). Association between glycosylated hemoglobin level and cardiovascular and all-cause mortality in type 1 diabetes. American Journal of Epidemiology, 166: 393-402.

Srivastava, S. K. (1971). A generalized estimator for the mean of a finite population using multiauxiliary information. Journal of the American Statistical Association, 66, 404-407.

Taylor, J. M. G. and Yu, M. (2002). Bias and efficiency loss due to categorizing an explanatory variable. Journal of Multivariate Analysis, 83, 2001-2045.

Walraven, C. and Hart, R. G. (2008). Leave'em alone-why continuous variables should be analyzed as such. Neuroepidemiology, 30, 138-139.

Zhao, L. P. and Kolonel, L. N. (1992). Efficiency loss from categorizing quantitative exposures in case-control studies. American Journal of Epidemiology, 136, 464-474.
