• 沒有找到結果。

On fast supervised learning for normal mixture models with missing information

N/A
N/A
Protected

Academic year: 2021

Share "On fast supervised learning for normal mixture models with missing information"

Copied!
11
0
0

加載中.... (立即查看全文)

全文

(1)

www.elsevier.com/locate/patcog

On fast supervised learning for normal mixture models

with missing information

Tsung I. Lin

a,

, Jack C. Lee

b

, Hsiu J. Ho

c

aDepartment of Applied Mathematics, National Chung Hsing University, Taichung 402, Taiwan bInstitute of Statistics and Graduate Institute of Finance, National Chiao Tung University, Hsinchu 300, Taiwan

cInstitute of Statistical Science, Academia Sinica, Taipei, Taiwan

Received 1 July 2005; received in revised form 7 November 2005; accepted 21 December 2005

Abstract

It is an important research issue to deal with mixture models when missing values occur in the data. In this paper, computational strategies using auxiliary indicator matrices are introduced for efficiently handling mixtures of multivariate normal distributions when the data are missing at random and have an arbitrary missing data pattern, meaning that missing data can occur anywhere. We develop a novel EM algorithm that can dramatically save computation time and be exploited in many applications, such as density estimation, supervised clustering and prediction of missing values. In the aspect of multiple imputations for missing data, we also offer a data augmentation scheme using the Gibbs sampler. Our proposed methodologies are illustrated through some real data sets with varying proportions of missing values.

䉷 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Bayesian classifier; Data augmentation; EM algorithm; Incomplete features; Rao-Blackwellization

1. Introduction

Finite mixture models are known as powerful and flex-ible tools, which have been fully developed and applied in various theoretic and real problems as they are capable of modelling a wide range of densities, see for example Refs. [1–3]. However, missing values frequently appear in many real-world multivariate data sets that complicate data analyses and statistical inferences for practitioners. Missing data imputation techniques under the assump-tion of multivariate normal model have been well studied by Refs. [4,5]. Recently, learning mixture models from incomplete data becomes an important research issue in multivariate analysis. The work on the use of Gaussian component was pioneered by Ghahramani and Jordan [6], denoted by GJ hereafter. They present how to imple-ment the expectation–maximization (EM) algorithm [7] to compute maximum likelihood (ML) estimates from ∗Corresponding author. Tel.: +886 4 22850420; fax: +886 4 22873028.

E-mail address:[email protected](T.I. Lin).

0031-3203/$30.00䉷2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2005.12.014

multivariate data with arbitrary pattern of missingness. They also compare the performance of EM imputation with a common mean imputation (MI) heuristic for the supervised classification of incomplete features.

Due to rapid advance of computational developments, Bayesian sampling-based approaches are usually considered as an alternative way in dealing with mixture models. There are plenty of papers in the literature to address the problem of fitting normal mixture models under Bayesian treatments. For example, Diebolt and Robert[8] employ the data aug-mentation (DA) technique of Tanner and Wong[9]as an ap-proximation method for evaluating the posterior distribution and show a duality principle. Escobar and West[10]present a nonparametric Bayesian density estimation for Dirichlet process mixture models. Richardson and Green [11] and Zhang et al. [12] propose a full Bayesian inference for a normal mixture model with unknown number of compo-nents using the reversible jump MCMC algorithm proposed by Green[13]. Stephens[14]and Fruhwirth-Schnatter[15] demonstrate Bayesian strategies for the elimination of label switching problems.

(2)

In this paper, we offer an efficient EM algorithm for the fitting of a likelihood-based normal mixture model using par-tially observed data. To reduce computational burden during the EM iterations, we incorporate two types of auxiliary bi-nary indicator matrices corresponding to the observed and unobserved components of each datum. With strategies sim-ilar to EM, we also offer a DA computational technique for efficiently imputting missing values and learning parameters using the Gibbs sampler[16], which constructs a Markov chain that converges to a tractable posterior distribution. The feature of the chosen prior distributions is weakly informa-tive to avoid mathematical and computational pitfalls of us-ing improper priors in mixture model, see Celeux et al.[17]. The rest of the paper proceeds as follows. In the next sec-tion, we describe the model and its notations, and present some important statistical properties based on the missing information framework. In Sections 3 and 4, two efficient EM and DA algorithms are developed to cope with ML and Bayesian estimation, respectively. We also investigate two issues regarding classification and prediction of incomplete features from both ML and Bayesian perspectives. In Sec-tion 5, some real data sets are utilized to illustrate our pro-posed methodologies with varying proportions of artificially missing values. Also, empirical comparisons between ML and Bayesian approaches in terms of classification and pre-diction accuracies for incomplete features are demonstrated. Finally, some concluding remarks are given in Section 6.

2. A normal mixture model with missing information

In the normal mixture model, we assume that Y = (Y1, . . . ,Yn) form a p-dimensional random sample from a

population with g subclassesC1, . . . , Cg, and each Yj has

the density f (Yj | ) = g  i=1 wipYjµi, i, wi0, g  i=1 wi= 1, (1)

where wi’s are mixing probabilities, p(·|µ, ) de-notes a p-dimensional multivariate normal component density with mean µ and covariance matrix , and

 = (w1, . . . , wg, µ1, . . . , µg, 1, . . . , g) is the vec-tor of mixture model parameters subject to gi=1wi = 1 and i’s are positive definite matrices. Thus, there are g(p + 1)(p + 2)/2 − 1 distinct parameters in model (1).

Typically, in the EM framework, mixture models can be characterized as having an incomplete data structure. It is convenient to formalize the missing part as a set of membership labels Z = (Z1, . . . , Zn) with each la-bel Zj = (Z1j, . . . , Zgj) being a binary vector such that

Zij = 1 if Yj belongs to component i and Zij = 0 otherwise. Given the mixing probabilities, Z1, . . . , Znindependently

follow a multinomial distribution. We shall write ZjM(1; w1, . . . , wg).

For notational simplicity, let

ij = (Yj − µi)−1i (Yj− µi), (2) denote the Mahalanobis distance for Yjwith respect to mean

µi and covariance matrixi. The complete likelihood func-tion for is Lc(|Y, Z) ∝ n  j=1 g  i=1  wi|i|−1/2exp  −1 2ij Zij . (3) We consider the maximum likelihood estimation problem of model (1) when Y are not completely observed. We fur-ther assume that the patterns of missingness are arbitrary and missing at random (MAR), see Refs. [18,19]for more tails. Generally speaking, MAR refers to the missingness de-pending only on observed values but not on missing values. Let Yj be partitioned into two components (Yoj,Ymj), where Yoj (poj × 1) and Ymj ((p − pjo) × 1) denote the observed and missing components of Yj, respectively. To facilitate the EM algorithm, it is advantageous to introduce two types of binary indicator matrices, denoted by Oj and

Mj hereafter, corresponding to Yj such that Yoj= OjYj and

Ymj = MjYj, respectively. Notice that Oj and Mjare poj× p and (p− pjo) × p matrices extracted from a p-dimensional identity matrix Ip corresponding to row-positions of Yoj and Ymj in Yj, respectively. We then have the following two propositions.

Proposition 1. Suppose Yj is partitioned into two compo-nents(Yoj,Ymj), where Yoj = OjYj and Ymj = MjYj. We thus have Yj = Yoj, ifpoj= p; OjYoj+ MjYmj, if 1poj< p, and OjOj + MjMj= Ip.

Proof. The proof is straightforward and hence is omitted.

 Proposition 2. Let Yjg  i=1 wip(Yji, i), and let Yoj

and Ymj be the observed and missing components corre-sponding Yj, respectively. The marginal distribution of Yoj is denoted by Yoj ∼gi=1wipo j(Y o joij, ooij ), where po j(Y o joij, ooij ) = (2)−p o j/2|oo ij |−1/2exp(−12 o ij), and µo ij = Ojµi, ooij = OjiOj, o ij = (Yj − µi)Sooij (Yj− µi), Sooij = Oj(OjiOj)−1Oj. (4)

(3)

Consequently, Ymj|Yoj ∼gi=1wijp−po j(Y m jmij·o, mmij ·o), where p−po j(Y m jmij , ·o mmij ·o) = (2)−(p−p o j)/2|mm·o ij |−1/2 × exp(−1 2 m·o ij ), and wij = wipoj(Y o joij, ooij )/ g  h=1 whpo j(Y o johj, oohj), µm·o ij = Mji+ iSooij (Yj− µi)), mm·o ij = EijiMj, Eij = Mj(Ip− iSooij ), m·o ij = (Yj− µi)Smmij ·o(Yj− µi),

Smmij ·o= Eij(EijiMj)−1Eij. (5)

Proof. The sketch of the proof is given in Appendix A. 

To enhance the computational efficiency for estimation, we suggest to rearrange Y according to unique missing patterns of the data. The procedure can be implemented as follows:

(a) Build a binary n× p indicator matrix, R = [rij], with each entry rij =1 if Yij is missing and rij =0 otherwise. (b) Build a p×1 vector z=Rb, where b=(21, 22, . . . , 2p). Notice that the number of unique missing patterns is equal to the number of unique elements in z.

(c) Rank z in an ascending or descending order, denoted by z. Rearrange Y according to the row positions of zin z. This will yield clustering of identical patterns of missingness in Y which are adjacent to each other.

3. An efficient EM procedure for ML estimation

Let Yo = (Yo1, . . . ,Yon) and Ym = (Ym1, . . . ,Ymn) denote the observed portion and missing portion of the data, re-spectively. The complete-data log-likelihood function can be reexpressed by c(Yo,Ym, Z ) = c1(wY o,Ym, Z ) +  c2(Y o,Ym, Z ) = g  i=1 n  j=1 Zij log w i +1 2 g  i=1⎝log |−1 i | n  j=1 Zij− n  j=1Zij( o ij+mij )·o ⎞ ⎠, (6) where=(w1, . . . , wg) and =(µ1, . . . , µg, 1, . . . , g). From Eq. (5), it is easy to verify that−1i =Sooij +Smmij ·oand

OjOj(Ip−iSooij)=0. Hence, we have the following result.

Proposition 3. The conditional expectation of Eq. (6) is

given by Q(| ˆ(k)) = E(c( | Yo,Ym, Z) | Yo, ˆ(k)) = Q1(w| ˆ (k) ) + Q2(| ˆ (k) ). It follows that Q1(w| ˆ (k) ) = g  i=1 n  j=1 ˆZ(k) ij log wi, (7) Q2(| ˆ (k) ) =1 2 g  i=1⎝log |−1 i | n  j=1 ˆZ(k) ij −tr ⎛ ⎝−1 i n  j=1 (k)ij ⎞ ⎠ ⎞ ⎠ , (8) where (k)ij = ˆZij(k)((ˆY(k)ij − µi)(ˆY(k)ij − µi) + (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i ), (9) ˆZ(k) ij = ˆwi(k)po j(Y o j| ˆµo(k)ij , ˆ oo(k) ij ) g h=1 ˆwh(k)po j(Y o j| ˆµo(k)hj , ˆ oo(k) hj ) , (10) ˆY(k)ij = ˆµ(k)i + ˆ (k) i ˆS oo(k) ij (Yj− ˆµ(k)i ), (11)

and ˆSoo(k)ij is Sooij given in Eq. (4) withi replaced by ˆ(k)i .

Proof. The detailed proof is shown in Appendix B. 

By these propositions, a modification of GJ’s EM algo-rithm can be implemented as follows:

E-step: Given  = ˆ(k), impute ˆZ(k)ij and ˆY(k)ij for i =

1, . . . , g and j = 1, . . . , n, using Eqs. (10) and (11). M-Step:

1. Update ˆwi(k) by maximizing Eq. (7) over wi subject to their sum is unity, which gives

ˆw(k+1)i =1n n  j=1 ˆZ(k) ij .

2. Fixi at ˆ(k)i , update ˆµ(k)i by maximizing Eq. (8) over

µi, which leads to ˆµ(k+1)i = n j=1 ˆZ(k)ij ˆY (k) ij n j=1 ˆZij(k) .

(4)

3. Fixµiat ˆµ(k+1)i , update ˆ (k)

i by maximizing constrained Eq. (8) overi, which leads to

ˆ(k+1)i = n j=1 ˆ (k) ij n j=1 ˆZij(k) ,

where ˆ(k)ij is (k)ij in Eq. (9) withµireplaced byˆµ(k+1)i . We remark two major advantages of the above EM al-gorithm:

(a) With auxiliary matrices Oj’s obtained at the initiation, there is no need to take care of the associated row po-sitions of missing values at each iteration.

(b) The implementation of the M-step has low computa-tional cost as it is similar to the case of no missing values. Therefore, the modified EM algorithm is more straightforward than the version of GJ.

Applying Bayes’ theorem, the posterior probability of the

Yj belonging toCi can be estimated by ˆwij = Pr(Zij = 1|Yo, ˆ) = ˆwipjo(Y o j| ˆµoij,ˆ oo ij ) g h=1 ˆwhpo j(Y o j| ˆµohj, ˆ oo hj) . (12)

By the ML classification theory[20], Yj is assigned toCs

if ˆwsj> ˆwij(i = 1, . . . , g; i = s).

Consequently, an ML predictor for the missing component

Ymj is given by ˆYm j = E(Ymj|Yo, ˆ) = Mj g  i=1 ˆwij( ˆµi+ ˆi ˆSooij (Yj− ˆµi)). (13)

4. A data augmentation scheme for Bayesian sampling

The DA algorithm[9]is a general and effective algorithm for producing multiple imputation of missing data. The DA has been broadly applied in a variety of missing data prob-lems, see Refs.[4,19]and references therein. In this section, we construct an efficient DA algorithm that combines the la-tent variables Z and unobserved data Ymfor simulating the posterior density of.

The DA algorithm consists of the imputation step (I-step) and the posterior step (P-(I-step). At the kth iteration of the DA algorithm, the I-step is defined by drawing impu-tations of Z(k)j and Ymj(k) from the predictive distributions p(Zj | Yo, (k)) and p(Ymj | Yo, Zj, (k)), respectively for

all j, and the P-step refers to generating(k+1)from p( |

Yo,Ym(k+1), Z(k+1)). If iterations are performed by a

suffi-ciently long burn-in period, then the simulations Z(k)j , Ymj(k) and(k) are distributed according to p(Zj | Yo), p(Ymj |

Yo) and p( | Yo), respectively for all k. To perform the

Bayesian inference for mixture models, it is necessary to choose a proper prior distribution for each parameter to avoid yielding improper posterior distributions[17].

In various mixture contexts, a vague Dirichlet distri-bution, denoted by D(, . . . , ), is the most natural prior for mixing probabilities w. Its density is proportional to w1−1· · · wg−1−1(1 − w1− · · · − wg−1)−1. For component

mean vectors µi, it is standard to adopt conjugate

Gaus-sian priors. As for the inverse covariance matrix −1i , the Wishart distribution is often chosen as a conjugate prior. A p-dimensional Wishart distribution with parameters  and A (p× p) is denoted by Wp(,A), and for p the

density is

f (U|A) ∝ |A|−/2|U|(−p−1)/2exp 

−1 2tr(UA

−1), where A is called a hyperparameter matrix if U is consid-ered random in Bayesian treatments.

Following the suggestion of [11,14]who base their rec-ommendation on the use of conjugate priors, our chosen priors are w∼ D(, . . . , ), µi ∼ Np  , −1 (i = 1, . . . , g), −1i | B ∼ Wp  2, (2B)−1  (i = 1, . . . , g), B∼ Wp  2, (2H)−1  ,

where B is a hyperparameter matrix with a conjugate Wishart distribution, and (, H, , , , ) are fixed as appropriate quantities to reflect the flatness of priors. The joint prior distribution function of and B is

(, B) ∝ |B|g+(2−p−1)/2exp(−tr(HB)) g  i=1 wi−1 × g  i=1 |−1i |(2−p−1)/2exp ×  −1 2i− ) (µi− ) − trB−1 i  . (14) Upon multiplying Eqs. (3) and (14), we have the following joint posterior density:

p(, B,Ym, Z|Yo) ∝ w−11 · · · wg−1| B|g+(2−p−1)/2exp(−tr(HB)) × g  i=1 exp  −1 2i− ) (µ i− )  ×−1i  (2−p−1)/2 exp(−tr(B−1i )) × n  j=1 g  i=1  wi | −1i |1/2exp  −1 2( o ij+mij )·o Zij , (15) whereoij and mij are given in Eqs. (4) and (5), respectively.·o

(5)

Proposition 4. The full conditional posteriors of, B, Z

and Ym are as follows (the symbol “| · · ·” denotes condi-tioning on all other variables):

p(Zj|Yo, ) ∝ g  i=1 (wipo j(Y o joij, ooij ) Z ij , p(Ym j|Zij = 1, · · ·) ∝ exp  −1 2(Y m

j − µmij )·o ijmm·o−1(Ymj − µmij )·o  , p(w| · · ·) ∝ g  i=1 w n j=1Zij+−1 i , p(µi| · · ·) ∝ exp  −1 2i− µi)∗ −1 i i− µi)  , p(B| · · ·) ∝ |B|(2(g+)−p−1)/2 × exp  −tr  B  H+ g  i=1 −1i  , p(−1i | · · ·) ∝−1i  (−p−1)/2 exp  −1 2tr( −1 i Ai)  , whereµmij·oandmmij ·oare given by Eq. (5), and

i = ⎛ ⎝−1 i n  j=1Zij +  ⎞ ⎠ −1 , (16) µi = i⎝−1 i n  j=1ZijY j+  ⎞ ⎠ , (17) i = n  j=1Zij + 2, (18) Ai = 2B + n  j=1Zij  Yj − µi Yj− µi, (19) fori = 1, . . . , g and j = 1, . . . , n.

Proof. The proof is straightforward and hence is omitted.

 In the simulation process, samples for Z, Ym, B and are alternately generated, the DA algorithm using the Gibbs sampler can be implemented as follows:

I-Step:

1. Given,Ymand Yo, generate ZjfromM(1; r1j, . . ., rgj), where rij = wip o j  Yojoij, ooij  g h=1whpo j  Yojohj, oohj .

2. Generate Ymj given Zij = 1,  and Yo, from Np−po

j



µm·o

ij , mmij ·o 

, whereµmij·oandmmij ·oare as in Eq. (5). P-Step:

1. Generate w given Z fromD(n1+ , . . . , ng+ ), where

ni=nj=1Zij.

2. Generateµi given Z,i, Yoand Ym fromNpµi, i  withµi andi given in Eqs. (17) and (16), respectively. 3. Generate B given1, . . . , g fromWp(2, (2H)−1),

where= g +  and H= H +gi=1−1i .

4. Generate−1i given Z,µi, Yoand YmfromWp(i, A−1i ), wherei and Ai are given in Eqs. (18) and (19), respec-tively.

To satisfy the “Principle of Stable Estimation” of Ed-wards et al.[21]in the Bayesian treatment, we need to spec-ify (, , , , H) so as to be insensitive to changes of the prior. Specifically, it is often to choose  = 1. For  and

, we let  be the empirical mean vector and −1= (1 −

)−1diag{R2

1, . . . , R2p}, where  is the percentage of missing

values of the data which is used to adjust the flatness and Ri

is the range of the observed values for the ith attribute. This specification makes a weak prior information for µi. As a

generalization of [11], we take = p + 1,  = (p + 1)/10 and H= 10.

We are interested in the classification and prediction problems for incomplete features. Under certain conditions, quantities based on Rao-Blackwellization[22]often greatly improve the precision of Monte Carlo estimates. Given a set of converged Monte Carlo DA samples()( = 1, . . . , L), a Bayesian predictor for Ymj is given by

 Ymj = 1 L L  =1 E(Ym j|Yoj, ()) =MjL1 L  =1  g  i=1 r() ij () i +()i Soo () ij (Yj−µ () i ))  , (20) where r() ij = w()i po j(Y o jo () ij ,  oo() ij ) g h=1w()h po j(Y o jo () hj , oo () hj ) .

Consequently, a Bayesian classifier for Yj can be estimated by averaging over the draws of()

ˆrij = Pr(Zij = 1|Y∗ o j) ≈ 1 L L  =1 r() ij . (21)

By the Bayesian classification rule, Yj is assigned toCs if ˆrsj> ˆrij(i = 1, . . . , g; i = s).

(6)

Table 1

A comparison of CPU time (in seconds) and relative reduced time (RRT) between GJ-EM algorithm (old) and the proposed procedure (new) under various missing rates (Replications= 500)

Data = 10% = 20% = 30%

Old New RRT (%) Old New RRT (%) Old New RRT (%)

Iris 12.47 1.22 90.2 21.51 1.61 92.5 56.21 3.61 93.6

Crabs 34.72 3.27 90.6 78.77 6.78 91.4 265.01 20.68 92.2

RRT= (old − new)/old × 100%

Table 2

A comparison of prediction accuracies for MI, EM and DA imputations with the standard deviations in parentheses for the iris data set (Replications=500)

(%) MAE MARE RMSE

MI EM DA MI EM DA MI EM DA 10 0.812 0.213 0.210 0.697 0.100 0.099 1.062 0.285 0.280 (0.081) (0.026) (0.026) (0.186) (0.027) (0.027) (0.096) (0.050) (0.050) 20 0.816 0.237 0.233 0.675 0.114 0.113 1.071 0.331 0.326 (0.053) (0.025) (0.025) (0.129) (0.031) (0.031) (0.065) (0.060) (0.060) 30 0.820 0.268 0.259 0.684 0.138 0.132 1.078 0.395 0.380 (0.046) (0.023) (0.022) (0.097) (0.033) (0.032) (0.058) (0.061) (0.060) 40 0.819 0.301 0.278 0.683 0.161 0.154 1.077 0.448 0.428 (0.035) (0.030) (0.026) (0.082) (0.038) (0.036) (0.041) (0.065) (0.063) 50 0.817 0.346 0.325 0.675 0.198 0.188 1.074 0.522 0.495 (0.029) (0.031) (0.028) (0.084) (0.043) (0.041) (0.036) (0.063) (0.060) Table 3

A comparison of prediction accuracies for MI, EM and DA imputations with the standard deviations in parentheses for the crabs data set (Replications=500)

(%) MAE MARE RMSE

MI EM DA MI EM DA MI EM DA 10 4.063 0.421 0.415 0.202 0.024 0.023 5.391 0.611 0.598 (0.337) (0.055) (0.050) (0.018) (0.003) (0.003) (0.427) (0.114) (0.105) 20 4.008 0.484 0.474 0.200 0.027 0.026 5.343 0.714 0.693 (0.227) (0.041) (0.037) (0.012) (0.002) (0.002) (0.305) (0.090) (0.083) 30 4.037 0.568 0.550 0.202 0.030 0.029 5.384 0.846 0.812 (0.169) (0.044) (0.041) (0.009) (0.002) (0.002) (0.225) (0.096) (0.091) 40 4.036 0.662 0.632 0.203 0.035 0.033 5.381 0.977 0.932 (0.138) (0.044) (0.042) (0.007) (0.002) (0.002) (0.188) (0.092) (0.094) 50 4.039 0.768 0.728 0.202 0.039 0.037 5.386 1.120 1.058 (0.108) (0.052) (0.050) (0.006) (0.002) (0.002) (0.142) (0.102) (0.100) 5. Experimental results

For illustration purposes, we start to apply results de-veloped in Sections 2–4 to two famous multivariate data sets. One is the iris data taken from Anderson [23] or Fisher [24]. It consists of four-dimensional measurements in centimeters on the attributes of petal length, petal width, sepal length and sepal width for 50 flower spec-imens of each of three species: setsosa, versicolor, and virginica. The other is the crabs data of Campbell and Mahon [25] on the gensus Leptograpsus. It consists of

Table 4

A comparison of average misclassification rates (%) between ML and Bayesian classifiers (replicates= 500)

(%) Iris Crabs ML Bayesian ML Bayesian 0 3.33 3.00 7.50 7.30 10 3.85 3.75 9.75 9.50 20 5.20 5.00 13.66 13.55 30 6.90 6.10 19.22 18.80 40 10.15 9.20 26.75 25.20 50 13.42 12.30 35.21 33.00

(7)

50 100 150 200 300 350 400 450 500 Freshwater Marine 300 350 400 450 500 Marine 300 350 400 450 500 Marine 300 350 400 450 500 Marine 50 100 150 200 Freshwater

Bayesian (missing rate = 0%) Bayesian (missing rate = 30%)

50 100 150 200 Freshwater

50 100 150 200 Freshwater

Fig. 1. ML and Bayesian density estimation for the two-component salmon data (•, both attributes are completely observed; , one of the two attributes is missing).

five-dimensional morphological measurements on the at-tributes of width of frontal lip, rear width, length along the mid-line of the carapace, maximum width of the carapace and body depth for 50 crabs of each of four groups: blue

male, blue female, orange male and orange female. Both data sets are included as a part of the R package, which is freely available at the web site http://cran.r-project. org.

(8)

To conduct experimental studies, we first generate 500 artificially missing data sets by deleting at random from the three data sets under various specified missing rate  (proportion of missing values) while we maintain each da-tum to have at least one observed attribute.Table 1presents the computation times of our developed EM algorithm and those of using GJ-EM. All computations are solely carried out by R package in the environment of a desktop PC (CPU: 3Gb-MHz/Intel Pentium 4 Processor; RAM: 1Gb/DDR-400). Since the programming implementations have many characteristics (e.g., vector or matrix subroutines instead of loops), the CPU times inTable 1might not be directly com-parable, but provide a sense of their actual performances in a practical setting. As seen in the table, all computation times are dramatically reduced over 90% by using the new EM procedure.

To exemplify the predictive performance for the EM and DA imputation methods, see Eqs. (13) and (20), together with the traditional mean imputation (MI) method, known as “filling-in” with the sample mean of the associated attribute, we utilize the pseudo-cross-validation (PSV) of Stone[26] to evaluate these three approaches. A relative tolerance of 10−8for the log-likelihood function and parameter estimates are used as the convergence criterion for the EM algorithm. As for the DA algorithm, we take the ML estimates as the initialization and carry out 2000 iterations with the first 1000 iterations as burn-in and the remaining 1000 iterations as in-ference samples. It is noted that our chosen burn-in number is much larger than needed based on checking the multi-variate potential scale reduction factor (MPSRF) of Brooks and Gelman[27]. As for discrepancy measures, we use the mean absolute error (MAE), the mean absolute relative error (MARE) and root mean square error (RMSE). Comparison results are listed inTables 2 and3. As seen in the tables, we found that both EM and DA substantially outperform MI for all cases. Furthermore, DA imputation exhibits con-siderable promising accuracy in the prediction of missing values when compared to the EM imputation, especially as the size of observed values becomes small (i.e., missing rate increases).

As another illustration, we attempt to explore classifica-tion accuracies between the ML classifier Eq. (12) and the Bayesian classifier Eq. (21) via PSV. Experimental results inTable 4 indicate that both classifiers are comparable at low-level missing, but Bayesian classifier yields lower mis-classification rates as the missing rate increases, though im-provements are not substantial.

Finally, we are interested in comparing behaviors of den-sity estimation from both ML-fitted and Bayesian posterior predictive aspects. To illustrate this, we use the salmon data taken from Johnson and Wichern [28]. This data set has two attributes, the diameter of rings for the first-year fresh-water growth and the diameter of rings for the first-year marine growth (both measured in hundredths of an inch), for each of 50 Alaskan-born and Canadian-born salmon fishes. The ML-fitted density estimation is obtained by

plugging the ML estimates into Eq. (1). As for Bayesian predictive density, it can be approximated by the use of Rao-Blackwellization p(y|Yo) =  p(y|Yo, )p(|Yo) dL1 L  =1 p(y|()) =L1 L  =1  g  i=1 w()i  (2)−p/2|()i |−1/2 × exp  −1 2 () ij  , (22) where ()ij =(y − µ()i )()i −1(y − µ()i ) and () ( = 1, . . . , L) is a set of converged Monte Carlo samples gen-erated from the DA algorithm.

The contour plots obtained by the ML-fitting and Bayesian predictive densities Eq. (22) for both completely observed data ( = 0%) and partially observed data ( = 30%) are de-picted in Fig. 1, respectively. Both look similar when data are not missing but using Eq. (22) seems to have a relatively smoother appearance. In addition, we found that the ML-fitted contour shapes tend to be distorted at high-level miss-ing and even for moderate-level missmiss-ing ( = 30%). How-ever, the distortion rarely happened while using Eq. (22). This indicates that Bayesian learning is more resistant to missing values.

6. Conclusions

In this paper, two novel EM and DA computational algo-rithms for learning normal mixture models under a missing information framework are presented. It should be empha-sized that our proposed procedures offer neat ways to pro-gram with low-cost computation. Experimental results indi-cate that Bayesian treatment is a worthwhile tool for mixture modelling under a considerable extent of missing informa-tion.

Recently, Bayesian and non-Bayesian robust mixture model modelling using the t distribution has received no-table attentions, see Refs.[29–32]. Future work will make some kind of comparisons theoretically or empirically among various competitive models.

Acknowledgments

The authors are grateful to the Editors and one anony-mous referee for their helpful comments and suggestions that substantially improve the presentation of the paper. This work was supported by the National Science Council of Taiwan.

(9)

Appendix A. Proof of Proposition 2

Suppose Y∼ Np(µ, ), then for any q ×p matrix A with

rank q(qp), we can obtain AY ∼ Np(Aµ,AA). With

similar arguments, the marginal distributions of Yoj and Ymj are: Yoj= OjYjg  i=1 wipo j o ij, ooij ), µoij = Ojµi, oo ij = OjiOj, Ymj = MjYjg  i=1 wip−po j m ij , mmij ), µmij = Mjµi, mm ij = MjiMj.

Note that theij in Eq. (2) can be reexpressed as ij =  Yoj − µoij Ymj − µmij oo ij omij mo ij mmij −1 Yoj− µoij Ymj − µmij  , (23) whereomij =OjiMj andmoij =MjiOj. Also, the second and third factors on the right hand side of Eq. (23) can be represented by oo ij omij mo ij mmij −1 =  I −ooij−1omij 0 I  ⎡ ⎣ oo−1 ij 0 0 mmij ·o−1 ⎤ ⎦ ×  I 0 −mo ij oo −1 ij I  , and  Yoj− µoij Ymj − µmij  =  Oj(Yj− µi) Mj(Yj− µi)  =  Oj Mj  (Yj− µi). We then have the following standard results:

mm·o ij = mmij − moij oo −1 ij omij = MjiMj − MjiOj(OjiOj)−1OjiMj = Mj(Ip− iSooij )iMj = EijiMj, where Eij = Mj(Ip− iSooij), Sooij =Oj(OjiOj)−1Oj. Since −mo ij oo −1 ij Oj+ Mj= Mj− MjiOj(OjiOj)−1Oj = Mj(Ip− iSooij ) = Eij and  I 0 −mo ij oo −1 ij I  O j Mj  =  Oj Eij  ,

it suffices to show that

µm·o ij = µmij + moij oo −1 ij (Yoj− µoij) = Mjµi+ MjiOj(OjiOj)−1Oj(Yj− µi) = Mji+ iSooij (Yj − µi)). Hence, ij = (Yo j− µoij)ooij−1(Yoj− µoij) + (Ym

j − µmij )·o ijmm·o−1(Ymj − µmij )·o = (Yj− µi)(Sij + Soo mmij ·o)(Yj− µi) = o ij + mij ,·o where o ij = (Yjo− µoij)ijoo−1(Yoj− µoij) = (Yj− µi)Sooij (Yj− µi), m·o ij = (Ymj − µmij )·o mm·o −1 ij (Ymj − µmij )·o = (Yj− µi)Smmij ·o(Yj − µi),

Smmij ·o= Eij(EijiMj)−1Eij.

Using the fact that|i| = |ooij ||mmij ·o| and above results, we have f (Ym j|Yoj) = f (Yj) f (Yo j) = g i=1wipo j(Y o joij, ooij )p−po j(Y m j|Yoj, µmij·o, mmij ·o) g i=1wipo j(Y o joij, ooij ) = g  i=1 wij p−po j(Y m j|Yoj, µmij ,·o mmij ·o), where wij=wipo j(Y o joij,ooij )/gh=1whpo j(Y o johj,oohj).

Appendix B. Proof of Proposition 3

Letting ˆZ(k)ij =E(Zij|Yo, ˆ(k)), ˆ(k)ij =E(ZijYj|Yo, ˆ(k))

and ˆ(k)ij =E(ZijYjYj|Yo, ˆ(k)), we can show that ˆZ(k) ij =Pr(Zij = 1|Y o j, ˆ (k) ) = ˆw (k) i po j(Y o j| ˆµo(k)ij , ˆ oo(k) ij ) g h=1 ˆw(k)h po j(Y o j| ˆµo(k)hj , ˆ oo(k) hj ) ,

ˆ(k)ij =Pr(Zij = 1|Yo, ˆ(k))E[Y

j|Zij = 1,Yo, ˆ(k)] = E(Zij|Yo, ˆ(k))E(Y

j|Zij = 1,Yo, ˆ(k)) = ˆZij(k)ˆY(k)ij ,

(10)

and

ˆ(k)ij = E(ZijYjYj|Yo, ˆ(k))

= E(Zij|Yo, ˆ(k))E(Y

jYj|Zij = 1,Yo, ˆ(k))

= ˆZ(k)ij (ˆY(k)ij ˆY(k)ij + Cov(Yj|Zij = 1,Yo, ˆ(k))) = ˆZ(k)ij (ˆY(k)ij ˆY(k)ij + (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i ). Since Yj= OjYoj+ MjYjmand OjOj(Ip− iSooij) = 0, we have

ˆY(k)ij = E(Yj|Zij = 1,Yo, ˆ(k))

= E(OjYoj+ MjYmj|Zij = 1,Yo, ˆ (k)

) = OjYoj + MjE(Ymj|Zij = 1,Yo, ˆ

(k) ) = OjYoj + Mj ˆµmij·o(k) = OjOjYj+ MjMj( ˆµ(k)i + ˆ(k)i ˆS oo(k) ij (Yj− ˆµ(k)i )) = OjOjYj+ (Ip− OjOj)( ˆµ(k)i + ˆ(k)i ˆS oo(k) ij (Yj − ˆµ(k)i )) = ˆµ(k)i + ˆ (k) i ˆS oo(k) ij (Yj− ˆµ(k)i ) + OjOj(Ip− ˆ(k)i ˆS oo(k) ij )(Yj− ˆµ(k)i ) = ˆµ(k)i + ˆ(k)i ˆS oo(k) ij (Yj− ˆµ(k)i ), and

Cov(Yj|Zij = 1,Yo, ˆ(k))

= Cov(OjYoj+ MjYmj|Zij = 1,Yo, ˆ (k)

) = MjCov(Ymj|Zij = 1,Yo, ˆ

(k) )Mj = Mj ˆ mm·o(k) ij Mj = Mj ˆE (k) ij ˆ (k) i MjMj = MjMj(Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i MjMj = (Ip− OjOj)(Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i (Ip− OjOj) = (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i (Ip− OjOj) = (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i . Therefore, (k)ij = ˆ(k)ij − ˆ(k)ij µ i − µiˆ(k)ij + ˆZij µ(k) iµi = ˆZ(k)ij (ˆY(k)ij ˆY(k)ij + (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i ) − 2 ˆZij(k)ˆY(k)ij µ i + ˆZ(k)ij µiµi = ˆZ(k)ij ((ˆY(k)ij − µi)(ˆY(k)ij − µi) + (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i ). References

[1]D.M. Titterington, A.F.M. Smith, U.E. Markov, Statistical Analysis of Finite Mixture Distributions, Wiley, New York, 1985.

[2]G.J. McLachlan, K.E. Basford, Mixture Models: Inference and Application to Clustering, Marcel Dekker, New York, 1988.

[3]G.J. McLachlan, D. Peel, Finite Mixture Model, Wiley, New York, 2000.

[4]J.L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, London, 1997.

[5]C.H. Liu, Efficient ML estimation of multivariate normal distribution from incomplete data, J. Multivariate. Anal. 69 (1999) 206–217.

[6]Z. Ghahramani, M.I. Jordan, Supervised learning from incomplete data via an EM approach, in: J.D. Cowan, G. Tesarro, J. Alspector (Eds.), Advances in Neural Information Processing Systems, vol. 6, Morgan Kaufmann, San Francisco, 1994, pp. 120–127.

[7]A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Statist. Soc. B. 39 (1977) 1–38.

[8]J. Diebolt, C.P. Robert, Estimation of finite mixture distributions through Bayesian sampling, J. R. Statist. Soc. B. 56 (1994) 363–375.

[9]M.A. Tanner, W.H. Wong, The calculation of posterior distributions by data augmentation (with discussion), J. Am. Statist. Assoc. 82 (1987) 528–550.

[10]M.D. Escobar, M. West, Bayesian density estimation and inference using mixtures, J. Amer. Statist. Assoc. 90 (1995) 577–588.

[11]S. Richardson, P.J. Green, On Bayesian analysis of mixtures with an unknown number of components, J. R. Statist. Soc. B. 59 (1997) 731–792.

[12]Z.H. Zhang, K.L. Chan, Y.M. Wu, C.B. Chen, Learning a multivariate gaussian mixture model with the reversible jump MCMC algorithm, Statist. Comput. 14 (2004) 343–355.

[13]P.J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (1995) 711–732.

[14]M. Stephens, Bayesian analysis of mixture models with an unknown number of components—an alternative to reversible jump methods, Ann. Statist. 28 (2000) 40–74.

[15]S. Fruhwirth-Schnatter, Markov Chain Monte Carlo estimation of classical and dynamic switching and mixture models, J. Amer. Statist. Assoc. 96 (2001) 194–209.

[16]S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 721–741.

[17]G. Celeux, M. Hurn, C.P. Robert, Computational and inferential difficulties with mixture posterior distributions, J. Amer. Statist. Assoc. 95 (2000) 957–970.

[18]D.B. Rubin, Inference and missing data, Biometrika 63 (1976) 581–592.

[19]R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, second ed., Wiley, New York, 2002.

[20]K.E. Basford, G.J. McLachlan, Estimation of allocation rates in a cluster analysis text, J. Amer. Statist. Assoc. 80 (1985) 286–293.

[21]W.H. Edwards, H. Lindman, L.J. Savage, Bayesian statistical inference for psychological research, Psycol. Rev. 70 (1963) 193–242.

[22]A.E. Gelfand, A.F.M. Smith, Sampling based approaches to calculate marginal densities, J. Amer. Statist. Assoc. 85 (1990) 398–409.

[23]E. Anderson, The irises of the Gaspé Peninsula, Bull. Am. Iris Soc. 59 (1935) 2–5.

[24]R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7 (1936) 179–188.

[25]N.A. Campbell, R.J. Mahon, A multivariate study of variation in two species of rock crab of genus Leptograpsus, Aust. J. Zool. 22 (1974) 417–425.

(11)

[26]M. Stone, Cross-validatory choice and assessment of statistical predi-ction (with discussion), J. R. Statist. Soc. B. 36 (1974) 111–147.

[27]S.P. Brooks, A. Gelman, General method s for monitoring convergence of iterative simulations, J. Comput. Graph. Statist. 7 (1998) 434–455.

[28]R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, fifth ed., Prentice-Hall, Englewood Cliffs, NJ, 2002.

[29]D. Peel, G.J. McLachlan, Robust mixture modeling using the t distribution, Statist. Comput. 10 (2000) 339–348.

[30]S. Shoham, Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions, Pattern Recognition 35 (2002) 1127–1142.

[31]T.I. Lin, J.C. Lee, H.F. Ni, Bayesian Analysis of mixture modelling using the multivariate t distribution, Statist. Comput. 14 (2004) 119–130.

[32]H.X. Wang, Q.B. Zhang, B. Luo, S. Wei, Robust mixture modelling using multivariate t distribution with missing information, Pattern Recognition Lett. 25 (2004) 701–710.

About the Author—TSUNG I. LIN received his B.A. in applied mathematics from National Chung Hsing University, Taiwan in 1993, the M.S. in

statistics from National Tsing Hua University, Taiwan in 1997 and Ph.D. in statistics from National Chiao Tung University, Taiwan in 2003. He is at present an assistant professor of National Chung Hsing University. Dr. Lin published papers in statistics and finance. His recent research includes computational statistics, robust mixture modelling and Bayesian analysis.

About the Author—JACK C. LEE received his B.A. in management from National Taiwan University, Taiwan in 1964, the M.S. in economics from the

University of Rochester in 1969 and Ph.D. in statistics from the State University of New York at Buffalo in 1972. Dr. Lee is a fellow of the American Statistical Association and an elected member of the International Statistical Institute. He is at present a University Chair Professor of National Chiao Tung University. Dr. Lee published over seventy papers in statistics, engineering and finance. His recent research includes multivariate analysis, speech recognition and finance.

About the Author—HSIU J. Ho received his B.A. in 2003 and M.S. in 2005, both in statistics from Tunghai University, Taiwan. He is currently a

research assistant under the supervision of Dr. Yi-Hau Chen in the Institute of Statistical Science at Academia Sinica, Taiwan. His research interests include pattern recognition and feature selection.

數據

Fig. 1. ML and Bayesian density estimation for the two-component salmon data ( •, both attributes are completely observed; , one of the two attributes is missing).

參考文獻

相關文件

Text A.. The activities that follow on p. 14-18 are designed to demonstrate how teachers can use “scaffolding strategies” to support student learning when using print media

2 Department of Educational Psychology and Counseling / Institute for Research Excellence in Learning Science, National Taiwan Normal University. Research on embodied cognition

In this paper, we build a new class of neural networks based on the smoothing method for NCP introduced by Haddou and Maheux [18] using some family F of smoothing functions.

In the past researches, all kinds of the clustering algorithms are proposed for dealing with high dimensional data in large data sets.. Nevertheless, almost all of

To facilitate data collection and input, this Bureau introduced an e-questionnaire for all local ordinary secondary day schools to report information on their

Please liaise with the officer in your school who are responsible for the Class and Subject Details Survey for using of the same class names in both the Class and Subject

To facilitate data collection and input, this Bureau introduced an e-questionnaire for all local ordinary secondary day schools to report information on their

To facilitate data collection and input, this Bureau introduced an e-questionnaire for all local ordinary secondary day schools to report information on their