On fast supervised learning for normal mixture models with missing information

(1)

www.elsevier.com/locate/patcog

On fast supervised learning for normal mixture models

with missing information

Tsung I. Lin

a,∗

, Jack C. Lee

b

, Hsiu J. Ho

c

a_{Department of Applied Mathematics, National Chung Hsing University, Taichung 402, Taiwan} b_{Institute of Statistics and Graduate Institute of Finance, National Chiao Tung University, Hsinchu 300, Taiwan}

c_{Institute of Statistical Science, Academia Sinica, Taipei, Taiwan}

Received 1 July 2005; received in revised form 7 November 2005; accepted 21 December 2005

Abstract

It is an important research issue to deal with mixture models when missing values occur in the data. In this paper, computational strategies using auxiliary indicator matrices are introduced for efﬁciently handling mixtures of multivariate normal distributions when the data are missing at random and have an arbitrary missing data pattern, meaning that missing data can occur anywhere. We develop a novel EM algorithm that can dramatically save computation time and be exploited in many applications, such as density estimation, supervised clustering and prediction of missing values. In the aspect of multiple imputations for missing data, we also offer a data augmentation scheme using the Gibbs sampler. Our proposed methodologies are illustrated through some real data sets with varying proportions of missing values.

Keywords: Bayesian classiﬁer; Data augmentation; EM algorithm; Incomplete features; Rao-Blackwellization

1. Introduction

Finite mixture models are known as powerful and ﬂex-ible tools, which have been fully developed and applied in various theoretic and real problems as they are capable of modelling a wide range of densities, see for example Refs. [1–3]. However, missing values frequently appear in many real-world multivariate data sets that complicate data analyses and statistical inferences for practitioners. Missing data imputation techniques under the assump-tion of multivariate normal model have been well studied by Refs. [4,5]. Recently, learning mixture models from incomplete data becomes an important research issue in multivariate analysis. The work on the use of Gaussian component was pioneered by Ghahramani and Jordan [6], denoted by GJ hereafter. They present how to imple-ment the expectation–maximization (EM) algorithm [7] to compute maximum likelihood (ML) estimates from ∗_{Corresponding author. Tel.: +886 4 22850420; fax: +886 4 22873028.}

E-mail address:[email protected](T.I. Lin).

multivariate data with arbitrary pattern of missingness. They also compare the performance of EM imputation with a common mean imputation (MI) heuristic for the supervised classiﬁcation of incomplete features.

Due to rapid advance of computational developments, Bayesian sampling-based approaches are usually considered as an alternative way in dealing with mixture models. There are plenty of papers in the literature to address the problem of ﬁtting normal mixture models under Bayesian treatments. For example, Diebolt and Robert[8] employ the data aug-mentation (DA) technique of Tanner and Wong[9]as an ap-proximation method for evaluating the posterior distribution and show a duality principle. Escobar and West[10]present a nonparametric Bayesian density estimation for Dirichlet process mixture models. Richardson and Green [11] and Zhang et al. [12] propose a full Bayesian inference for a normal mixture model with unknown number of compo-nents using the reversible jump MCMC algorithm proposed by Green[13]. Stephens[14]and Fruhwirth-Schnatter[15] demonstrate Bayesian strategies for the elimination of label switching problems.

(2)

In this paper, we offer an efficient EM algorithm for the fitting of a likelihood-based normal mixture model using par-tially observed data. To reduce computational burden during the EM iterations, we incorporate two types of auxiliary bi-nary indicator matrices corresponding to the observed and unobserved components of each datum. With strategies sim-ilar to EM, we also offer a DA computational technique for efficiently imputting missing values and learning parameters using the Gibbs sampler[16], which constructs a Markov chain that converges to a tractable posterior distribution. The feature of the chosen prior distributions is weakly informa-tive to avoid mathematical and computational pitfalls of us-ing improper priors in mixture model, see Celeux et al.[17]. The rest of the paper proceeds as follows. In the next sec-tion, we describe the model and its notations, and present some important statistical properties based on the missing information framework. In Sections 3 and 4, two efficient EM and DA algorithms are developed to cope with ML and Bayesian estimation, respectively. We also investigate two issues regarding classification and prediction of incomplete features from both ML and Bayesian perspectives. In Sec-tion 5, some real data sets are utilized to illustrate our pro-posed methodologies with varying proportions of artificially missing values. Also, empirical comparisons between ML and Bayesian approaches in terms of classification and pre-diction accuracies for incomplete features are demonstrated. Finally, some concluding remarks are given in Section 6.

2. A normal mixture model with missing information

In the normal mixture model, we assume that Y = (Y1, . . . ,Yn) form a p-dimensional random sample from a

population with g subclassesC1, . . . , Cg, and each Yj has

the density f (Yj | ) = g i=1 wipYjµi, i, wi0, g i=1 wi= 1, (1)

where wi’s are mixing probabilities, p(·|µ, ) de-notes a p-dimensional multivariate normal component density with mean µ and covariance matrix , and

= (w1, . . . , wg, µ1, . . . , µg, 1, . . . , g) is the vec-tor of mixture model parameters subject to g_i=1wi = 1 and i’s are positive deﬁnite matrices. Thus, there are g(p + 1)(p + 2)/2 − 1 distinct parameters in model (1).

Typically, in the EM framework, mixture models can be characterized as having an incomplete data structure. It is convenient to formalize the missing part as a set of membership labels Z = (Z1, . . . , Zn) with each la-bel Zj = (Z1j, . . . , Zgj) being a binary vector such that

Zij = 1 if Yj _{belongs to component i and Zij = 0 otherwise.} Given the mixing probabilities, Z1, . . . , Znindependently

follow a multinomial distribution. We shall write Zj ∼ M(1; w1, . . . , wg).

For notational simplicity, let

ij = (Yj − µi)−1i (Yj− µi), (2) denote the Mahalanobis distance for Yjwith respect to mean

µi and covariance matrixi. The complete likelihood func-tion for is Lc(|Y, Z) ∝ n j=1 g i=1 wi|i|−1/2exp −1 2ij Z_ij . (3) We consider the maximum likelihood estimation problem of model (1) when Y are not completely observed. We fur-ther assume that the patterns of missingness are arbitrary and missing at random (MAR), see Refs. [18,19]for more tails. Generally speaking, MAR refers to the missingness de-pending only on observed values but not on missing values. Let Yj be partitioned into two components (Yo_j,Ym_j), where Yo_j (po_j × 1) and Ym_j ((p − p_jo) × 1) denote the observed and missing components of Yj, respectively. To facilitate the EM algorithm, it is advantageous to introduce two types of binary indicator matrices, denoted by Oj and

Mj hereafter, corresponding to Yj such that Yo_j= OjYj and

Ym_j = MjYj, respectively. Notice that Oj and Mjare po_j× p and (p− p_jo) × p matrices extracted from a p-dimensional identity matrix Ip corresponding to row-positions of Yo_j and Ym_j in Yj, respectively. We then have the following two propositions.

Proposition 1. Suppose Yj is partitioned into two compo-nents(Yo_j,Ym_j), where Yo_j = OjYj and Ym_j = MjYj. We thus have Yj = Yo_j, ifpo_j= p; O_jYo_j+ M_jYm_j, if 1po_j< p, and O_jOj + MjMj= Ip.

Proof. The proof is straightforward and hence is omitted.

Proposition 2. Let Yj ∼ g i=1 wip(Yj|µi, i), and let Yoj

and Ym_j be the observed and missing components corre-sponding Yj, respectively. The marginal distribution of Yo_j is denoted by Yo_j ∼g_i=1wipo j(Y o j|µo_ij, oo_{ij ), where} po j(Y o j|µo_ij,oo_{ij ) = (2)}−p o j/2_|oo ij |−1/2exp(−12 o ij), and µo ij = Ojµi, ooij = OjiOj, o ij = (Yj − µi)Sooij (Yj− µi), Soo_{ij = O}_j(OjiO_j)−1Oj. (4)

(3)

Consequently, Ym_j|Yo_j ∼g_i=1w∗_ij_p−po j(Y m j|µm_ij·o, mm_ij ·o), where _p−po j(Y m j|µm_{ij ,}·o mm_ij ·o) = (2)−(p−p o j)/2_|mm·o ij |−1/2 × exp(−1 2 m·o ij ), and w∗ ij = wipoj(Y o j|µo_ij,oo_{ij )/} g h=1 whpo j(Y o j|µohj, oohj), µm·o ij = Mj(µi+ iSooij (Yj− µi)), mm·o ij = EijiMj, Eij = Mj(Ip− iSoo_{ij ),} m·o ij = (Yj− µi)Smmij ·o(Yj− µi),

Smm_ij ·o= E_ij(EijiM_j)−1_Eij. (5)

Proof. The sketch of the proof is given in Appendix A.

To enhance the computational efﬁciency for estimation, we suggest to rearrange Y according to unique missing patterns of the data. The procedure can be implemented as follows:

(a) Build a binary n× p indicator matrix, R = [rij], with each entry rij =1 if Yij is missing and rij =0 otherwise. (b) Build a p×1 vector z=Rb, where b=(21_{, 2}2_{, . . . , 2}p₎_. Notice that the number of unique missing patterns is equal to the number of unique elements in z.

(c) Rank z in an ascending or descending order, denoted by z∗. Rearrange Y according to the row positions of z∗ in z. This will yield clustering of identical patterns of missingness in Y which are adjacent to each other.

3. An efﬁcient EM procedure for ML estimation

Let Yo = (Yo₁, . . . ,Yo_n) and Ym = (Ym₁, . . . ,Ym_n) denote the observed portion and missing portion of the data, re-spectively. The complete-data log-likelihood function can be reexpressed by c(Yo,Ym, Z ) = c1(wY o_,Ym_{, Z ) +} c2(Y o_,Ym_{, Z )} = g i=1 n j=1 Zij log w i +1 2 g i=1 ⎛ ⎝log |−1 i | n j=1 Zij− n j=1Zij( o ij+mij )·o ⎞ ⎠, (6) where=(w1, . . . , wg) and =(µ1, . . . , µg, 1, . . . , g). From Eq. (5), it is easy to verify that−1_i =Soo_{ij +}Smm_ij ·oand

O_jOj(Ip−iSoo_ij)=0. Hence, we have the following result.

Proposition 3. The conditional expectation of Eq. (6) is

given by Q(| ˆ(k)) = E(c( | Yo,Ym, Z) | Yo, ˆ(k)) = Q1(w| ˆ (k) ) + Q2(| ˆ (k) ). It follows that Q1(w| ˆ (k) ) = g i=1 n j=1 ˆZ(k) ij log wi, (7) Q2(| ˆ (k) ) =1 2 g i=1 ⎛ ⎝log |−1 i | n j=1 ˆZ(k) ij −tr ⎛ ⎝−1 i n j=1 (k)_ij ⎞ ⎠ ⎞ ⎠ , (8) where (k)_{ij =} ˆZ_ij(k)((ˆY(k)_{ij − µ}_i)(ˆY(k)_{ij − µ}_i) + (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i ), (9) ˆZ(k) ij = ˆwi(k)po j(Y o j| ˆµo(k)_ij , ˆ oo(k) ij ) g h=1 ˆwh(k)po j(Y o j| ˆµo(k)hj , ˆ oo(k) hj ) , (10) ˆY(k)_{ij = ˆ}µ(k)i + ˆ (k) i ˆS oo(k) ij (Yj− ˆµ(k)i ), (11)

and ˆSoo(k)_ij is Soo_ij given in Eq. (4) withi replaced by ˆ(k)_i .

Proof. The detailed proof is shown in Appendix B.

By these propositions, a modiﬁcation of GJ’s EM algo-rithm can be implemented as follows:

E-step: Given = ˆ(k), impute ˆZ(k)_ij and ˆY(k)_{ij for i =}

1, . . . , g and j = 1, . . . , n, using Eqs. (10) and (11). M-Step:

1. Update ˆw_i(k) by maximizing Eq. (7) over wi subject to their sum is unity, which gives

ˆw(k+1)_i =1_n n j=1 ˆZ(k) ij .

2. Fixi at ˆ(k)i , update ˆµ(k)i by maximizing Eq. (8) over

µi, which leads to ˆµ(k+1)i = n j=1 ˆZ(k)ij ˆY (k) ij n j=1 ˆZij(k) .

(4)

3. Fixµiat ˆµ(k+1)i , update ˆ (k)

i by maximizing constrained Eq. (8) overi, which leads to

ˆ(k+1)i = n j=1 ˆ (k) ij n j=1 ˆZij(k) ,

where ˆ(k)_{ij is}(k)_ij in Eq. (9) withµireplaced byˆµ(k+1)i . We remark two major advantages of the above EM al-gorithm:

(a) With auxiliary matrices Oj’s obtained at the initiation, there is no need to take care of the associated row po-sitions of missing values at each iteration.

(b) The implementation of the M-step has low computa-tional cost as it is similar to the case of no missing values. Therefore, the modiﬁed EM algorithm is more straightforward than the version of GJ.

Applying Bayes’ theorem, the posterior probability of the

Yj belonging toCi can be estimated by ˆw∗_{ij = Pr(Zij = 1|Y}o_{, ˆ)} = ˆwipjo(Y o j| ˆµo_ij,ˆ oo ij ) g h=1 ˆwhpo j(Y o j| ˆµohj, ˆ oo hj) . (12)

By the ML classiﬁcation theory[20], Yj is assigned toCs

if ˆw_sj∗ > ˆw∗_ij(i = 1, . . . , g; i = s).

Consequently, an ML predictor for the missing component

Ym_j is given by ˆYm j = E(Ymj|Yo, ˆ) = Mj g i=1 ˆw∗_{ij( ˆ}µi+ ˆi ˆSoo_{ij (Y}j− ˆµi)). (13)

4. A data augmentation scheme for Bayesian sampling

The DA algorithm[9]is a general and effective algorithm for producing multiple imputation of missing data. The DA has been broadly applied in a variety of missing data prob-lems, see Refs.[4,19]and references therein. In this section, we construct an efﬁcient DA algorithm that combines the la-tent variables Z and unobserved data Ymfor simulating the posterior density of.

The DA algorithm consists of the imputation step (I-step) and the posterior step (P-(I-step). At the kth iteration of the DA algorithm, the I-step is deﬁned by drawing impu-tations of Z(k)_j and Ym_j(k) from the predictive distributions p(Zj | Yo, (k)) and p(Ym_j | Yo, Zj, (k)), respectively for

all j, and the P-step refers to generating(k+1)from p( |

Yo,Ym(k+1), Z(k+1)). If iterations are performed by a

sufﬁ-ciently long burn-in period, then the simulations Z(k)_j , Ym_j(k) and(k) are distributed according to p(Zj | Yo), p(Ym_j |

Yo) and p( | Yo), respectively for all k. To perform the

Bayesian inference for mixture models, it is necessary to choose a proper prior distribution for each parameter to avoid yielding improper posterior distributions[17].

In various mixture contexts, a vague Dirichlet distri-bution, denoted by D(, . . . , ), is the most natural prior for mixing probabilities w. Its density is proportional to w₁−1· · · w_g−1−1(1 − w1− · · · − wg−1)−1. For component

mean vectors µi, it is standard to adopt conjugate

Gaus-sian priors. As for the inverse covariance matrix −1_i , the Wishart distribution is often chosen as a conjugate prior. A p-dimensional Wishart distribution with parameters and A (p× p) is denoted by Wp(,A), and for p the

density is

f (U|A) ∝ |A|−/2|U|(−p−1)/2exp

−1 2tr(UA

−1₎_, where A is called a hyperparameter matrix if U is consid-ered random in Bayesian treatments.

Following the suggestion of [11,14]who base their rec-ommendation on the use of conjugate priors, our chosen priors are w∼ D(, . . . , ), µi ∼ Np , −1 (i = 1, . . . , g), −1i | B ∼ Wp 2, (2B)−1 (i = 1, . . . , g), B∼ Wp 2, (2H)−1 ,

where B is a hyperparameter matrix with a conjugate Wishart distribution, and (, H, , , , ) are fixed as appropriate quantities to reflect the flatness of priors. The joint prior distribution function of and B is

(, B) ∝ |B|g+(2−p−1)/2exp(−tr(HB)) g i=1 w_i−1 × g i=1 |−1i |(2−p−1)/2exp × −1 2(µi− ) _(µ_i_{− ) − tr}_B−1 i . (14) Upon multiplying Eqs. (3) and (14), we have the following joint posterior density:

p(, B,Ym_{, Z|Y}o₎ ∝ w−11 · · · wg−1| B|g+(2−p−1)/2exp(−tr(HB)) × g i=1 exp −1 2(µi− ) _(µ i− ) ×−1i (2−p−1)/2 exp(−tr(B−1i )) × n j=1 g i=1 wi | −1i |1/2exp −1 2( o ij+mij )·o Z_ij , (15) whereo_{ij and}m_{ij are given in Eqs. (4) and (5), respectively.}·o

(5)

Proposition 4. The full conditional posteriors of, B, Z

and Ym are as follows (the symbol “| · · ·” denotes condi-tioning on all other variables):

p(Zj|Yo, ) ∝ g i=1 (wipo j(Y o j|µo_ij,oo_{ij )} Z ij , p(Ym j|Zij = 1, · · ·) ∝ exp −1 2(Y m

j − µm_{ij )}·o _ijmm·o−1(Ymj − µm_{ij )}·o , p(w| · · ·) ∝ g i=1 w n j=1Zij+−1 i , p(µi| · · ·) ∝ exp −1 2(µi− µ ∗ i)∗ −1 i (µi− µ∗i) , p(B| · · ·) ∝ |B|(2(g+)−p−1)/2 × exp −tr B H+ g i=1 −1i , p(−1_i | · · ·) ∝−1i (∗_−p−1)/2 exp −1 2tr( −1 i Ai) , whereµm_ij·oandmm_ij ·oare given by Eq. (5), and

∗i = ⎛ ⎝−1 i n j=1Zij + ⎞ ⎠ −1 , (16) µ∗i = ∗i ⎛ ⎝−1 i n j=1ZijY j+ ⎞ ⎠ , (17) ∗ i = n j=1Zij + 2, (18) Ai = 2B + n j=1Zij Yj − µi Yj− µi, (19) fori = 1, . . . , g and j = 1, . . . , n.

Proof. The proof is straightforward and hence is omitted.

In the simulation process, samples for Z, Ym, B and are alternately generated, the DA algorithm using the Gibbs sampler can be implemented as follows:

I-Step:

1. Given,Ymand Yo, generate ZjfromM(1; r1j, . . ., rgj), where rij = wip o j Yo_j|µo_ij, oo_ij g h=1whpo j Yo_j|µo_hj, oo_hj .

2. Generate Ym_j _{given Zij = 1, and Y}o, from N_p−po

j

µm·o

ij , mmij ·o

, whereµm_ij·oandmm_ij ·oare as in Eq. (5). P-Step:

1. Generate w given Z fromD(n1+ , . . . , ng+ ), where

ni=n_j=1Zij.

2. Generateµi given Z,i, Yoand Ym fromNpµ∗i, ∗i withµ∗_i and∗_i given in Eqs. (17) and (16), respectively. 3. Generate B given1, . . . , g fromWp(2∗, (2H∗)−1),

where∗= g + and H∗= H +g_i=1−1_i .

4. Generate−1_i given Z,µi, Yoand YmfromWp(∗_i, A−1_i ), where∗_i and Ai are given in Eqs. (18) and (19), respec-tively.

To satisfy the “Principle of Stable Estimation” of Ed-wards et al.[21]in the Bayesian treatment, we need to spec-ify (, , , , H) so as to be insensitive to changes of the prior. Speciﬁcally, it is often to choose = 1. For and

, we let be the empirical mean vector and −1_{= (1 −}

)−1_diag{R2

1, . . . , R2p}, where is the percentage of missing

values of the data which is used to adjust the ﬂatness and Ri

is the range of the observed values for the ith attribute. This speciﬁcation makes a weak prior information for µi. As a

generalization of [11], we take = p + 1, = (p + 1)/10 and H= 10.

We are interested in the classiﬁcation and prediction problems for incomplete features. Under certain conditions, quantities based on Rao-Blackwellization[22]often greatly improve the precision of Monte Carlo estimates. Given a set of converged Monte Carlo DA samples()( = 1, . . . , L), a Bayesian predictor for Ym_j is given by

Ym_j = 1 L L =1 E(Ym j|Yoj, ()) =Mj_L1 L =1 _g i=1 r() ij (µ () i +()i Soo () ij (Yj−µ () i )) , (20) where r() ij = w()_i po j(Y o j|µo () ij , oo() ij ) g h=1w()h po j(Y o j|µo () hj , oo () hj ) .

Consequently, a Bayesian classiﬁer for Yj can be estimated by averaging over the draws of()

ˆr_{ij = Pr(Zij = 1|Y}∗ o j) ≈ 1 L L =1 r() ij . (21)

By the Bayesian classiﬁcation rule, Yj is assigned toCs if ˆr_sj∗ > ˆr_ij∗ (i = 1, . . . , g; i = s).

(6)

Table 1

A comparison of CPU time (in seconds) and relative reduced time (RRT) between GJ-EM algorithm (old) and the proposed procedure (new) under various missing rates (Replications= 500)

Data = 10% = 20% = 30%

Old New RRT (%) Old New RRT (%) Old New RRT (%)

Iris 12.47 1.22 90.2 21.51 1.61 92.5 56.21 3.61 93.6

Crabs 34.72 3.27 90.6 78.77 6.78 91.4 265.01 20.68 92.2

RRT= (old − new)/old × 100%

Table 2

A comparison of prediction accuracies for MI, EM and DA imputations with the standard deviations in parentheses for the iris data set (Replications=500)

(%) MAE MARE RMSE

MI EM DA MI EM DA MI EM DA 10 0.812 0.213 0.210 0.697 0.100 0.099 1.062 0.285 0.280 (0.081) (0.026) (0.026) (0.186) (0.027) (0.027) (0.096) (0.050) (0.050) 20 0.816 0.237 0.233 0.675 0.114 0.113 1.071 0.331 0.326 (0.053) (0.025) (0.025) (0.129) (0.031) (0.031) (0.065) (0.060) (0.060) 30 0.820 0.268 0.259 0.684 0.138 0.132 1.078 0.395 0.380 (0.046) (0.023) (0.022) (0.097) (0.033) (0.032) (0.058) (0.061) (0.060) 40 0.819 0.301 0.278 0.683 0.161 0.154 1.077 0.448 0.428 (0.035) (0.030) (0.026) (0.082) (0.038) (0.036) (0.041) (0.065) (0.063) 50 0.817 0.346 0.325 0.675 0.198 0.188 1.074 0.522 0.495 (0.029) (0.031) (0.028) (0.084) (0.043) (0.041) (0.036) (0.063) (0.060) Table 3

A comparison of prediction accuracies for MI, EM and DA imputations with the standard deviations in parentheses for the crabs data set (Replications=500)

(%) MAE MARE RMSE

MI EM DA MI EM DA MI EM DA 10 4.063 0.421 0.415 0.202 0.024 0.023 5.391 0.611 0.598 (0.337) (0.055) (0.050) (0.018) (0.003) (0.003) (0.427) (0.114) (0.105) 20 4.008 0.484 0.474 0.200 0.027 0.026 5.343 0.714 0.693 (0.227) (0.041) (0.037) (0.012) (0.002) (0.002) (0.305) (0.090) (0.083) 30 4.037 0.568 0.550 0.202 0.030 0.029 5.384 0.846 0.812 (0.169) (0.044) (0.041) (0.009) (0.002) (0.002) (0.225) (0.096) (0.091) 40 4.036 0.662 0.632 0.203 0.035 0.033 5.381 0.977 0.932 (0.138) (0.044) (0.042) (0.007) (0.002) (0.002) (0.188) (0.092) (0.094) 50 4.039 0.768 0.728 0.202 0.039 0.037 5.386 1.120 1.058 (0.108) (0.052) (0.050) (0.006) (0.002) (0.002) (0.142) (0.102) (0.100) 5. Experimental results

For illustration purposes, we start to apply results de-veloped in Sections 2–4 to two famous multivariate data sets. One is the iris data taken from Anderson [23] or Fisher [24]. It consists of four-dimensional measurements in centimeters on the attributes of petal length, petal width, sepal length and sepal width for 50 ﬂower spec-imens of each of three species: setsosa, versicolor, and virginica. The other is the crabs data of Campbell and Mahon [25] on the gensus Leptograpsus. It consists of

Table 4

A comparison of average misclassiﬁcation rates (%) between ML and Bayesian classiﬁers (replicates= 500)

(%) Iris Crabs ML Bayesian ML Bayesian 0 3.33 3.00 7.50 7.30 10 3.85 3.75 9.75 9.50 20 5.20 5.00 13.66 13.55 30 6.90 6.10 19.22 18.80 40 10.15 9.20 26.75 25.20 50 13.42 12.30 35.21 33.00

(7)

50 100 150 200 300 350 400 450 500 Freshwater Marine 300 350 400 450 500 Marine 300 350 400 450 500 Marine 300 350 400 450 500 Marine 50 100 150 200 Freshwater

Bayesian (missing rate = 0%) Bayesian (missing rate = 30%)

50 100 150 200 Freshwater

Fig. 1. ML and Bayesian density estimation for the two-component salmon data (•, both attributes are completely observed; , one of the two attributes is missing).

ﬁve-dimensional morphological measurements on the at-tributes of width of frontal lip, rear width, length along the mid-line of the carapace, maximum width of the carapace and body depth for 50 crabs of each of four groups: blue

male, blue female, orange male and orange female. Both data sets are included as a part of the R package, which is freely available at the web site http://cran.r-project. org.

(8)

To conduct experimental studies, we first generate 500 artificially missing data sets by deleting at random from the three data sets under various specified missing rate (proportion of missing values) while we maintain each da-tum to have at least one observed attribute.Table 1presents the computation times of our developed EM algorithm and those of using GJ-EM. All computations are solely carried out by R package in the environment of a desktop PC (CPU: 3Gb-MHz/Intel Pentium 4 Processor; RAM: 1Gb/DDR-400). Since the programming implementations have many characteristics (e.g., vector or matrix subroutines instead of loops), the CPU times inTable 1might not be directly com-parable, but provide a sense of their actual performances in a practical setting. As seen in the table, all computation times are dramatically reduced over 90% by using the new EM procedure.

To exemplify the predictive performance for the EM and DA imputation methods, see Eqs. (13) and (20), together with the traditional mean imputation (MI) method, known as “ﬁlling-in” with the sample mean of the associated attribute, we utilize the pseudo-cross-validation (PSV) of Stone[26] to evaluate these three approaches. A relative tolerance of 10−8for the log-likelihood function and parameter estimates are used as the convergence criterion for the EM algorithm. As for the DA algorithm, we take the ML estimates as the initialization and carry out 2000 iterations with the ﬁrst 1000 iterations as burn-in and the remaining 1000 iterations as in-ference samples. It is noted that our chosen burn-in number is much larger than needed based on checking the multi-variate potential scale reduction factor (MPSRF) of Brooks and Gelman[27]. As for discrepancy measures, we use the mean absolute error (MAE), the mean absolute relative error (MARE) and root mean square error (RMSE). Comparison results are listed inTables 2 and3. As seen in the tables, we found that both EM and DA substantially outperform MI for all cases. Furthermore, DA imputation exhibits con-siderable promising accuracy in the prediction of missing values when compared to the EM imputation, especially as the size of observed values becomes small (i.e., missing rate increases).

As another illustration, we attempt to explore classifica-tion accuracies between the ML classifier Eq. (12) and the Bayesian classifier Eq. (21) via PSV. Experimental results inTable 4 indicate that both classifiers are comparable at low-level missing, but Bayesian classifier yields lower mis-classification rates as the missing rate increases, though im-provements are not substantial.

Finally, we are interested in comparing behaviors of den-sity estimation from both ML-fitted and Bayesian posterior predictive aspects. To illustrate this, we use the salmon data taken from Johnson and Wichern [28]. This data set has two attributes, the diameter of rings for the first-year fresh-water growth and the diameter of rings for the first-year marine growth (both measured in hundredths of an inch), for each of 50 Alaskan-born and Canadian-born salmon fishes. The ML-fitted density estimation is obtained by

plugging the ML estimates into Eq. (1). As for Bayesian predictive density, it can be approximated by the use of Rao-Blackwellization p(y|Yo_{) =} p(y|Yo_{, )p(|Y}o_{) d} ≈_L1 L =1 p(y|()) =_L1 L =1 _g i=1 w()_i (2)−p/2|()i |−1/2 × exp −1 2 () ij , (22) where ()_{ij =}(y − µ()_i )()_i −1(y − µ()_i ) and () ( = 1, . . . , L) is a set of converged Monte Carlo samples gen-erated from the DA algorithm.

The contour plots obtained by the ML-ﬁtting and Bayesian predictive densities Eq. (22) for both completely observed data ( = 0%) and partially observed data ( = 30%) are de-picted in Fig. 1, respectively. Both look similar when data are not missing but using Eq. (22) seems to have a relatively smoother appearance. In addition, we found that the ML-ﬁtted contour shapes tend to be distorted at high-level miss-ing and even for moderate-level missmiss-ing ( = 30%). How-ever, the distortion rarely happened while using Eq. (22). This indicates that Bayesian learning is more resistant to missing values.

6. Conclusions

In this paper, two novel EM and DA computational algo-rithms for learning normal mixture models under a missing information framework are presented. It should be empha-sized that our proposed procedures offer neat ways to pro-gram with low-cost computation. Experimental results indi-cate that Bayesian treatment is a worthwhile tool for mixture modelling under a considerable extent of missing informa-tion.

Recently, Bayesian and non-Bayesian robust mixture model modelling using the t distribution has received no-table attentions, see Refs.[29–32]. Future work will make some kind of comparisons theoretically or empirically among various competitive models.

Acknowledgments

The authors are grateful to the Editors and one anony-mous referee for their helpful comments and suggestions that substantially improve the presentation of the paper. This work was supported by the National Science Council of Taiwan.

(9)

Appendix A. Proof of Proposition 2

Suppose Y∼ Np(µ, ), then for any q ×p matrix A with

rank q(qp), we can obtain AY ∼ Np(Aµ,AA). With

similar arguments, the marginal distributions of Yo_j and Ym_j are: Yo_j= OjYj ∼ g i=1 wipo j(µ o ij, ooij ), µoij = Ojµi, oo ij = OjiOj, Ym_j = MjYj ∼ g i=1 wi_p−po j(µ m ij , mmij ), µmij = Mjµi, mm ij = MjiMj.

Note that theij in Eq. (2) can be reexpressed as ij = Yo_j − µo_ij Ym_j − µm_ij oo ij omij mo ij mmij ₋₁ Yo_j− µo_ij Ym_j − µm_ij , (23) whereom_{ij =O}jiM_j andmo_{ij =M}jiO_j. Also, the second and third factors on the right hand side of Eq. (23) can be represented by oo ij omij mo ij mmij ₋₁ = I −oo_ij−1om_ij 0 I ⎡ ⎣ oo−1 ij 0 0 mm_ij ·o−1 ⎤ ⎦ × _I ₀ −mo ij oo −1 ij I , and Yo_j− µo_ij Ym_j − µm_ij = Oj(Yj− µi) Mj(Yj− µi) = Oj Mj (Yj− µi). We then have the following standard results:

mm·o ij = mmij − moij oo −1 ij omij = MjiM_j − MjiO_j(OjiO_j)−1OjiM_j = Mj(Ip− iSoo_{ij )}iM_j = EijiM_j, where Eij = Mj(Ip− iSoo_ij), Soo_{ij =}O_j(OjiO_j)−1Oj. Since −mo ij oo −1 ij Oj+ Mj= Mj− MjiOj(OjiO_j)−1Oj = Mj(Ip− iSoo_{ij )} = Eij and I 0 −mo ij oo −1 ij I _O j Mj = Oj Eij ,

it sufﬁces to show that

µm·o ij = µmij + moij oo −1 ij (Yoj− µoij) = Mjµi+ MjiO_j(OjiO_j)−1Oj(Yj− µi) = Mj(µi+ iSoo_{ij (Y}j − µi)). Hence, ij = (Yo j− µo_ij)oo_ij−1(Yoj− µo_ij) + (Ym

j − µm_{ij )}·o _ijmm·o−1(Ymj − µm_{ij )}·o = (Yj− µi)(S_{ij + S}oo mm_ij ·o)(Yj− µi) = o ij + mij ,·o where o ij = (Yjo− µo_ij)_ijoo−1(Yoj− µo_ij) = (Yj− µi)Soo_{ij (Y}j− µi), m·o ij = (Ymj − µm_{ij )}·o mm·o −1 ij (Ymj − µm_{ij )}·o = (Yj− µi)Smm_ij ·o(Yj − µi),

Smm_ij ·o= E_ij(EijiM_j)−1_Eij.

Using the fact that|i| = |oo_{ij ||}mm_ij ·o| and above results, we have f (Ym j|Yoj) = f (Yj) f (Yo j) = g i=1wipo j(Y o j|µo_ij, oo_{ij )}p−po j(Y m j|Yoj, µm_ij·o, mm_ij ·o) g i=1wipo j(Y o j|µo_ij, oo_{ij )} = g i=1 w_ij∗ _p−po j(Y m j|Yoj, µm_{ij ,}·o mm_ij ·o), where w_ij=∗ wipo j(Y o j|µo_ij,oo_{ij )/}g_h=1whpo j(Y o j|µohj,oohj).

Appendix B. Proof of Proposition 3

Letting ˆZ(k)_{ij =}E(Zij|Yo, ˆ(k)), ˆ(k)_{ij =}E(ZijYj|Yo, ˆ(k))

and ˆ(k)_{ij =E(ZijY}jY_j|Yo, ˆ(k)), we can show that ˆZ(k) ij =Pr(Zij = 1|Y o j, ˆ (k) ) = ˆw (k) i po j(Y o j| ˆµo(k)_ij , ˆ oo(k) ij ) g h=1 ˆw(k)h po j(Y o j| ˆµo(k)hj , ˆ oo(k) hj ) ,

ˆ(k)_{ij =}_{Pr(Zij = 1|Y}o_{, ˆ}(k)_)E[Y

j|Zij = 1,Yo, ˆ(k)] = E(Zij|Yo_{, ˆ}(k)_)E(Y

j|Zij = 1,Yo, ˆ(k)) = ˆZ_ij(k)ˆY(k)_{ij ,}

(10)

and

ˆ(k)ij = E(ZijYjY_j|Yo, ˆ(k))

= E(Zij|Yo_{, ˆ}(k)_)E(Y

jY_j|Zij = 1,Yo, ˆ(k))

= ˆZ(k)_ij (ˆY(k)_{ij ˆ}Y(k)_ij + Cov(Yj|Zij = 1,Yo, ˆ(k))) = ˆZ(k)_ij (ˆY(k)_{ij ˆ}Y(k)_ij + (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i ). Since Yj= OjYoj+ MjYjmand OjOj(Ip− iSoo_ij) = 0, we have

ˆY(k)_{ij = E(Y}j|Zij = 1,Yo, ˆ(k))

= E(OjYoj+ MjYmj|Zij = 1,Yo, ˆ (k)

) = OjYoj + MjE(Ymj|Zij = 1,Yo, ˆ

(k) ) = OjYoj + Mj ˆµm_ij·o(k) = OjOjYj+ MjMj( ˆµ(k)_i + ˆ(k)i ˆS oo(k) ij (Yj− ˆµ(k)i )) = OjOjYj+ (Ip− OjOj)( ˆµ(k)i + ˆ(k)i ˆS oo(k) ij (Yj − ˆµ(k)i )) = ˆµ(k)i + ˆ (k) i ˆS oo(k) ij (Yj− ˆµ(k)i ) + OjOj(Ip− ˆ(k)i ˆS oo(k) ij )(Yj− ˆµ(k)i ) = ˆµ(k)_i + ˆ(k)i ˆS oo(k) ij (Yj− ˆµ(k)i ), and

Cov(Yj|Zij = 1,Yo, ˆ(k))

= Cov(OjYoj+ MjYmj|Zij = 1,Yo, ˆ (k)

) = MjCov(Ymj|Zij = 1,Yo, ˆ

(k) )Mj = Mj ˆ mm·o(k) ij Mj = Mj ˆE (k) ij ˆ (k) i MjMj = MjMj(Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i MjMj = (Ip− OjOj)(Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i (Ip− OjOj) = (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i (Ip− OjOj) = (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i . Therefore, (k)_{ij =} ˆ(k)_{ij − ˆ}(k)_{ij µ} i − µiˆ(k)ij + ˆZ_{ij µ}(k) iµi = ˆZ(k)_ij (ˆY(k)_{ij ˆ}Y(k)_ij + (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i ) − 2 ˆZ_ij(k)ˆY(k)_{ij µ} i + ˆZ(k)_{ij µ}iµi = ˆZ(k)_ij ((ˆY(k)_{ij − µ}_i)(ˆY(k)_{ij − µ}_i) + (Ip− ˆ(k)i ˆS oo(k) ij ) ˆ (k) i ). References

[1]D.M. Titterington, A.F.M. Smith, U.E. Markov, Statistical Analysis of Finite Mixture Distributions, Wiley, New York, 1985.

[2]G.J. McLachlan, K.E. Basford, Mixture Models: Inference and Application to Clustering, Marcel Dekker, New York, 1988.

[3]G.J. McLachlan, D. Peel, Finite Mixture Model, Wiley, New York, 2000.

[4]J.L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, London, 1997.

[5]C.H. Liu, Efﬁcient ML estimation of multivariate normal distribution from incomplete data, J. Multivariate. Anal. 69 (1999) 206–217.

[6]Z. Ghahramani, M.I. Jordan, Supervised learning from incomplete data via an EM approach, in: J.D. Cowan, G. Tesarro, J. Alspector (Eds.), Advances in Neural Information Processing Systems, vol. 6, Morgan Kaufmann, San Francisco, 1994, pp. 120–127.

[7]A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Statist. Soc. B. 39 (1977) 1–38.

[8]J. Diebolt, C.P. Robert, Estimation of ﬁnite mixture distributions through Bayesian sampling, J. R. Statist. Soc. B. 56 (1994) 363–375.

[9]M.A. Tanner, W.H. Wong, The calculation of posterior distributions by data augmentation (with discussion), J. Am. Statist. Assoc. 82 (1987) 528–550.

[10]M.D. Escobar, M. West, Bayesian density estimation and inference using mixtures, J. Amer. Statist. Assoc. 90 (1995) 577–588.

[11]S. Richardson, P.J. Green, On Bayesian analysis of mixtures with an unknown number of components, J. R. Statist. Soc. B. 59 (1997) 731–792.

[12]Z.H. Zhang, K.L. Chan, Y.M. Wu, C.B. Chen, Learning a multivariate gaussian mixture model with the reversible jump MCMC algorithm, Statist. Comput. 14 (2004) 343–355.

[13]P.J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (1995) 711–732.

[14]M. Stephens, Bayesian analysis of mixture models with an unknown number of components—an alternative to reversible jump methods, Ann. Statist. 28 (2000) 40–74.

[15]S. Fruhwirth-Schnatter, Markov Chain Monte Carlo estimation of classical and dynamic switching and mixture models, J. Amer. Statist. Assoc. 96 (2001) 194–209.

[16]S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 721–741.

[17]G. Celeux, M. Hurn, C.P. Robert, Computational and inferential difﬁculties with mixture posterior distributions, J. Amer. Statist. Assoc. 95 (2000) 957–970.

[18]D.B. Rubin, Inference and missing data, Biometrika 63 (1976) 581–592.

[19]R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, second ed., Wiley, New York, 2002.

[20]K.E. Basford, G.J. McLachlan, Estimation of allocation rates in a cluster analysis text, J. Amer. Statist. Assoc. 80 (1985) 286–293.

[21]W.H. Edwards, H. Lindman, L.J. Savage, Bayesian statistical inference for psychological research, Psycol. Rev. 70 (1963) 193–242.

[22]A.E. Gelfand, A.F.M. Smith, Sampling based approaches to calculate marginal densities, J. Amer. Statist. Assoc. 85 (1990) 398–409.

[23]E. Anderson, The irises of the Gaspé Peninsula, Bull. Am. Iris Soc. 59 (1935) 2–5.

[24]R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7 (1936) 179–188.

[25]N.A. Campbell, R.J. Mahon, A multivariate study of variation in two species of rock crab of genus Leptograpsus, Aust. J. Zool. 22 (1974) 417–425.

(11)

[26]M. Stone, Cross-validatory choice and assessment of statistical predi-ction (with discussion), J. R. Statist. Soc. B. 36 (1974) 111–147.

[27]S.P. Brooks, A. Gelman, General method s for monitoring convergence of iterative simulations, J. Comput. Graph. Statist. 7 (1998) 434–455.

[28]R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, ﬁfth ed., Prentice-Hall, Englewood Cliffs, NJ, 2002.

[29]D. Peel, G.J. McLachlan, Robust mixture modeling using the t distribution, Statist. Comput. 10 (2000) 339–348.

[30]S. Shoham, Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions, Pattern Recognition 35 (2002) 1127–1142.

[31]T.I. Lin, J.C. Lee, H.F. Ni, Bayesian Analysis of mixture modelling using the multivariate t distribution, Statist. Comput. 14 (2004) 119–130.

[32]H.X. Wang, Q.B. Zhang, B. Luo, S. Wei, Robust mixture modelling using multivariate t distribution with missing information, Pattern Recognition Lett. 25 (2004) 701–710.

About the Author—TSUNG I. LIN received his B.A. in applied mathematics from National Chung Hsing University, Taiwan in 1993, the M.S. in

statistics from National Tsing Hua University, Taiwan in 1997 and Ph.D. in statistics from National Chiao Tung University, Taiwan in 2003. He is at present an assistant professor of National Chung Hsing University. Dr. Lin published papers in statistics and ﬁnance. His recent research includes computational statistics, robust mixture modelling and Bayesian analysis.

About the Author—JACK C. LEE received his B.A. in management from National Taiwan University, Taiwan in 1964, the M.S. in economics from the

University of Rochester in 1969 and Ph.D. in statistics from the State University of New York at Buffalo in 1972. Dr. Lee is a fellow of the American Statistical Association and an elected member of the International Statistical Institute. He is at present a University Chair Professor of National Chiao Tung University. Dr. Lee published over seventy papers in statistics, engineering and ﬁnance. His recent research includes multivariate analysis, speech recognition and ﬁnance.

About the Author—HSIU J. Ho received his B.A. in 2003 and M.S. in 2005, both in statistics from Tunghai University, Taiwan. He is currently a

research assistant under the supervision of Dr. Yi-Hau Chen in the Institute of Statistical Science at Academia Sinica, Taiwan. His research interests include pattern recognition and feature selection.