Supervised learning of multivariate skew normal mixture models with missing information

(1)

DOI 10.1007/s00180-009-0169-5

O R I G I NA L PA P E R

Supervised learning of multivariate skew normal

mixture models with missing information

Tzy-Chy Lin · Tsung-I Lin

Received: 9 November 2008 / Accepted: 26 August 2009 / Published online: 19 September 2009 © Springer-Verlag 2009

Abstract We establish computationally flexible tools for the analysis of multivariate

skew normal mixtures when missing values occur in data. To facilitate the computation and simplify the theoretical derivation, two auxiliary permutation matrices are incor-porated into the model for the determination of observed and missing components of each observation and are manifestly effective in reducing the computational complex-ity. We present an analytically feasible EM algorithm for the supervised learning of parameters as well as missing observations. The proposed mixture analyzer, including the most commonly used Gaussian mixtures as a special case, allows practitioners to handle incomplete multivariate data sets in a wide range of considerations. The meth-odology is illustrated through a real data set with varying proportions of synthetic missing values generated by MCAR and MAR mechanisms and shown to perform well on classification tasks.

Keywords Classifier· EM algorithm · Ignorable · Incomplete data · MSN model ·

Multivariate truncated normal

1 Introduction

Finite mixture models have become a flexible and powerful probabilistic learning tool for heterogeneous multivariate data and been used extensively in classification and

T.-C. Lin

Institute of Statistics, National Chiao Tung University, Hsinchu 301, Taiwan T.-I. Lin (

B

)

Department of Applied Mathematics and Institute of Statistics, National Chung Hsing University, Taichung 402, Taiwan

(2)

clustering. During the last two decades, the usefulness of Gaussian mixture (GMIX) (Pearson 1894) and Student’s t mixture (TMIX) models, seePeel and McLachlan

(2000),Shoham(2002),Shoham et al.(2003) andLin et al.(2004), are being more frequently applied in various fields such as pattern recognition, data mining, computer vision, signal and image processing, machine learning and bioinformatics, etc. For a comprehensive introduction to mixture models and their applications, see monographs byTitterington et al.(1985),McLachlan and Basford(1988),McLachlan and Peel

(2000),Frühwirth-Schnatter(2006), and the references therein. Recently, mixtures of univariate skew normal and skew t distributions as natural extensions of univariate Gaussian mixtures have been considered byLin et al.(2007a,b).

In practice data may exhibit highly asymmetric observations and thus statistical inferences drawn from the ordinary Gaussian assumptions may yield unreliable results. To reduce unduly skewness encountered in general practice, one commonly adopted approach is through the best known data-based power transformation proposed byBox and Cox(1964). Although such a treatment is very convenient to use, the achieve-ment of joint normality is rarely satisfied and the transformed variables become more difficult to interpret. Instead of applying transformations, there has been a growing interest in proposing a wider class of distributions, namely, the multivariate skew nor-mal (MSN) distribution, which contains an extra vector of parameters in regulating skewness and includes the Gaussian family as a special case.

The MSN distribution was originally studied byAzzalini and Dalla Valle(1996) and some further attractive features and applications are given inAzzalini and Capitanio

(1999). Based on this class of distributions, a number of extensions or alternative proposals have appeared during the last decade.Arellano-Vallea and Genton(2005) studied the family of fundamental skew normal (FUSN) distributions, giving a uni-fied scheme to obtain MSN distributions starting from symmetric ones. Subsequently,

Arellano-Valle and Azzalini(2006) provided a survey on some of its extensions and variants.Sahu et al.(2003) defined a new class of MSN distributions and remarked that this sort of formulation is more flexible in terms of adjusting the correlation structure than the MSN ofAzzalini and Dalla Valle(1996). Recently,Lin(2009a) introduced a new mixture modeling framework with component densities using the MSN distri-bution ofSahu et al.(2003) and showed its great flexibility in modeling asymmetrical data.

Learning mixture models from incomplete data has become a powerful tool to han-dle real-world multivariate data sets with complex missing patterns. The work was pioneered byGhahramani and Jordan(1994), who applied the expectation maximi-zation (EM) algorithm (Dempster et al. 1977) to compute maximum likelihood (ML) estimates of the GMIX model with arbitrary patterns of missingness.Lin et al.(2006) extended their approach by introducing some efficient learning strategies from both ML and Bayesian perspectives.Wang et al.(2004) presented an ordinary EM algo-rithm for ML estimation of TMIX models with missing information. Related work on using the parameter expanded Expectation Maximization (PX-EM) algorithm (Liu et al. 1998) for the supervised learning of TMIX models with incomplete data was done byLin et al.(2009).

In this paper, we consider the learning of multivariate skew normal mixture (MSN-MIX) models when missing values may occur in the data. The probability

(3)

distri-bution of a p-dimensional random vector X is MSN with location vectorξ ∈ Rp, scale covariance matrix (a p × p positive definite matrix) and skewness matrix (a p-dimensional diagonal matrix), denoted as X ∼ SNp(ξ, , ). The probability density function (pdf) of this distribution is

ψp(X | ξ, , ) = 2pφp(X | ξ, )p(−1(X − ξ)|), (1)

where = + 2, = (Ip+ −1)−1= Ip− −1 (Ipis a p-dimensional identity matrix) and the notationsφp( · | µ, ) and p( · | ), respectively, stand for the pdf of Np(µ, ) and cumulative density function (cdf) of Np(0, ).

According to Proposition 1 ofArellano-Valle et al.(2007), the MSN distribution has the stochastic representation

X= ξ + | ζ1| + 1/2ζ2,

whereζ1andζ2are two independent Np(0, Ip) random vectors. Let γ = | ζ1| =

(|ζ11|, . . . , |ζ1 p|). Thenγ represents a vector consisting of p independent standard

half-normal random variables. A further hierarchical representation of MSN can be written as

X | γ ∼ Np(ξ + γ , ),

(2)

γ ∼ T Np(0, Ip; R₊p),

whereR₊p denotes the Euclidean vector space of all p-tuples of positive real numbers and T Np(µ, ; A) denotes a p-variate truncated normal distribution for Np(µ, ) lying within the hyperplaneA.

In what follows we assume that the missing data are missing at random (MAR) with an ignorable mechanism (cf.Rubin 1976;Schafer 1997;Little and Rubin 2002). In this case, the missingness is unrelated to the missing values and likelihood infer-ence can ignore the missing data mechanism. For computational aspects, we offer an analytically tractable EM algorithm coupled with some useful model-based tools to handle data with general missing patterns in the class of MSNMIX model. Note that the proposed strategy is also valid if mechanism is missing completely at random (MCAR), which is a special case of MAR. To reduce complications during the EM procedure, we introduce two permutation matrices for indexing the observed and miss-ing components of each datum. Under this model, we also offer a conditional predictor to retrieve the missing components and a classifier for allocating partially observed vectors.

The outline of the paper is as follows. In Sect.2, we describe the model, establish the notation, and study some important statistical properties of the model. In Sect.3, a computationally feasible EM algorithm is used to compute the ML estimates from incomplete data. Statistical principles regarding classification and prediction of incom-plete features are also investigated. In Sect.4, the proposed methodologies are applied to a real data set with varying proportions of synthetic missing values. Some conclud-ing remarks are given in Sect.5, and the technical derivation is sketched in Appendix.

(4)

2 Skew normal mixtures with missing information

In the MSNMIX model, we let Y = (Y1, . . . , Yn) be a set of p-dimensional random samples arising from a population with g subclassesC1, . . . , Cg. That is, each Yjhas the density f(Yj|) = g i=1 wiψp(Yj | ξi, i, i), wi ≥ 0, g i=1 ωi = 1, (3)

wherei = Diag(λi) with λi = (λi 1, . . . , λi p)and the unknown parameter vector

 contains the mixing probabilities wi(i = 1, . . . , g −1), the elements of component locationsξi’s, the distinct elements of component scale covariance matricesi’s and the skewness vectorsλi’s. Note that the notationψp(· | ξi, i, i) is the MSN density defined in (1) and Diag(·) denotes a diagonal matrix created by extracting the main diagonal elements from a square matrix or the diagonalization of a vector. The mean and covariance of Yj are given by

E(Yj) = g i=1 wiµi, Cov(Yj) = g i₌₁ wi(1 − wi) µiµi + wiϒi − g i_{= j} wiwjµiµj, whereµi = ξi + √

2/πλi andϒi = i + (1 − 2/π)2i are the mean vector and

covariance matrix of S Np(ξi, i, i), respectively.

To pose model (3) into an EM framework, we introduce allocation variables Zj =

(Z1 j, . . . , Zg j), one for each individual Yj, whose role is to encode which

com-ponent has generated Yj. Specifically, the indicators Zj ( j = 1, . . . , n) are a g × 1 vector of binary variables, whose elements are

Zs j =

1 if Yj belongs to group s, 0 otherwise,

subject tog_i₌₁Zi j = 1. This implies Zj follows a multinomial random vector with 1 trial and cell probabilitiesw1, . . . , wg, denoted by Zj ∼ M(1; w1, . . . , wg).

A three-level hierarchical representation of (3) can be expressed by

Yj | (γj, Zi j = 1) ∼ Np(ξi+ iγj, i),

γj | (Zi j = 1) ∼ T Np(0, Ip; R₊p), (4)

Zj ∼ M(1; w1, . . . , wg),

for i = 1, . . . , g and j = 1, . . . , n. From (4), we declare the complete data vector to be (Y, Z, γ ), where Y = (Y₁, . . . , Yn), Z = (Z1, . . . , Zn) and

(5)

γ = (γ

1, . . . , γn). From (4), the likelihood function for  based on complete

data is Lc( | Y, Z, γ ) ∝ g i₌₁ n j₌₁ wiφp(Yj | ξi+ iγj, i) Zi j _. (5)

We are interested in ML estimation of model (3) when Y may be partially observed. The underlying missingness mechanism is assumed to be MAR.

FollowingLin et al.(2006), we partition Yj into two components(Yo_j, Ym_j), where Yo_j (po_j× 1) and Ym_j ((p − po_j) × 1) denote the observed and missing compo-nents of Yj, respectively. To facilitate the computation, we introduce two types of the binary indicator matrices, denoted by Oj (po_j× p) and Mj((p − po_j) × p) such that Yo_j = OjYj and Ym_j = MjYj, which can be extracted from a p-dimensional iden-tity matrix Ipcorresponding to row positions of Yo_j and Ym_j in Yj, respectively. It is straightforward to verify that (a) Yj = OjYoj+ MjYmj; (b) Oj Oj+ Mj Mj = Ip. Furthermore, we can establish the following results.

Theorem 1 Let Yj ∼

g

i₌₁wiψp(Yj | ξi, i, i), and let Yoj and Ymj be the

observed and the missing components corresponding to Yj, respectively. We have (a) The marginal density of Yo_j is _ig₌₁wiψpo_j(Yoj|ξ

o i j, ooi j, oo i j), where ξ o i j = Ojξi,ooi j = OjiOj andooi j = OjiOj.

(b) The conditional density of Ym_j given Yo_jis

f(Ymj | Yoj) = 2p g i=1 ˜wi jφp−po_j Ymj | ξmi j·o, mmi j ·o p i−1i (Yj − ξi) | i , where ˜wi j = wiφpo_j Yoj | ξoi j, ooi j / g h=1 whψpo_j Yoj | ξoh j, ooh j, ooh j , ξm· o i j = Mj ξi+ iCooi j(Yj− ξi) , mm·o i j = Mj(Ip− iCooi j)iMj, withoo_{i j} = OjiOj and Cooi j = Ojoo −1 i j Oj.

Theorem 2 Given (4), we have the following conditional distributions:

(a) The conditional distribution of Yo_j givenγj and Zi j = 1 is

Yo_j | (γj, Zi j = 1) ∼ Npo j µo i j, oo i j , whereµ_{i j}o = Oj(ξi+ iγj) and ooi j = OjiOj.

(6)

(b) The conditional distribution of Ym_j given Yo_j,γj, and Zi j = 1 is Ym_j | (Yo_j, γj, Zi j = 1) ∼ Np−po j(µ m_·o i j , mm_·o i j ), where µm·o i j = Mj(ξi + iγj+ iSooi j(Yj− ξi− iγj)), mm·o i j = Mj(Ip− iSooi j)iMj, and Soo_{i j} = O_j(OjiO_j)−1Oj.

(c) The conditional distribution ofγj given Yoj and Zi j = 1 is

γj | (Yoj, Zi j = 1) ∼ T Np iCooi j(Yj − ξi), Ip− iCi jooi; R₊p . Let E γj|Yoj, Zi j = 1 = ηi jand E γjγj|Yoj, Zi j = 1 = i j. Both of which are implicit functions of parametersξi,iandiand can easily be evaluated by using formulas (10) and (11) inLin(2009a). The following corollary is a direct implication of Theorem2.

Corollary 1 Recall that Yj = O_jYo_j+ M_jYm_j and O_j Oj+ M_j Mj = Ip. These

give rise to O_j Oj(Ip− iSoo_{i j}) = 0. Then we can obtain

(a) E(Yj | Yo_j, Zi j = 1) = ξi + iηi j + iSooi j(Yj − ξi − iηi j). (b) Cov(Yj|Yoj, Zi j=1)=(Ip−iSooi j) i+i(i j− ηi jηi j)i(Ip− Sooi ji) . (c) E(Yjγ_j | Yo_j, Zi j = 1) = (Ip− iSooi j)(ξiηi j+ ii j) + iS oo i jYjη_{i j}. For a p-dimensional observation with the probability of missingness greater than or equal to zero for each attribute, there are 2p_{− 1 unique patterns of missingness.} In general, completely missing pattern does not occur. This indicates that the missing rate should be smaller than(p − 1)/p. To lessen the computational load,Lin et al.

(2006) have described a simple and feasible procedure by rearranging Y according to unique missing patterns of data.

3 ML estimation via the EM algorithm

The EM algorithm ofDempster et al.(1977) has been widely used in literature to carry out ML estimation in a variety of incomplete data problems. We offer an efficient EM algorithm for learning model (3) from incomplete data. From (5), the log-likelihood function of based on complete data, aside from additive constant terms, can be written by c( | Yo, Ym, Z, γ ) = g i=1 n j=1 Zi jlogwi+ 1 2 g i=1 ⎛ ⎝log |−1 i | ⎛ ⎝n j=1 Zi j ⎞ ⎠ − tr ⎛ ⎝−1 i n j=1 Hi j ⎞ ⎠ ⎞ ⎠, (6)

(7)

where

Hi j = Zi j(Yj− ξi− iγj)(Yj − ξi − iγj). (7) At the kth iteration of the E-step, we need to calculate the Q-function, defined as

Q( | ˆ(k)) = E

c( | Yo, Ym, Z, γ ) | Yo, ˆ(k)

which is the conditional expectation of (6) with respect to the distribution of(Ym, Z, γ ) given the observed data Yoand the current estimate ˆ(k).

Let ˆZ_{i j}(k)denote the posterior probability that Yjarises from the i th component. By Bayes’ rule, at the kth iteration, we have

ˆZ(k) i j = Pr(Zi j = 1 | Yoj, ˆ (k) ) = ˆw (k) i ψpo j Yo_j | ˆξo_{i j}(k), ˆoo_{i j}(k), ˆoo_{i j}(k) g h=1 ˆw(k)h ψpo j Yo_j | ˆξoh j(k), ˆ oo_(k) h j , ˆ oo(k) h j , (8) where ˆξo_{i j}(k) = Ojˆξ(k)i , ˆ oo(k) i j = Oj ˆ(k)i Oj and ˆ oo(k) i j = Ojˆ(k)i Oj for i =

1, . . . , g and j = 1, . . . , n. Furthermore, it can be shown that the expected value of

(7) conditional on Yo_jand current estimates ˆ(k)is

ˆH(k)_{i j} _(ξ_i_,i) = ˆZi j(k) Ip− ˆ(k)i ˆS oo(k) i j ˆ(k)i + ˆY(k)_{i j} _{− ξ}_i ₋iˆη(k)_{i j} ˆY(k)_{i j} _{− ξ}_i₋iˆη(k)_{i j} + Â(k)i j − i ˆ(k)i j − ˆη(k)i j ˆη(k) i j Â(k) i j − i , (9) where ˆSooi j(k)= Oj Ojˆ(k)i Oj ₋₁ Oj, Â_{i j}(k)= Ip− î(k)ˆS oo(k) i j ˆ(k)i and ˆY(k)_{i j} = E Yj | Zi j = 1, Yo, ˆ (k) = ˆ(k)i ˆS oo(k) i j Yj + Ip− ˆ(k)i ˆS oo(k) i j ˆξ(k)i + ˆ (k) i ˆη(k)i j . (10)

Hence, the Q-function can be written by

Q( | ˆ(k)) = g i=1 n j=1 ˆZ(k) i j logwi +1 2 g i=1 ⎛ ⎝log |−1 i | ⎛ ⎝n j=1 ˆZ(k) i j ⎞ ⎠ − tr ⎛ ⎝−1 i n j=1 ˆH(k)_{i j} (ξi, i) ⎞ ⎠ ⎞ ⎠ . (11)

(8)

In summary, the EM algorithm can be implemented as follows:

E-step: Given  = ˆ(k), compute ˆZ(k)_{i j} , ˆH(k)_{i j} and ˆY(k)_{i j} for i = 1, . . . , g and

j= 1, . . . , n, by using Eqs.8–10, respectively.

M-step:

1. Update ˆw_i(k)by maximizing (11) overwi subject to their sum being unity, which gives ˆw(k+1)_i = n−1n j=1 ˆZ(k) i j . 2. Update ˆξ(k)_i by ˆξi(k+1)= ⎛ ⎝n j=1 ˆZ(k) i j ⎞ ⎠ −1⎛ ⎝n j=1 ˆZ(k) i j ˆY (k) i j − ˆ (k) i n j=1 ˆZ(k) i j ˆηi j(k) ⎞ ⎠ . 3. Update ˆ(k)_i by ˆi(k+1)= n j=1 ˆH (k) i j (ˆξ (k+1) , ˆ(k)) n j₌₁ ˆZ(k)i j . 4. Update ˆ(k)_i by ˆ(k+1)i = Diag ⎧ ⎪ ⎨ ⎪ ⎩ ⎛ ⎝ ˆi(k+1)−1 n j=1 ˆZ(k) i j ˆ (k) i j ⎞ ⎠ −1⎛ ⎝ ˆ(k+1)i −1 n j=1 ˆZ(k) i j ˆC(k)i j ⎞ ⎠ 1p ⎫ ⎪ ⎬ ⎪ ⎭, where ˆC(k)_{i j} = ˆ(k)i j − ˆη(k)i j ˆη(k) i j ˆA(k) i j + ˆη(k)i j ˆY(k)_{i j} − ˆξ(k+1)i , 1pdenotes a p-dimensional vector of ones and the operator ‘’ represents the elementwise product of two matrices with the same dimension.

Since the stability and monotone convergence of EM are maintained, the itera-tions are repeated until a suitable convergence rule is satisfied, e.g., ˆ(k+1)− ˆ(k) is sufficiently small. When the convergence is achieved, the resulting estimates are denoted by ˆ = ( ˆw1, . . . , ˆwg−1, ˆξ1, . . . , ˆξg, ˆ1, . . . , ˆg, ˆ1, . . . , ˆg). Therefore, the posterior probability of the Yj belonging to group i can be estimated by

ˆw∗i j = P(Zi j = 1|Yo, ˆ) = ˆwiψpo_j Yo_j | ˆξoi j, ˆ oo i j, ˆ oo i j g i=1 ˆwiψpo_j Yo_j | ˆξo_{i j}, ˆoo_{i j}, ˆoo_{i j}. (12)

(9)

According to the ML classification theory ofBasford and McLachlan(1985), Yj is assigned to group s if ˆw∗_{s j}> ˆw_{i j}∗ for i = 1, . . . , g and i = s. Consequently, the ML predictor for the missing component Ym_j is given by

ˆYm j = E Ymj|Yoj, ˆ = Mj g i=1 ˆwi j∗ ˆξi+ îˆηi j+ îˆS oo i j(Yj − ˆξi− îˆηi j) . (13) 4 Experimental results

For illustration purposes, we apply the techniques presented so far to a subset of the Australian Institute of Sport (AIS) data, including 13 physical attributes measured on 102 male and 100 female athletes, which are treated as two intrinsic classes. The data were originally reported byCook and Weisberg(1994) and have already been ana-lyzed byAzzalini and Dalla Valle(1996),Azzalini and Capitanio(1999) andAzzalini

(2005), among others. They pointed out the AIS data are better suited to the MSN dis-tribution than Gaussian, but neglected the situation where patterns of multimodality occur. In this example, we select three attributes: X1: body mass index (BMI), X2: the percentage of body fat (Bfat) and X3: lean body mass (LBM). Detailed explanations of these variables can be found athttp://en.wikipedia.org/wiki/Search.

Figure1depicts pairwise bivariate scatter plots of the data with superimposed con-tours of the fitted 2-component MSNMIX distribution. It can be observed from the figure that the scatter plots and fitted densities reveal a pattern of asymmetric bimo-dality for each pair of attributes except that the bimobimo-dality of BMI versus LBM is not apparent because they are highly correlated. The Pearson’s correlation coefficient between these two attributes is 0.71.

To conduct experimental studies, synthetic missing values are generated according to both MCAR and MAR mechanisms, as shown inZio et al.(2007). In the MCAR experiment, missing items are obtained by deleting at random r % of the experimental data where each datum retains at least one observed attribute. The missing rates of the synthetic data range from 0.1 up to 0.4 by increments of 0.1. In the MAR case, missing items are only drawn form the attributes(X1, X2) depending on the observed values of X3under the assumption that the higher the value of X3the lower is the nonresponse propensity. Let qi be the i th quartile of the empirical distribution of X3and xi be the observed value of Xi. The nonresponse probabilities for(X1, X2) are 0.25 if x3< q1, 0.2 if x3∈ [q1, q2), 0.15 if x3∈ [q2, q3) and 0.1 if x3≥ q3. Data were simulated with a total of 500 and 100 replications under the MCAR and MAR settings, respectively. A relative difference of 10−5in successive values of the log-likelihood is used as a stopping guideline for the EM algorithm.

We fit a MSNMIX model with density (1) to 500 synthetic MCAR data sets for

g = 1 and g = 2. Here the number of components g = 1 corresponds to the MSN

model (a special case of MSNMIX model with a single component) ofSahu et al.

(10)

BMI 10 20 30 40 70 100 Bfat 20 25 30 10 20 30 LBM

Fig. 1 AIS data: bivariate scatter plots and fitted 2-component MSNMIX contours. (Plus sign female,

f illed cir cle male, B M I body mass index, Bfat the percentage of body fat, LBM lean body mass)

MSNMIX model as written below

f(Yj|) = wf (Yj|ξ1, 1, 1) + (1 − w) f (Yj|ξ2, 2, 2) ( j = 1, . . . , 202), where ξi= ⎡ ⎣ξ_ξi 1_{i 2} ξi 3 ⎤ ⎦ , i= ⎡ ⎣σ_σi_i,11_,12 _σσi_i,12_,22 _σσ_ii,13_,23 σi_,13 σi_,23 σi_,33 ⎤ ⎦ i= ⎡ ⎣λi0,11 λi0,22 00 0 0 λi_,33 ⎤ ⎦ (i = 1, 2).

For comparison, we test the null hypothesis H0: g = 1 (MSN) versus the alterna-tive hypothesis H1: g = 2 (MSNMIX). The numbers of free parameters under H0and H1are 12 and 25, respectively. The likelihood ratio test (LRT) statistic, given by the difference in values of−2 times the log-likelihood between two test models, is used to judge which of the two models is more suitable for this data set. Figure2displays the histograms of converged log-likelihood values of the null and the alternative models along with a summarized box plot for their LRT statistics. It is readily seen that the LRT statistics are highly significant compared with theχ₁₃2 distribution for all cases.

(11)

(a) r=10% g=1 g=2 20 40 60 80 100 120 140 20 40 60 80 100 1 20 140 20 40 60 80 100 120 140 20 40 60 80 100 1 20 140 20 0 4 0 6 0 8 0 100 120 140 20 0 4 0 6 0 8 0 1 00 120 20 0 4 0 6 0 8 0 100 120 20 0 40 60 80 100 −1660 −1640 −1620 −1600 −1580 −1560 −1500 −1480 −1460 −1440 −1420 −1400 (b) r=20% g=1 g=2 −1340 −1320 −1300 −1280 −1260 −1240 (c) r=30% g=1 g=2 −1160 −1140 −1120 −1100 −1008 −1060 (d) r=40% g=1 g=2

Fig. 2 A comparison of converged log-likelihood values of the null(g = 1) and the alternative (g = 2)

models and their LRT statistics, where dotted li ne (=χ₁₃2(0.99) = 27.69 and dashed line (= χ₁₃2(0.95) = 22.36, for various proportions of missing values. (Replications=500)

To exemplify the predictive accuracies on the imputation of missing values, we compare the MSN and MSNMIX predictors; see Eq.13, together with the traditional randomization-based mean imputation (MI) predictor, known as a common heuristic by filling in a single value for each missing value with the observed sample mean of the associated attribute. As a measure of precision, the mean absolute error (MAE) and the mean absolute relative error (MARE) are used to evaluate the prediction discrepancy. They are defined as

MAE= 1 m n i=1 p j=1

|yi j− ˆyi j| and MARE =

1 m n i=1 p j=1 yi j− ˆyi j yi j ,

where m is the number of missing entries, yi jis the actual value andˆyi jis the respective predictive value.

Comparison results for both MCAR and MAR scenarios are listed in Table1. The relative improvement percentage (RIP) in Table1is defined as the percentage decrease in the relative prediction error when comparing MSN and MSNMIX predictors. In this

(12)

Ta b le 1 A comparison o f av eraged p rediction accuracies and the associated standard de viations in parentheses for three imputation m ethods with v arying p ro portions of missing v alues Mechanism M issing rate (%) M AE MARE MI MSN M SNMIX RIP (%) MI MSN M SNMIX RIP (%) MCAR 10 6.114 (0.733) 3.476 (0.495) 3.326 (0.499) 4. 32 0.243 (0.032) 0.154 (0.024) 0.141 (0.023) 8. 44 20 6.124 (0.493) 3.777 (0.321) 3.637 (0.337) 3. 71 0.246 (0.024) 0.166 (0.032) 0.155 (0.018) 6. 63 30 6.115 (0.380) 4.096 (0.300) 3.967 (0.319) 3. 15 0.244 (0.018) 0.177 (0.015) 0.167 (0.015) 5. 65 40 6.115 (0.302) 4.382 (0.284) 4.275 (0.307) 2. 44 0.245 (0.017) 0.190 (0.014) 0.182 (0.015) 4. 21 MAR 3 .845 (0.320) 2.572 (0.252) 2.436 (0.260) 5. 29 0.270 (0.029) 0.191 (0.029) 0.178 (0.031) 6. 87 The relati v e impro v ement percentage (RIP) is measured by [(MSN-MSNMIX)/MSN] × 100% MI mean imputation, MA E the m ean absolute error , MA R E the m ean absolute relati v e error

(13)

Table 2 A comparison of average misclassification rates (%) between GMIX and MSNMIX models with

standard deviations in parentheses

Mechanism Missing Trivariate case Bivariate case rate (%)

GMIX MSNMIX RIP (%) GMIX MSNMIX RIP (%)

MCAR 10 5.35 (0.015) 4.77 (0.010) 10.8 27.36 (0.129) 10.93 (0.022) 60.1 20 6.99 (0.024) 6.24 (0.014) 12.0 31.71 (0.136) 14.60 (0.028) 54.0 30 9.84 (0.034) 8.61 (0.018) 12.5 34.75 (0.123) 18.68 (0.036) 46.2 40 13.11 (0.037) 11.63 (0.023) 11.3 36.20 (0.105) 20.52 (0.020) 43.3 MAR 5.33 (0.009) 4.85 (0.009) 9.0 20.55 (0.120) 7.91 (0.013) 61.5 The relative improvement percentage (RIP) is measured by [(GMIX-MSNMIX)/GMIX]×100% N ote: The misclassification rates based on the MSNMIX classifier are all significantly lower than those of the GMIX classifier as measured by the Wilcoxon rank-sum test

study, we found that both model-based predictors substantially outperform MI for all cases. Furthermore, the MSNMIX predictor exhibits considerable promising accuracy in the prediction of missing values when compared with those of MSN imputations over a wide range of missing rates.

As another illustration, we compare the supervised learning of classification accu-racies between the GMIX and MSNMIX classifiers; see Eq. 12. Comparisons are made on the trivariate data and a bivariate sample on attributes BMI and LBM. Ta-ble 2 shows the average misclassification rates from these models. As seen in the table, the misclassification rates of the MSNMIX classifier are all smaller than those of the GMIX classifier, especially for the bivariate sample with RIPs ranging be-tween 43.3 and 61.5%. Alternatively, the Wilcoxon rank sum procedure (Hollander and Wolfe 1999) can be performed to test whether misclassification rate of MSN-MIX classifier, PS, is significantly lower than that of the GMSN-MIX classifier, PG. In other words, the null hypothesis is H0 : PS ≤ PG. If this p-value is less than 0.05, we can reject the above null hypothesis at the 5% significance level. Once again, the MSNMIX classifier gave highly significant reduction in misclassification rate (all p-values are far less than 0.05) as measured by the Wilcoxon rank sum test. These observations signify the MSNMIX model provides a sound basis for the classification of partially observed features.

5 Concluding remarks

We establish some properties related to the MSNMIX model in a missing information framework. The proposed model is very flexible in dealing with heterogeneous data that involve strong skewness and is robust to the presence of missing observations. We discuss in detail how the EM algorithm coupled with auxiliary matrices can be applied on learning models from incomplete data in an efficient manner. Experimen-tal results indicate that the MSNMIX model performs well for imputations as well as classification when asymmetric multimodality and missing outcomes simultaneously occur in the input data.

(14)

There are a number of possible extensions of the current work. We highlight that, with the growing advances of modern stochastic computing technology and inexpen-sive high-speed computer power, it is worthwhile to pursue a fully Bayesian treatment (e.g.,Hastings 1970;Tanner and Wong 1987;Diebolt and Robert 1994;Escobar and West 1995) in this context for enriching up-to-date account of the theory and applica-bility. Furthermore, it is also of interest to generalize all existing approaches to learning multivariate skew t mixture models (Lin 2009b;Pyne et al. 2009) from incomplete data.

Acknowledgments The authors would like to express his deepest gratitude to the Chief Editor, the Asso-ciate Editor and two anonymous referees for their valuable comments and suggestions that greatly improved this paper. This research was supported by the National Science Council of Taiwan (Grant No. NSC97-2118-M-005-001-MY2).

Appendix A: Proof of Theorem1

(a) Let Yj,ξi,i andi be partitioned as

Yj = Yo_j Ym_j = OjYj MjYj , ξi = ξo i j ξm i j = Ojξi Mjξi , i = oo i j omi j mo i j mmi j = OjiOj OjiMj MjiO_j MjiM_j , and i = oo i j 0po j×pmj 0pm j×poj mm i j = OjiOj 0po j×pmj 0pm_j×po_j MjiM_j . Thus, we obtain i = i + 2i = Oj i+ 2i O_j OjiM_j MjiOj Mj i+ 2i M_j = oo i j om i j mo i j mm i j .

Note thati andi are symmetric matrices. It follows thati,ooi j andmmi j are symmetric matrices andom_{i j} = mo_{i j} . Furthermore,

i−1_i = ⎡ ⎣ooi j oo−1 i j omi j mm·o −1 i j moi j oo −1 i j + oo −1 i j − oo i joo −1 i j omi j mm·o −1 i j −mm i j mm·o −1 i j moi j oo −1 i j mmi j mm·o −1 i j ⎤ ⎦ =Bi 1 Bi 2 ,

wheremm_{i j} ·o= mm_{i j} − mo_{i j} oo_{i j}−1om_{i j} . That is,

Bi 1= ⎡ ⎣ ooi j oo−1 i j om i j mm_·o−1 i j mo i j + Ipo_j −mm i j mm·o −1 i j moi j ⎤ ⎦ oo−1 i j

(15)

and Bi 2= −oo i joo −1 i j omi j mm i j mm_·o−1 i j .

It suffices to show that

Bi 1+ Bi 2moi j oo −1 i j = oo i j 0pm j×poj oo−1 i j .

Moreover, we have the following results:

i−1i i = Bi 1 Bi 2 oo i j 0 0 mm_{i j} =Bi 1ooi j Bi 2mmi j = oo i j oo−1 i j omi j mm·o −1 i j moi j + Ipo j oo−1 i j ooi j −ooi joo −1 i j omi j mm·o −1 i j mmi j −mm i j mm·o −1 i j moi j oo −1 i j ooi j mmi j mm·o −1 i j mmi j (A.1) and Bi 2mmi j ·oBi 2 = oo i joo −1 i j omi j mm·o −1 i j moi j oo −1 i j ooi j −ooi joo −1 i j omi j mm·o −1 i j mmi j −mm i j mm·o−1 i j mo i j oo−1 i j oo i j mm i j mm·o−1 i j mm i j . (A.2) Since

i + Bi 2mmi j ·oBi 2= Ip− i−1i i + Bi 2mmi j ·oBi 2, (A.3) substituting (A.1) and (A.2) into (A.3) leads to

i+ Bi 2mmi j ·oBi 2= Ipo_j − ooi joo −1 i j ooi j 0po_j×pm_j 0pm j×poj Ipmj . Thus, we have f(Yo_j, Ym_j|Zi j = 1, ) = 2pφpo j Yo_j|ξo_{i j}, oo_{i j} φp_−po j Ym_j|ξm_{i j}· o, mm_{i j} ·o ×p Bi 1 Yo_j − ξo_{i j} + Bi 2 Ym_j − ξm_{i j} | i , whereξm_{i j}· o= ξm_{i j}+ _{i j}mooo_{i j}−1 Yo_j− ξo_{i j} andmm_{i j} ·o= mm_{i j} − mo_{i j} oo_{i j}−1om_{i j} .

(16)

The marginal density of Yo_j is given by f(Yoj|Zi j = 1, ) = 2pφpo_j Yoj|ξoi j, ooi j ! φpm_j Ymj|ξmi j· o, mmi j ·o ×p Bi 1 Yoj− ξoi j + Bi 2 Ymj − ξmi j | i dYmj. Let z= Ym_j −ξm_{i j}· o, then Ym_j = z +ξ_{i j}m· o= z +ξm_{i j}+mo_{i j} oo_{i j}−1 Yo_j− ξo_{i j} . It can be shown thatφpm_j Ym_j|ξmi j· o, mmi j ·o = φpm_j z|0, mm_{i j} ·o andp Bi 1 Yo_j − ξi jo + Bi 2 Ym_j − ξm_{i j} | i = p Bi 1+ Bi 2moi j oo −1 i j Yo_j − ξo_{i j} + Bi 2z| i . By Lemma 2.1 ofArellano-Vallea and Genton(2005), we have

f(Yoj | Zi j = 1, ) = 2p_φ po_j Yo_j|ξo_{i j}, oo_{i j} × ! φpm_j z|0, mm_{i j} ·o p Bi 1+ Bi 2moi j oo −1 i j Yo_j− ξo_{i j} + Bi 2z| i dz = 2po jφ po_j Yo_j|ξo_{i j}, oo_{i j} po_j oo i joo −1 i j Yo_j− ξo_{i j} | Ipo_j − ooi joo −1 i j ooi j . Thus, Yo_j | (Zi j = 1, ) ∼ SNpo j ξo i j, ooi j, i joo . It implies that f(Yo_j | ) = g i=1 f(Yo_j | Zi j = 1, )p(Zi j = 1)= g i=1 wiψpo j Yo_j | ξo_{i j}, oo_{i j}, oo_{i j} . (b) By virtue ofφp(Yj|ξi, i) = φpo_j(Yoj|ξ o i j, ooi j)φp−po_j(Ymj|ξ m·o

i j , mmi j ·o), see

The-orem 2.5.1 ofAnderson(2003), we can deduce that

f(Ym_j | Yo_j) = 2p g i=1 ˜wi jφp−po_j(Ymj | ξmi j·o, i jmm·o)p(i−1_i (Yj − ξi) | i). Appendix B

The proofs of part (a) and part (b) are straightforward and hence are omitted. We only show the proof of part (c). From (4), we have the following densities

f(Yo_j|γj, Zi j = 1, ) ∝ |ooi j|−1/2exp −1 2 Yo_j− µo_{i j} oo−1 i j (Y o j− µ o i j) "

(17)

and f(γj|Zi j = 1) ∝ exp −1 2γ jγj " I_Rp +(γj). Multiplying f Yo_j|γ_j, Zi j = 1,  by f γ_j|Zi j = 1 gives f(Yo_j, γj|Zi j = 1, ) ∝ |oo i j|−1/2exp −1 2 Yo_j− µo_{i j} oo−1 i j Yo_j − µo_{i j} + γjγj " I_Rp +(γj) ∝ exp −1 2 γj− −1i (Yj− ξi) iSooi ji γj− −1i (Yj− ξi) " ×|oo i j|−1/2exp −1 2γ jγj " I_Rp +(γj) ∝ exp −1 2 γj− iCooi j(Yj − ξi) (Ip− iCooi ji)−1 γj− iCooi j(Yj− ξi) " ×|oo i j|−1/2exp −1 2(Yj− ξi) _Coo i j(Yj− ξi) " I_Rp +(γj),

where the last two equalities follow from

(Yo j− µ o i j) oo−1 i j Yo_j− µ_{i j}o + γjγj = (OjYj − Oj(ξi+ iγj))oo −1 i j (OjYj− Oj(ξi+ iγj)) + γjγj = (Yj− ξi − iγj)Ojoo −1 i j Oj(Yj− ξi− iγj) + γjγj = (Yj− ξi − iγj)Si joo(Yj− ξi− iγj) + γjγj = γj− −1i (Yj − ξi))iSooi ji(γj − −1i (Yj− ξi) + γjγj = γj− iCooi j(Yj− ξi) (Ip− iCooi ji)−1 γj − −1i (Yj− ξi) +(Yj− ξi)Cooi j(Yj− ξi).

Moreover, it is not difficult to show the identity|oo_{i j}| = |oo_{i j}||Ip− iCooi ji|. By Bayes’ rule, the conditional density ofγj given Yoj and Zi j = 1 is

f(γj|Yoj, Zi j = 1, ) ∝ |Ip− iCooi ji|−1/2 × exp −1 2 γj − qoi j)(Ip− iCooi ji)−1(γj − qoi j " I_Rp +(γj),

(18)

where qo_{i j} = iCoo_{i j}(Yj − ξi). This implies that γj|(Yoj, Zi j = 1, ) ∼ T Np iCooi j(Yj − ξi), Ip− iCooi ji, R₊p . References

Anderson TW (2003) An introduction to multivariate statistical analysis, 3rd edn. Wiley and Sons, New York

Arellano-Valle RB, Azzalini A (2006) On the unification of families of skew-normal distributions. Scand J Statist 33:561–574

Arellano-Valle RB, Bolfarine H, Lachos VH (2007) Bayesian inference for skew-normal linear mixed models. J Appl Stat 34:663–682

Arellano-Vallea RB, Genton MG (2005) On fundamental skew distributions. J Multivariate Anal 96:93–116 Azzalini A (2005) The skew-normal distribution and related multivariate families (with discussion). Scand

J Statist 32:159–200

Azzalini A, Capitanio A (1999) Statistical applications of the multivariate skew-normal distribution. J R Stat Soc Ser B 61:579–602

Azzalini A, Dalla Valle A (1996) The multivariate skew-normal distribution. Biometrika 83:715–726 Basford KE, McLachlan GJ (1985) Estimation of allocation rates in a cluster analysis text. J Am Stat Assoc

80:286–293

Box GEP, Cox DR (1964) An analysis of transformation. J R Stat Soc Ser A 26:211–252 Cook RD, Weisberg S (1994) An introduction to regression graphics. Wiley, New York

Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algo-rithm (with discussion). J R Stat Soc Ser B 39:1–38

Diebolt J, Robert CP (1994) Estimation of finite mixture distributions through Bayesian sampling. J R Stat Soc Ser B 56:363–375

Escobar MD, West M (1995) Bayesian density estimation and inference using mixtures. J Am Stat Assoc 90:577–588

Frühwirth-Schnatter S (2006) Finite mixture and Markov switching models. Springer, New York Ghahramani Z, Jordan MI (1994) Supervised learning from incomplete data via an EM approach. In:

Cowan JD, Tesarro G, Alspector J (eds) Advances in neural information processing systerms, vol 6. Morgan Kaufmann Publishers, San Francisco pp 120–127

Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biomet-rika 57:97–109

Hollander M, Wolfe DA (1999) Nonparametric statistical methods, 2nd edn. Wiley, New York

Lin TI (2009a) Maximum likelihood estimation for multivariate skew normal mixture models. J Multivariate Anal 100:257–265

Lin TI (2009b) Robust mixture modeling using multivariate skew t distributions. Stat Comput. doi:10.1007/ s11222-009-9128-9(in press)

Lin TI, Lee JC, Ni HF (2004) Bayesian analysis of mixture modelling using the multivariate t distribution. Stat Comput 14:119–130

Lin TI, Lee JC, Ho HJ (2006) On fast supervised learning for normal mixture models with missing infor-mation. Pattern Recogn 39:1177–1187

Lin TI, Lee JC, Hsieh WJ (2007a) Robust mixture modeling using the skew t distribution. Stat Comput 17:81–92

Lin TI, Lee JC, Yen SY (2007b) Finite mixture modelling using the skew normal distribution. Statist Sinica 17:909–927

Lin TI, Ho HJ, Shen PS (2009) Computationally efficient learning of multivariate t mixture models with missing information. Comp Stat 24:375–392

Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York

Liu CH, Rubin DB, Wu Y (1998) Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika 85:755–770

McLachlan GJ, Basford KE (1988) Mixture models: inference and application to clustering. Marcel Dekker, New York

(19)

Pearson K (1894) Contributions to the theory of mathematical evolution, Phi. Trans Roy Soc London A 185:71–110

Peel D, McLachlan GJ (2000) Robust mixture modeling using the t distribution. Stat Comput 10:339–348 Pyne S, Hu X, Wang K, Rossin E, Lin TI, Maier L, Baecher-Allan C, McLachlan GJ, Tamayo P, Hafler DA, De Jager PL, Mesirov JP (2009) Automated high-dimensional flow cytometric data analysis. Proc Natl Acad Sci USA 106:8519–8524

Rubin DB (1976) Inference and missing data. Biometrika 63:581–592

Sahu SK, Dey DK, Branco MD (2003) A new class of multivariate skew distributions with applications to bayesian regression models. Canad J Statist 31:129–150

Schafer JL (1997) Analysis of incomplete multivariate data. Chapman and Hall, London

Shoham S (2002) Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions. Pattern Recogn 35:1127–1142

Shoham S, Fellows MR, Normann RA (2003) Robust, automatic spike sorting using mixtures of multivar-iate t-distributions. J Neurosci Methods 127:111–122

Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation (with dis-cussion). J Am Stat Assoc 82:528–550

Titterington DM, Smith AFM, Markov UE (1985) Statistical analysis of finite mixture distributions. Wiely, New York

Wang HX, Zhang QB, Luo B, Wei S (2004) Robust mixture modelling using multivariate t distribution with missing information. Pattern Recogn Lett 25:701–710

Zio MD, Guarnera U, Luzi O (2007) Imputation through finite Gaussian mixture models. Comp Stat Data Anal 51:5305–5316