3. Models …
3.2 RLCA
To incorporate covariate effects into LCA, let (xi, zi) be the associated covariate vector for the ith person, where xi=[1,xi1,L,xip]T are predictors associated with
latent class Si, and zi=[zi1,…, ziM]; zim=[1, zim1,…, zimL]T with m=1,…M are covariates used to build direct effects on measured indicators. The sets of covariates may include any combination of continuous and discrete measures. To marginalize
the RLCA model (3.2) , we begin by assuming that the two sets of covariates are mutually independent. The basis RLCA equation can be stated as
∑ ∏∏
framework (McCullagh and Nelder, 1989). Often, (3.2) is implemented by assuming generalized logit (Agresti, 1984) link functions:If the regression coefficients in (3.3) or (3.4) are set as 0, model (3.2) reduces to models studied by Melton, Liang and Pulver (1994), Dayton and Macready (1998) or an ordinary latent class analysis (3.1).
Notice that in the conditional probability model (3.4), we allow unrestricted intercepts and level- and item-specific covariate coefficients, but we do not allowing the coefficients to vary across classes (i.e.,αqmk is dependent on m, k but independent of j). This constraint is logical if the primary purpose of modeling conditional probabilities is to prevent possible misclassification by adjusting for characteristics associated with item measurements. Three assumptions complete (3.2):
(C1)
Pr( Y
i1= y
1, L , Y
iM= y
M| S
i, x
i, z
i) = Pr( Y
i1= y
1, L , Y
iM= y
M| S
i, z
i)
;(C2)
Pr( S
i= j | x
i, z
i) = Pr( S
i= j | x
i)
;(C3)
∏
=
=
=
=
=
M
m
im i m im i
i M iM
i
y Y y S z Y y S z
Y
1 1
1
, , | , ) Pr( | , )
Pr( L
.For more detail on model assumptions, identifiability and parameter estimations, readers may reference Huang and Bandeen-Roche (2004).
4. Parameter estimations by clustering analysis
The parameters in (3.2) are typically estimated by maximum likelihood (ML) for a fixed number of classes, J. Viewing the class membership Si as unobservable, the LCA model (3.1) and RLCA model (3.2) becomes a typical incomplete-data problem.
Goodman (1974) provided an excellent maximum likelihood strategy for estimating model parameters in (3.1), and Hung and Bandeen-Roche (2004) had successfully used the Expectation-Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1997) to computing ML estimates of the parameters in (3.2) and created a powerful computer module to implement the proposed latent class model (3.2). However implementing the EM algorithm to estimate parameters in finite-mixture models is typically time-consuming. Therefore we propose an alternative clustering analysis strategy to predict parameters in (3.1) and (3.2).
4.1 Latent class membership estimations for LCA
Latent class analysis aims to classify objects based on their responses to a set of categorical items. The basic model postulates an underlying categorical latent variable Si =1,…,J for individual i; within any category of the latent variable, the measured indicators are assumed to be independent of one another. Therefore if we can estimate the unobservable class membershipSi based on model assumption, then it is easy to predict the parameters in (3.1) by viewing the estimated class membership as known variable. We propose the following strategy to estimateSi.
Here we apply the concept of k-means (MacQueen, 1967) and agglomerative hierarchical methods to group the objects. Notice that rather than grouping the objects into J subgroups such that the objects in one subgroup are “far from” the objects in the other, we try to group objects such that observed variables are statistically
independent within latent classes. Therefore we use sample correlation or sample covariance as the distance in k-means and agglomerative hierarchical methods, and the concept of the ’ loss information’ and ‘minimum loss of information’ in ward’s hierarchical clustering method is included.
First, we introduce how to calculate the sample correlation and sample covariance matrix used in the k-means and agglomerative hierarchical clustering algorithms. For individual i, we transform the M observable polytomous outcome indicators (Yi1,L,YiM)to the following dummy variables: covariance matrix. Various elements of the variance-covariance matrix of measured indicators are:
These variances were estimated by replacing the probabilities with the sample averages. From sample covariance matrix, we can also calculate the sample
following are the steps in the k-means and agglomerative hierarchical clustering
algorithm separately.
The K-means algorithm:
1. Partition the objects into K initial clusters.
2. Proceed through the list of objects, assigning an object to the cluster which reaching ‘minimum loss of independence’.
3. Repeat Step2 until no more reassignments take place.
In Step 1, we first specify K initial centroids (seed points) and then proceed through the list of objects, assigning an object to the cluster whose centroid (mean) is nearest. (Distance is computed using Euclidean distance.) To reach enough sample size such that the sample covariance/correlation is meaningful, it is necessary to adjust the number of objects in each initial cluster. Therefore, once existing an one initial cluster including members less than we expect, we repartition the objects
“randomly” and “evenly” into K initial clusters.
Next, we shall introduce what the minimum loss of independence is in Step2. For a given cluster k, let MCovk be the mean of the absolute values of entries in non diagonal blocks of sample correlation matrix (or sample covariance matrix) for the M observable polytomous outcome indicators. For a given object, if it is assigned to some cluster j, we define the loss of independence LoI as the sum of the j MCovk or LoIj =MCov1(j) +MCov(j)2 +L+MCov(j)K , where (j)
MCov is the mean of the k
absolute values of non-diagonal-block entries of Y~ )
Cov( i after the object being assigned to cluster j . For a given object, after assigning through K clusters, we can get LoI , j= 1,…, K. The smaller the value of j LoI is, the more independent the j
Figure 1: An example of k-means algorithm procedure.
Step1: Partition the 9 objects into 3 initial clusters.
Step2: What cluster will the object 1 be assigned to?
Assigning the object 1 to the cluster 1,2 and 3 separately, we can get LoI1, LoI2 and LoI3. Assign the object 1 to the cluster which reaching
“minimum loss of independence”.
Proceed through the objects 2-9, repeat above procedure.
Step3: Repeat Step2 until no more reassignments take place.
observed variables for objects within cluster j are. So, we take the minimum LoI as j the ‘minimum loss of independence’ and assign a given object to the cluster
1. Start with N clusters, each cluster consists of a single object.
2. If there are some objects whose M items are all the same, then we merge them together. Otherwise, skip this step.
3. The union of every possible pair of clusters is considered, and the two clusters U and V whose combination results in the minimum loss of independence are merged. Label the newly formed cluster (UV).
4. Repeat Step 3 until all objects will be in a single cluster after the algorithm terminates. Record the identity of clusters that are merged and the levels (minimum loss of independence) at which the mergers take place.
Now we will introduce the minimum loss of independence in Step3 in the agglomerative hierarchical clustering algorithm. For currently K clusters, if a cluster U is merged to the other cluster V, then the K clusters decrease to K-1 clusters. We defined the loss of independence for cluster U being merged to cluster V, LoI(V, U), as the sum of MCovk(defined in the K-means algorithm) of each cluster:
U) (V, K U)
(V, 1 U U)
(V, 1 -U U)
(V, 1 U)
(V, MCov MCov MCov MCov
LoI = +L+ + + +L+ .
For currently K clusters, we can get C K2 LoI(V, U), where (V, U) belong to the union
of every possible pair of clusters. The smaller the value of LoI(V, U) is, the more
independent the observed variables for objects within the newly formed cluster (UV) are. So, we take the minimum LoI(V, U) as the ‘minimum loss of independence’ and merge the two clusters U and V whose combination results in the minimum loss of independence. An example of hierarchical algorithm procedure can be found in Figure2.
The results of agglomerative hierarchical clustering method can be displayed as a dendrogram. The vertical axis gives the values of minimum loss of independence at which the mergers occur.
4.2 Latent class membership estimations for RLCA
The k-means and agglomerative hierarchical clustering algorithms are based on the model (3.1) where no covariates are incorporated. The two algorithms also work for
initial clusters 1
Figure 2: An example of hierarchical algorithm procedure.
Step1: Star with 4 initial clusters, each cluster consists of a single object.
Step2: Which pair of clusters will be merged?
Consider the union of 6 (=C24) possible pair of clusters, we can get LoI(1,2), LoI(1,3), LoI(1,4), LoI(2,3), LoI(2,4)and LoI(3,4). Merge the pair of clusters whose combination results in the “minimum loss of
independence”.
Step3: Repeat Step2 until all objects will be in a single cluster.
the model (3.2) under eliminating the covariate effects (Huang2005), hence
“marginalize” the model (3.2).
The key to marginalizing over zi is that the process must yield random variables that follow a finite mixture distribution that is both independent of zi and has J mixing components. One strategy for achieving such marginalization can be motivated by the properties of added variable plots for linear regression models. The conditional probabilities (3.4) can be viewed as fitting a logistic regression of Yim on Si adjusting for zim, separately for each m. Next the problem becomes how to
Under polytomous item responses, the pseudo-residual of ith participant’s mth response item is marginalization process can be found in Huang (2005) and in section 2.3 of this thesis.
We can classify objects based on the new response variables Rimp to a set of
agglomerative hierarchical clustering algorithms in section 4.1, besides the estimation
Denote the estimated latent class as Sˆi for individual i. We transform (3.3) and (3.4) as the following form:
Then, we can easily predict the parameters in (3.2) by multinomial logistic regression (4.5) and (4.6).
5. Simulation study
The simulation study aims to examine the performance of the proposed approach.
5.1 Generated data from the RLCA model
Here, three different RLCA (3.2) models were simulated. One was a three-class RLCA with five two-level measured indicators, two covariates associated with conditional probabilities, and two covariates associated with latent prevalences (i.e., J=3, M=5, K1 = L=K5 =2, P=L=2). Another was a six-class RLCA with five three-level measured three-level measured indicators, two covariates associated with conditional probabilities, and two covariates associated with latent prevalences (i.e., J=6, M=5, K1 = L=K5 =3, P=L=2). The other was a two-class RLCA with the same indicators and covariate setting as the six-class model (i.e., J=2, M=5,
3 K
K1 = L= 5 = , P=L=2). For each model, the model parameters 1}
-J , 1, j ,
{βpj = L for each p∈{0,1,L,P}, {γ jmk ,j=1,L,J} for all m,k, and 1)}
-(K , 1, k M;
, 1, m ,
{αqmk = L = L m for all q, were given. Table 1~6 shows the values of the model parameters for the three model separately. We got the covariates of 3-class model from the subjects who joined the Multidimensional Psychopathological Study on Schizophrenia (MPSS) or the Study on Etiological Factors of Schizophrenia (SEFOS). We got the covariates of 6-class model and 2-class model from the subjects who joined the Multidimensional Psychopathology Group Research Projects (MPGRP), MPSS or SEFOS. In each model, the covariates associated with conditional probabilities include variables of sex and age (year), and the covariates associated with latent prevalences include variables of occupation (with versus without occupation) and dprime, which is the sensitivity index of the
We fit each model under several different sample sizes. For the three-class RLCAs, the selected sample sizes were 100 and 500, which gave roughly 3 and 16 individuals per parameter of RLCA (3.2), respectively. For the six-class RLCAs, the selected sample sizes were 300 and 1000, which gave roughly 3 and 10 individuals per parameter, respectively. For the two-class RLCAs, the selected sample sizes were 150 and 700, which gave roughly 3 and 16 individuals per parameter, respectively.
The observable measurements Yi were then generated from each different model structure with 100 replications.
5.2 Simulation results
In each case, the results of simulation study are represented in five tables which include the average parameters estimates (listed in two tables), average conditional probabilities, average latent prevalences and average correlation coefficients for 100 replications, separately. We shall explain these results later. The simulation results for 3-class model with 100 sample sizes are presented from Table 7 to Table 11. The simulation results for 3-class model with 500 sample sizes are presented from Table 12 to Table 16. The simulation results for 6-class model with 300 sample sizes are presented from Table 17 to Table 21. The simulation results for 6-class model with 1000 sample sizes are presented from Table 22 to Table 26. The simulation results for 2-class model with 150 sample sizes are presented from Table 27 to Table 31. The simulation results for 3-class model with 750 sample sizes are presented from Table 32 to Table 36. According to Table 7 ~ Table 36, we can see that these results of the k-means method using correlation coefficients measurement are similar to those of k-means method using covariance measurement. So, we shall discard the results of k-means method using covariance measurement in the following discussion.
First, we discuss the simulation results which are presented from Table 12 to
Table 16 of 3-class model with 500 sample sizes.
Average parameters estimations:
Table 12 and Table 13 under the column “TRUE” include all
{
βpj,γjmk,αqmk}
insimulated data. All average of
{
βpj,γjmk,αqmk}
estimates got from the k-means method using correlation coefficients measurement (K_Corr) and covariance measurement (K_Cova) separately and the hierarchical method using covariance measurement (H_Cova) are also shown in Table 12 and Table 13. Why not correlation coefficients measurement for Hierarchical method? At the initial stage, correlation coefficients in two objects are always one. Table 12 and Table 13 can demonstrate that the parameters estimates got from the k-means method are not bad compared to the true parameters. But the parameters estimates got from the hierarchical procedure are poor. The bad performance of the hierarchical procedure is result from there is no provision for a reallocation of objects that may have been “incorrectly” grouped at an early stage. Furthermore, the hierarchical procedure is sensitive to cluster structure.This means that hierarchical procedure have the chance to perform more well only when there is clear cluster structure than when there is no clear cluster structure.
Table 12 and Table 13 also include the standard errors of parameters estimates in doing multinomial regressions, (4.1) and (4.2), and the average sample standard errors of the parameters estimates for 100 replications. The sample standard errors of the estimates for 100 replications include the variation of doing multinomial regression and creating cluster membership. Because we use the multinomial regression to estimate parameters under the assumption of known cluster membership, the standard errors of parameters estimates in doing multinomial regression did not include the variations of creating cluster membership. Therefore, the standard errors of parameters estimates in doing multinomial regressions should be smaller than the
sample standard errors of the estimates for 100 replications. This is demonstrated in Table 12 and Table 13. However this is not demonstrated in Table 7 and Table 8 for the 3-class model with 100 sample sizes which gave very few individuals per parameter. For the sparse data, the estimated standard errors of parameters estimates in doing multinomial regressions are not accurate. Therefore, the standard errors of parameters estimates in doing multinomial regressions are not always smaller than the sample standard errors of the estimates over 100 replications for the 3-class model with 100 sample sizes.
Average conditional probabilities:
Table 14 under the column “TRUE” displays the RLCA conditional probabilities evaluated at the sample means of the incorporated covariates:
,
The average of estimated conditional probabilities over 100 replications with k-means (K_Corr and K_Cova) and hierarchical (H_Cova) methods are also shown in Table 14. The estimated conditional probabilities for k-means and hierarchical methods are:
Overall, the average conditional probabilities for the k-means method are more closed to the true conditional probabilities than the average conditional probabilities for the hierarchical method.
Average latent prevalence:
Table 15 under the column “TRUE” displays the sample average of the RLCA prevalences:
The average of estimated prevalences over 100 replications with k-means (K_Corr and K_Cova) and hierarchical (H_Cova) methods are also shown in Table 15.
The estimated prevalences are:
Overall, the average latent prevalences for the k-means method are more closed to the true latent prevalences than the average latent prevalences for the hierarchical method.
Average correlation coefficients:
We evaluated theMCovkof the objects in the same cluster k. Table 16 displays the average of MCovkover 100 replications in each cluster k. As expected, the k-means method resulted smaller average correlation coefficients than the hierarchical method.
Next, for the 6-class model with 1000 sample sizes, we shall discuss the simulation results which are presented from Table 22 to Table 26. These tables show that the results of whether the k-means procedure or the hierarchical procedure are poor obviously comparing to the 3-class model with 500 sample sizes. Figure 3 shows the dendrogram of the hierarchical procedure for 6-class model with 1000 sample sizes. The dendrogram indicates the cluster structure in 6-class model with 1000 sample sizes. Therefore we guess the objects in 6-class model with 1000 sample sizes should be divided to two clusters, which is demonstrate in the 2-class model. For the 2-class model with 750 sample sizes, the simulation results, which are presented from
Table 32 to Table 36, are much better than the 6-class model with 1000 sample sizes.
To go back to the Figure 3, the hierarchical procedure produces inversions (Morgan, Byron and Andrew P.G., 1995). An inversion occurs when an object joins an cluster at smaller covariance than that of a previous consolidation.
When we use maximum likelihood to estimate the parameters in (3.2), the maximum likelihood estimation (MLE) is relative to the number of individuals given in per parameter. For the spare data, which gave less individuals per parameter, the MLE can not be obtained or the MLE is not a good estimation .For the three models, 3-class RLCA with 100 sample sizes, 6-class RLCA with 300 sample sizes and 2-class RLCA with 150 sample sizes, which gave less individuals per parameter, the simulation results are not worse than those that gave more individuals per parameter.
It demonstrates that our clustering procedure is irrelative to the number of individuals given per parameter.
6. Discussion
The k-means and hierarchical approaches are alternatives in estimating parameters in RLCA. Overall, in our simulation study, the k-means approach performed well, but the hierarchical approach didn’t perform well. At the early stage of hierarchical approach, each cluster contains only very few objects (i.e. at the initial stage, only one object for each cluster). It is not appropriate to use the sample covariance to represent the association between two random variables when the sample size is small. The wrong reallocation of objects at an early stage will result in wrong reallocation of objects at following stage. For the improvement of the proposed hierarchical approach, it may be a good idea to choice alternative measurements of the association between two random variables.
References:
Agresti, A. (1984). Analysis of Categorical Data. New York: John Wiley and Sons.
Bandeen-Roche, K., Huang, G.H., Munoz, B., & Rubin, G.S. (1999). Determination of risk factor associations with questionnaire outcomes: A methods case study.
American Journal of Epidemiology, 150, 1165-1178.
Bandeen-Roche, K., Miglioretti, D.L., Zeger, S.L., & Rathouz, P.J. ( 1997). Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92, 1375-1386.
Cook, R.D., & Weisberg, S. (1982). Residuals and Influence in Regression. London:
Chapman Hall.
Dayton, C.M., & Macready, G.B. (1988). Concomitant-variabele latent-class models.
Journal of the American Statistical Association, 83, 173-178.
Dempster, A.P., Laird, N.M., & Rubin, D.B. (1997). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Goodman, L.A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215-231.
Green, B.F. (1951). A general solution of the latent class model of latent structure analysis and latent profile analysis. Psychometrika, 16, 151-166.
Huang, G.H. & Bandee-Roche, K. (2004). Building an identifiable latent variable model with covariate effects on underlying and measured variables.
Psychometrika, 69, 5-32.
Huang, G.H. (2005). Selecting the number of classes under latent class regression: A factor analytic analogue. Psychometrika, 70, 325-345.
Lazarsfeld, P.F. & Henry, N.W. (1968). Latent Structure Analysis. New York:
Houghton-Mifflin.
MacQueen, J.B. (1967). “Some Methods for Classification and Analysis of Multivariate Observations.” Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1, Berkeley, CA: University of California Press, 281-297
McCullagh, P., & Nelder, J.A. (1989). Generalized Linear Modeels, 2nd edition.
London: Chapman and Hall.
Melton, B., Liang, KY., & Pulver, A.E. (1994). Extended latent class approach to the study of familial/sporadic forms of a disease: Its application to the study of the heterogeneity of schizophrenia. Genetic Epidemiology, 11, 311-327.
Morgan, Byron J.T., and Andrew P.G. Ray. (1995) “Non-uniqueness and Inversions in Cluster Analysis.” Applied Statistics, 44, 117-134.
Rosvold, H.E., Mirsk, A.F., Sarason, I., Bransome Jr, D.D. Bech, L.H. (1956). A continuous performance test of brain damage. J. Consult. Psychol., 20, 343-350.
Tryon, R.C. (1939). Cluster Analysis. Edwards Brothers.
Van der Heijden, P.G.M., Dessens, J. & Bökenholt, U. (1996). Estimating the
Van der Heijden, P.G.M., Dessens, J. & Bökenholt, U. (1996). Estimating the