Let $(Y_{i1}, \ldots, Y_{iM})$ denote a set of $M$ observable surrogates and $S_i$ denote the unobservable class membership for the $i$th individual in a study sample of $N$ individuals. Unlike the traditional LCA model, we assume that some surrogates do not differ across the unobservable latent classes. We call these surrogates “noisy surrogates”. The other surrogates, which have different distributions in different latent classes, are called “clustering surrogates”. We hope to find the noisy surrogates and exclude their influence when estimating the latent classes. So, under this idea, we let
The finite mixture model is completed by two assumptions:
Heuristically, $\eta_j$ is the population prevalence of class $j$, and $p_{mkj}$ is the probability of an individual in class $j$ being at level $k$ of $Y_{im}^{(2)}$; we do not explore the influence of $Y_{im}^{(1)}$ in this article.
Some authors have extended the finite mixture model to describe the effects of measured covariates on the underlying mechanism and/or on measured surrogate distributions within latent levels. One can summarize the effect of risk factors on the underlying mechanism by allowing covariates $x_i = (1, x_{i1}, \ldots, x_{iP})^T$ to be functionally related to the latent class $S_i$ (Dayton & Macready, 1998; Bandeen-Roche et al., 1997; Huang & Bandeen-Roche, 2004). We implement the generalized linear framework (McCullagh & Nelder, 1989) to incorporate covariate effects into $S_i$:
To adjust for characteristics associated with surrogates, and hence prevent possible misclassification of underlying variable categories, we can incorporate individual-level independent variables into the within-class distributions of measured surrogates (Melton, Liang, & Pulver, 1994; Huang & Bandeen-Roche, 2004; Muthen & Muthen, 2007).
Let $z_i = (z_{i1}, \ldots, z_{iM})^T$ with $z_{im} = (1, z_{im1}, \ldots, z_{imL})^T$, $m = 1, \ldots, M$, be covariates used to build direct effects on measured surrogates within latent classes for the $i$th individual. When surrogates are ordinal or categorical variables, we assume that
If surrogates are continuous variables, we assume that

$$Y_{im}^{(2)} \mid (S_i = j, z_{im}) \sim \text{Normal}\left(\gamma_{mj} + \alpha_{1m} z_{im1} + \cdots + \alpha_{Lm} z_{imL},\ \sigma_m^2\right). \quad (3.4)$$

In the conditional distribution models (3.3) and (3.4), we allow unrestricted intercepts, but we do not allow the covariate coefficients to vary across classes (i.e., $\sigma_m^2$, $\alpha_{lm}$, and $\alpha_{lmk}$, $l = 1, \ldots, L$, are independent of $j$). This constraint is logical if the primary purpose of modeling conditional probabilities is to prevent possible misclassification by adjusting for characteristics associated with surrogates. In addition, after adjusting for covariate effects, the conditional independence assumption is also conditional on $z_i$, that is,

4 Parameter Estimation by Clustering Algorithm
Parameters in the RLCA model are typically estimated using the EM algorithm (Goodman, 1974; Bandeen-Roche et al., 1997; Huang & Bandeen-Roche, 2004). Although this method has notable advantages (e.g., yielding consistent and asymptotically normally distributed estimates, and directly providing standard error estimates for parameters), it can be vulnerable to violations of model assumptions and may have difficulty converging when fitting models with large numbers of surrogates and/or latent classes. Here, we propose an alternative strategy for estimating parameters. The proposed method consists of two stages. First, the alternate k-means method, adapted from cluster analysis, is used to identify noisy surrogates and to estimate the underlying latent class membership. Second, the estimated class membership is treated as a known variable and the other parameters are then estimated.
4.1 Latent class membership estimation when not incorporating covariate effects
Finite mixture analysis is a useful tool to classify objects based on their responses to a set of surrogates. The basic model postulates an underlying categorical latent variable $S_i \in \{1, \ldots, J\}$. Within any category of the latent variable, measured clustering surrogates are assumed to be independent of one another, and noisy surrogates are assumed not to differ across classes given the clustering surrogates. When more than one assumption must be controlled, the traditional k-means algorithm fails to work. We therefore propose the alternate k-means clustering method and estimate $S_i$ by applying this method to find noisy surrogates and to group the objects into $J$ subgroups such that objects in one subgroup have a set of statistically independent clustering surrogates. Unlike the traditional EM approach, which derives the grouping of objects under the assumptions, the proposed method tries to find the “optimal” grouping that best satisfies the assumptions.
4.1.1 The measurement for completing assumption (A1)
Assumption (A1) means that the clustering surrogates should be independent within the same latent class. So, we use only the clustering surrogates $Y^{(2)}$ to calculate the sample covariance matrix. For continuous surrogates, the sample covariance matrix is an $M_2 \times M_2$ matrix, where $M_2$ is the number of clustering surrogates. For polytomous surrogates, each surrogate should be represented as a vector with elements being the indicators of each category:
The sample covariance matrix is obtained by replacing the probabilities with the sample averages. Let $ACov_j$ be the average of the absolute values of the off-diagonal elements (continuous surrogates) or off-diagonal blocks (polytomous surrogates) of the sample covariance matrix computed from the objects in class $j$. Then, we define the “loss of independence” as
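As an illustration of the quantity $ACov_j$ for continuous surrogates, the following is a minimal sketch in plain Python. The function names are hypothetical, and the way the class-wise values are combined into a single loss (a class-size-weighted mean) is an assumption made only for this example, not necessarily the definition used in this paper.

```python
from itertools import combinations


def sample_cov(x, y):
    """Sample covariance of two equal-length sequences (denominator n - 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def acov(rows):
    """Average absolute off-diagonal covariance among the columns of `rows`.

    `rows` holds the clustering surrogates of the objects in one class,
    one object per row. Degenerate inputs (fewer than two objects or
    fewer than two surrogates) return 0.
    """
    if len(rows) < 2:
        return 0.0
    cols = list(zip(*rows))
    pairs = list(combinations(range(len(cols)), 2))
    if not pairs:
        return 0.0
    return sum(abs(sample_cov(cols[a], cols[b])) for a, b in pairs) / len(pairs)


def loss_of_independence(data, labels):
    """Combine ACov_j over classes.

    ASSUMPTION: classes are weighted by their relative size; the paper's
    own aggregation may differ.
    """
    total = 0.0
    for j in sorted(set(labels)):
        rows = [r for r, s in zip(data, labels) if s == j]
        total += len(rows) / len(data) * acov(rows)
    return total
```

A grouping under which the surrogates are uncorrelated within every class gives a value near zero, while a grouping that mixes well-separated subgroups inflates the within-class covariances and therefore the loss.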
Assumption (A2) means that the conditional expectations in any group should be equal, that is,

$$E\left[Y_i^{(1)} \mid Y_i^{(2)}, S_i = 1\right] = \cdots = E\left[Y_i^{(1)} \mid Y_i^{(2)}, S_i = J\right] = E\left[Y_i^{(1)} \mid Y_i^{(2)}\right],$$

and we use a non-parametric method to evaluate the conditional expectation. To complete our algorithm, we need a measurement, which we call the between class variation. For continuous surrogates, using the “nearest neighbor” approach, we define
performing alternate k-means clustering. The “between class variation” is then defined as
should be represented as a vector with elements being the indicators of each category:
sparse k-means clustering method (Witten & Tibshirani, 2009). The following steps yield the estimated class membership for individuals and surrogates:
IK1. Randomly partition the objects into $J$ initial classes.
IK2. Let all the surrogates be clustering surrogates. Proceed through the list of objects, assigning objects to latent classes with the "loss of independence" as the distance measure.
IK3. Randomly assign the surrogates to the clustering group with probability 0.8 and to the noisy group with probability 0.2.
IK4. Fix the object class obtained from IK2. Proceed through the list of surrogates, assigning surrogates to clustering or noisy group with the pLoI as the distance measure.
IK5. Fix the surrogate group obtained from IK4. Proceed through the list of objects, assigning objects to latent classes with the pLoI as the distance measure.
IK6. Iterate IK4 and IK5 until the surrogate group assignment converges (i.e., no surrogate changes group).
In steps IK4 and IK5, we use a standard k-means clustering method to assign objects to classes and surrogates to the clustering or noisy group, with the pLoI as the distance measure. The standard k-means clustering method works as follows:
K1. First, all objects (or surrogates) are partitioned into K initial clusters.
K2. Proceed through the list of objects (or surrogates), assigning each object (or surrogate) to the cluster where the minimum pLoI is reached.
K3. Repeat step K2 until no more reassignments take place.
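The two procedures above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `loi` and `ploi` are hypothetical user-supplied functions that take the object labels and the surrogate grouping and return the loss to be minimized, all function and parameter names are invented for the example, and `max_rounds` is a safeguard added here.

```python
import random


def greedy_pass(assign, choices, loss_of):
    """Steps K2-K3: move each item to the choice with the smallest loss.

    `assign` is modified in place. Sweeps are repeated until no item is
    reassigned. Returns True if anything changed.
    """
    changed_any = False
    while True:
        changed = False
        for i in range(len(assign)):
            current = assign[i]
            best, best_loss = current, loss_of(assign)
            for c in choices:
                if c == current:
                    continue
                assign[i] = c
                trial = loss_of(assign)
                if trial < best_loss:  # strict decrease guarantees termination
                    best, best_loss = c, trial
            assign[i] = best
            if best != current:
                changed = True
        if not changed:
            break
        changed_any = True
    return changed_any


def alternate_kmeans(n_objects, n_surrogates, n_classes, loi, ploi,
                     seed=0, max_rounds=50):
    """Steps IK1-IK6 with hypothetical loss callables loi and ploi."""
    rng = random.Random(seed)
    # IK1: random initial partition of the objects.
    labels = [rng.randrange(n_classes) for _ in range(n_objects)]
    # IK2: all surrogates are clustering surrogates; the distance is the LoI.
    everything = [True] * n_surrogates
    greedy_pass(labels, range(n_classes), lambda lab: loi(lab, everything))
    # IK3: random surrogate grouping (clustering with probability 0.8).
    is_clustering = [rng.random() < 0.8 for _ in range(n_surrogates)]
    for _ in range(max_rounds):
        # IK4: object classes fixed, reassign surrogates.
        moved = greedy_pass(is_clustering, (True, False),
                            lambda grp: ploi(labels, grp))
        # IK5: surrogate groups fixed, reassign objects.
        greedy_pass(labels, range(n_classes),
                    lambda lab: ploi(lab, is_clustering))
        # IK6: stop once no surrogate changed group.
        if not moved:
            break
    return labels, is_clustering
```

Because every reassignment must strictly lower the loss, each pass stops after finitely many sweeps. The result is a local optimum that depends on the random start, so in practice one would likely try several seeds.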
The flow charts for the alternate k-means and standard k-means algorithms are shown in Figure 2 and Figure 3.
4.1.5 Estimation of tuning parameter
Our alternate k-means algorithm is sensitive to the tuning parameter $\lambda$, so we have to choose an appropriate value of $\lambda$. Here, we propose an idea for selecting this parameter. First, we calculate the loss of independence $LoI_{Y^{(2)}}$ on $Y^{(2)}$ and the between class variation $BCV_{Y^{(1)} \mid Y^{(2)}}$ on $Y^{(1)}$ given $Y^{(2)}$ after algorithm steps IK1 and IK2. Then, we set

$$\lambda = \frac{LoI_{Y^{(2)}}}{BCV_{Y^{(1)} \mid Y^{(2)}}}.$$

This setting reduces the effect of the difference in magnitude between these two values. We believe that a large difference between $LoI_{Y^{(2)}}$ and $BCV_{Y^{(1)} \mid Y^{(2)}}$ makes the algorithm fail, and we find that an appropriate $\lambda$ not only shrinks the difference between the two values but also gives good prediction results.
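The ratio rule can be illustrated with a few lines of Python. The function name and the numbers are hypothetical, and the sketch assumes that the tuning parameter multiplies the between class variation term, so that both terms start on the same scale.

```python
def tuning_parameter(loi_value, bcv_value):
    """Return the ratio LoI / BCV, both computed after steps IK1 and IK2."""
    if bcv_value <= 0:
        raise ValueError("BCV must be positive to form the ratio")
    return loi_value / bcv_value


# Hypothetical initial values, for illustration only.
initial_loi, initial_bcv = 15.0, 0.5
lam = tuning_parameter(initial_loi, initial_bcv)

# ASSUMPTION: the parameter scales the BCV term, so the rescaled BCV
# matches the initial LoI.
print(lam, lam * initial_bcv)
```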
4.2 Latent class membership estimation when incorporating covariate effects
The alternate k-means clustering algorithms are based on assumptions (A1) and (A2). If covariates $z_{im}$ are incorporated into the conditional distributions as in models (3.3) and (3.4), the conditional independence assumption is also conditional on the incorporated covariates (i.e., assumption (3.5)). To apply these algorithms to models (3.3) and (3.4), one would need to “eliminate” the covariate effects, and hence “marginalize” models (3.3) and (3.4).
Here, we adopt the marginalization process developed in Section 3.3.1 of Huang (2005). To present the process, we first reparameterize models (3.3) and (3.4) as
conditional distributions by treating these residuals as new response variables and regressing them on $S_i$. Therefore, the conditional independence assumption (3.5) is considered satisfied if objects belonging to the same latent class have a set of $M_2$ statistically independent residuals.
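For a continuous surrogate, the residual step could be sketched as follows. This is a minimal example that assumes a single covariate (for instance, age) and ordinary least squares. The function name is hypothetical, and the general case with $L$ covariates would need a multiple regression instead.

```python
def ols_residuals(y, z):
    """Residuals of the simple linear regression of y on one covariate z.

    The residuals (observed minus fitted values) play the role of the new
    response variables that are passed on to the clustering step.
    """
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    szz = sum((a - mz) ** 2 for a in z)
    if szz == 0:
        raise ValueError("the covariate is constant, so the slope is undefined")
    slope = sum((a - mz) * (b - my) for a, b in zip(z, y)) / szz
    intercept = my - slope * mz
    return [b - (intercept + slope * a) for a, b in zip(z, y)]
```

Applying the function to each surrogate in turn gives a matrix of residuals with the same shape as the original data, which can then be clustered as in Section 4.1.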
Now, we consider $Y_i = (Y_{i1}, \ldots, Y_{iM}) = (Y_i^{(1)}, Y_i^{(2)})$. When the $Y_{im}$'s are continuous, the typical residuals of linear regressions $R_{im}$ (i.e., the differences between observed responses and their modeled predictors) are computed. When the $Y_{im}$'s are categorical, the problem becomes how to calculate residuals from the generalized linear model. The pseudo-residual is defined as

$$R_{im} = \left[\widehat{\text{Cov}}(Y_{im})\right]^{-1} \left(Y_{im} - \hat{p}_{im}\right), \quad (4.15)$$

where $Y_{im}$ is as defined in section, $p_{im} = E(Y_{im} \mid z_{im})$, and “hat” denotes the estimated values based on (4.13). The pseudo-residual (4.14) is defined by analogy between the iteratively reweighted least squares of generalized linear models and the least-squares estimates of linear regressions (Landwehr, Pregibon, & Shoemaker, 1984; Huang, 2005). We then classify objects based on the new response variables, namely the residuals (continuous surrogates) or the pseudo-residuals (categorical surrogates), as done in the previous subsection.
5 Classification Using Finite Mixture Models
In many studies, a major interest is to predict new observations' unknown disease statuses based on their measurements on surrogates. Classification rules for this purpose have been developed (Huang, Wang, & Hsu, manuscript), and we use their ideas to
then the posterior probability of classifying him/her as the disease status $D^* = c$ is
measured surrogates. We can estimate the right-hand side of (5.2) by
and the maximum estimated posterior probability is reached, i.e.,
6 Example
In this section, we consider two examples, the breast cancer data (continuous) and the schizophrenia syndrome scale data (categorical), and use the standard k-means and alternate k-means clustering methods to estimate the parameters in the original LCA model and in our model. Furthermore, we use the proposed classification rule (Huang, Wang, & Hsu, manuscript) for prediction.
Here, we introduce a useful tool for clustering. A heatmap rearranges the columns and rows of a data matrix to show structure in the data. It is a two-dimensional, rectangular, colored grid that displays data which themselves come in the form of a rectangular matrix. The color of each rectangle is determined by the value of the corresponding entry in the data matrix. The rows and columns of the matrix can be rearranged independently. Usually they are reordered using clustering methods such that similar rows are placed next to each other, and likewise for columns. Among the orderings that are widely used are those derived from a hierarchical clustering, but many other orderings are possible. If hierarchical clustering is used, it is customary to provide the dendrograms as well. Here, we use non-hierarchical clustering methods (i.e., the k-means and alternate k-means clustering methods) to find subgroups of individuals and plot the heatmap by these groups. For the surrogates, we use agglomerative hierarchical clustering with one minus the correlation as the distance measure. We use the heatmap figures to show our results.
6.1 Breast cancer data
The data come from a study that used gene expression profiling to predict breast cancer outcome (van 't Veer et al., 2002). The 78 sporadic lymph-node-negative patients under 55 years of age were selected specifically to search for a prognostic signature in their gene expression profiles. Forty-four patients remained free of disease after their initial diagnosis for an interval of at least 5 years (good prognosis group, mean follow-up of 8.7 years), and 34 patients had developed distant metastases within 5 years (poor prognosis group, mean time to metastases 2.5 years). From each patient, total RNA was isolated from tumor material and used to derive cRNA. A reference cRNA pool was made by pooling equal amounts of cRNA from each of the sporadic carcinomas. Fluorescence intensities were quantified, normalized and corrected to yield the transcript abundance of a gene as an intensity ratio with respect to that of the signal of the reference pool (Hughes et al., 2001).
Here, we aim to predict good and poor prognostic patients through gene expression profiling. A two-step selection process was performed to retain genes in the analysis. First, 4741 genes were selected from 24481 genes with an intensity ratio > 2 or < 0.5 (i.e., more than a two-fold difference) and a significance of regulation p-value < 0.01 in more than 3 patients. This criterion was used in the original paper and focuses attention on the most informative genes. In the second step, we selected genes based on the ratio of their between-group to within-group sums of squares, as suggested by Dudoit, Fridlyand, and Speed (2002). For a gene $m$, that ratio is

$$BW(m) = \frac{\sum_i \sum_k I(g_i = k)\,(\bar{Y}_{km} - \bar{Y}_{\cdot m})^2}{\sum_i \sum_k I(g_i = k)\,(Y_{im} - \bar{Y}_{km})^2},$$

where $g_i$ denotes the group of patient $i$, $\bar{Y}_{km}$ is the average expression of gene $m$ in group $k$, and $\bar{Y}_{\cdot m}$ is the average expression of gene $m$ across all patients. We selected the 200 genes with the largest BW ratios for finite mixture analysis.
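The second selection step can be sketched in a few lines of Python. The sketch follows the between-group to within-group sum-of-squares ratio of Dudoit, Fridlyand, and Speed (2002). The function names, the data layout (one row per patient, one column per gene), and the toy numbers in the usage note are assumptions made for illustration.

```python
def bw_ratio(values, groups):
    """Between-group over within-group sum of squares for one gene.

    `values` are the expression ratios of the gene, one per patient, and
    `groups` gives the group label of each patient.
    """
    overall = sum(values) / len(values)
    between = within = 0.0
    for g in set(groups):
        vals = [v for v, k in zip(values, groups) if k == g]
        mean_g = sum(vals) / len(vals)
        between += len(vals) * (mean_g - overall) ** 2
        within += sum((v - mean_g) ** 2 for v in vals)
    # A gene that is constant within every group has no within-group spread.
    return between / within if within > 0 else float("inf")


def top_genes(expr, groups, n_keep):
    """Indices of the n_keep genes (columns of expr) with the largest ratio."""
    ratios = [bw_ratio(col, groups) for col in zip(*expr)]
    order = sorted(range(len(ratios)), key=lambda m: ratios[m], reverse=True)
    return order[:n_keep]
```

For instance, with `expr = [[1, 5.0], [2, 5.1], [9, 5.0], [10, 5.1]]` and `groups = [0, 0, 1, 1]`, the first gene separates the two groups and is ranked ahead of the second, whose group means coincide.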
Using the 200 selected expression ratios as observed surrogates, a finite mixture model (3.1), (3.2), (3.4) was fitted. In the fitted model, age at diagnosis (years) was chosen to be associated with the conditional probabilities, and latent prevalence was also modeled as depending on age at diagnosis. We used the standard k-means clustering approach to group patients; the heatmaps, including that of the 154-gene (selected) expression profile, are displayed in Figures 4 and 5.
An additional independent set of primary tumors from 19 young, lymph-node-negative breast cancer patients was used to validate the above 154-gene prognosis classifier. This group consisted of 7 patients who remained free of disease for at least five years and 12 patients who developed distant metastases within five years. Tables 1 and 2 show the prediction results from the standard k-means and alternate k-means approaches. The standard k-means approach had 4 out of 19 incorrect classifications, whereas the alternate k-means approach had 3 out of 19.
6.2 Schizophrenia syndrome scale data
The data were collected from a series of projects aimed at investigating the clinical manifestations of schizophrenia and searching for neuropsychological, environmental and genetic factors underlying schizophrenia. Details of the study design and eligibility criteria were described previously (Liu, Hwu, & Chen, 1997; Chen et al., 1998; Chang et al., 2002). The analyzed data include 164 acute-state schizophrenia patients who were recruited within one week of index admission and 155 subsided-stage patients who were living in the community under family care.
In this study, schizophrenia symptoms were assessed by the Positive and Negative Syndrome Scale (PANSS) (Cheng, Ho, Chang, Lane, & Hwu, 1996). The PANSS has 30 items and consists of three subscales: positive (seven symptoms: P1-P7), negative (seven symptoms: N1-N7) and general psychopathology (sixteen symptoms: G1-G16). Each item was originally rated on a 7-point scale (1=absent, 7=extreme), but we reduced the 7-point scale by merging the points that had response percentages of less than 10%. This study considered external covariates including demographic variables and environmental / neuropsychological factors. Demographic variables included gender, age at recruitment, years of education, and occupation (having versus no occupation). The category of no occupation included housewives, students, unemployed and retired people. The environmental factors were related to obstetric complications, prenatal growth retardation, special personal behavior and psychological adjustment problems. The neuropsychological batteries assessed reaction time, attention, speed of information processing, and active problem solving. Specifically, the test batteries included several standard neuropsychological instruments with demonstrated reliability and validity, and we concentrated on the Continuous Performance Test (CPT), which has been widely used to measure sustained attention deficits in psychotic disorders (Chen et al., 1998).
The analysis aims to explore the subtypes (groups) of schizophrenia patients based on PANSS measurements. In our application, the latent class model of (3.1), (3.2), and (3.3) was applied to the 30 PANSS items. The covariates associated with the conditional probabilities were sex, age (years), years of education, and occupation (with versus without occupation), and the covariates associated with latent prevalence were age of onset (years), envir11, envir21, envir22, envir31, envir32, and dprime. We used the standard k-means clustering approach to group patients, which resulted in 4 classes of size 231, 31, 52, and 5. For the alternate k-means approach, we chose the tuning parameter to be 30, and 19 of the 30 items were retained as clustering surrogates. This approach grouped patients into 4 classes of size 221, 41, 47, and 10. The heatmaps for the 30-item (original) and 19-item (selected) profiles are shown in Figures 6 and 7.
In general, class 1 appeared to represent a group who had severe/extreme positive symptoms and moderate negative symptoms; class 2 was a group who had moderate positive symptoms but mild negative symptoms; class 3 represented a group who had widespread whole syndrome of severe positive and negative symptoms; and class 4 was a remitted group who rarely had any symptom.
We are then interested in using the PANSS ratings to predict patients' phases of chronicity of disease (acute versus subsided). The prediction group had 10 patients, consisting of 5 acute patients and 5 subsided patients. Tables 3 and 4 show the prediction results. The standard k-means approach had 3 out of 10 incorrect classifications, whereas the alternate k-means approach had just 1 out of 10.
7 Discussion
We have proposed using the alternate k-means clustering method to search for the optimal class allocation that makes the clustering surrogates as independent as possible for objects belonging to the same class, to select the surrogates used for estimating the parameters in the model, and to create a classification rule. By treating the identified class allocation as a known predictor, the parameters underlying a finite mixture model can then be estimated. We further use a classification rule based on the finite mixture model. In the real data analyses, we demonstrate the method's ability to select surrogates and handle high-dimensional data, as well as the accuracy of the classification rule in predicting new observations' unknown disease statuses.
In both examples, the alternate k-means clustering method reduced the number of surrogates (from 200 to 154 genes and from 30 to 19 items) and predicted new observations' unknown disease statuses more accurately than the standard k-means clustering method (3 versus 4 errors out of 19, and 1 versus 3 errors out of 10).
References
Bandeen-Roche, K., Miglioretti, D. L., Zeger, S. L., & Rathouz, P. J. (1997). Latent variable regression for multiple outcomes. Journal of the American Statistical Association, 92, 1375-1386.
Brusco, M. J., & Cradit, J. D. (2001). A variable selection heuristic for k-means clustering. Psychometrika, 66, 249-270.
Chang, C. J., Chen, W. J., Liu, S. K., Cheng, J. J., Ou Yang, W. C., Chang, H. J., et al. (2002). Morbidity risk of psychiatric disorders among the first degree relatives of schizophrenia patients in Taiwan. Schizophrenia Bulletin, 28, 379-392.
Chen, W. J., Liu, S. K., Chang, C. J., Lien, Y. J., Chang, Y. H., & Hwu, H. G. (1998). Sustained attention deficit and schizotypal personality features in nonpsychotic relatives of schizophrenic patients. American Journal of Psychiatry, 155, 1214-1220.
Cheng, J. J., Ho, H., Chang, C. J., Lane, S. Y., & Hwu, H. G. (1996). Positive and Negative Syndrome Scale (PANSS): Establishment and reliability study of a Mandarin Chinese language version. Taiwanese Journal of Psychiatry, 10, 251-258.
Dayton, C. M., & Macready, G. B. (1998). Concomitant-variable latent-class models. Journal of the American Statistical Association, 83, 173-178.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Dudoit, S., Fridlyand, J., & Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97, 77-87.
Friedman, J. H., & Meulman, J. J. (2004). Clustering objects on subsets of attributes. Journal of the Royal Statistical Society, Series B, 66, 815-849.
Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35, 109-148.
Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215-231.
Huang, G. H. (2005). Selecting the number of classes under latent class regression: a factor analytic analogue. Psychometrika, 70, 325-345.
Huang, G. H., & Bandeen-Roche, K. (2004). Building an identifiable latent class model with covariate effects on underlying and measured variables. Psychometrika, 69, 5-32.
Huang, G. H., Wang, S. M., & Hsu, C. C. Prediction of underlying latent classes via k-means and hierarchical clustering algorithms. Manuscript.
Hughes, T. R., Mao, M., Jones, A. R., Burchard, J., Marton, M. J., Shannon, K. W., et al. (2001). Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nature Biotechnology, 19, 342-347.
Landwehr, J. M., Pregibon, D., & Shoemaker, C. (1984). Graphical methods for assessing logistic regression models. Journal of the American Statistical Association, 79, 61-71.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. New York: Houghton-Mifflin.
Ledoit, O., & Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88, 365-411.
Liu, S. K., Hwu, H. G., & Chen, W. J. (1997). Clinical symptom dimensions and deficits on the continuous performance test in schizophrenia. Schizophrenia Research, 25, 211-219.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models, second edition. London: Chapman and Hall.
Melton, B., Liang, K. Y., & Pulver, A. E. (1994). Extended latent class approach to the study of familial/sporadic forms of a disease: its application to the study of the heterogeneity of schizophrenia. Genetic Epidemiology, 11, 311-327.
Moustaki, I. (1996). A latent trait and a latent class model for mixed observed variables. British Journal of Mathematical and Statistical Psychology, 49, 313-334.
Muthen, L. K., & Muthen, B. O. (2007). Mplus user's guide, fifth edition. Los Angeles: