Example - Predicting Latent Groups with the Alternate K-Means Clustering Procedure

In this section, we consider two examples, the breast cancer data (continuous) and the schizophrenia syndrome scale data (categorical), and use the standard k-means and alternate k-means clustering methods to estimate the parameters of the original LCA model and of our model.

Furthermore, we use the proposed classification rule (Huang, Wang, & Hsu) for prediction.

Here, we introduce a useful tool for displaying clustering results: the heatmap. A heatmap is a two-dimensional, rectangular, colored grid that displays data which themselves come in the form of a rectangular matrix. The color of each rectangle is determined by the value of the corresponding entry in the data matrix. The rows and columns of the matrix can be rearranged independently to reveal structure in the data. Usually a clustering method is used for the reordering, so that similar rows are placed next to each other, and likewise for columns. Orderings derived from hierarchical clustering are among the most widely used, but many others are possible. If hierarchical clustering is used, it is customary to display the dendrograms as well. Here, we use non-hierarchical clustering methods (i.e., the standard k-means and alternate k-means methods) to find subgroups of individuals and arrange the heatmap rows by these groups. For the surrogates, we use agglomerative hierarchical clustering with one minus the correlation as the distance measure. We use the heatmap figures to present our results.
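The reordering behind such a heatmap can be sketched in a few lines. The following is a minimal illustration, not the code used in this study: rows are sorted by a given cluster label, and columns are ordered by an average-linkage agglomerative clustering with distance one minus the Pearson correlation. The average linkage, the function names, and the toy data are all assumptions made for the example; only the standard library is used.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def column_order(data):
    """Leaf order from average-linkage clustering of the columns,
    using 1 - correlation as the distance (linkage is an assumption)."""
    cols = list(zip(*data))
    p = len(cols)
    dist = [[1.0 - pearson(cols[i], cols[j]) for j in range(p)] for i in range(p)]
    clusters = [[j] for j in range(p)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mean(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0]

def row_order(labels):
    """Row indices sorted by cluster label (stable within a cluster)."""
    return sorted(range(len(labels)), key=lambda i: labels[i])

def reorder(data, labels):
    """Matrix with rows grouped by cluster and columns in dendrogram order."""
    rows, cols = row_order(labels), column_order(data)
    return [[data[i][j] for j in cols] for i in rows]

# Toy matrix: columns 0 and 2 move together, as do columns 1 and 3.
data = [
    [1.0, 5.0, 1.1, 4.8],
    [2.0, 3.0, 2.2, 3.1],
    [3.0, 4.0, 2.9, 4.2],
    [4.0, 1.0, 4.1, 0.9],
]
labels = [1, 0, 1, 0]  # hypothetical cluster labels for the four rows
ordered = reorder(data, labels)
```

The reordered matrix can then be passed to any plotting routine that colors cells by value; correlated columns end up adjacent, and rows of the same cluster form contiguous blocks.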

6.1 Breast cancer data

The data come from a study that used gene expression profiling to predict breast cancer outcome (Veer et al., 2002). The 78 sporadic lymph-node-negative patients under 55 years of age were selected specifically to search for a prognostic signature in their gene expression profiles. Forty-four patients remained free of disease after their initial diagnosis for an interval of at least 5 years (good prognosis group, mean follow-up of 8.7 years), and 34 patients had developed distant metastases within 5 years (poor prognosis group, mean time to metastases 2.5 years). From each patient, total RNA was isolated from tumor material and used to derive cRNA. A reference cRNA pool was made by pooling equal amounts of cRNA from each of the sporadic carcinomas. Fluorescence intensities were quantified, normalized and corrected to yield the transcript abundance of a gene as an intensity ratio with respect to the signal of the reference pool (Hughes et al., 2001).

Here, we aim to predict good and poor prognostic patients through gene expression profiling. A two-step selection process was performed to retain genes in the analysis.

First, 4741 genes were selected from 24481 genes, requiring an intensity ratio > 2 or < 0.5 (i.e., more than a two-fold difference) and a significance of regulation p-value < 0.01 in more than 3 patients. This criterion was used in the original paper and focuses attention on the most informative genes. In the second step, we selected genes based on the ratio of their between-group to within-group sums of squares (the BW ratio), as suggested by Dudoit, Fridlyand, & Speed (2002). This ratio was computed for each gene m, and the 200 genes with the largest BW ratios were retained for the finite mixture analysis.
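As an illustration of the second selection step, the BW ratio of a gene can be computed as the between-group sum of squares (group sizes times squared deviations of group means from the overall mean) divided by the within-group sum of squares. The sketch below assumes this usual form of the ratio and assumes that the groups are known class labels such as the two prognosis groups; the function names and the toy data are hypothetical.

```python
from statistics import mean

def bw_ratio(values, groups):
    """Between-group / within-group sum of squares for one gene.
    Assumes the within-group sum of squares is non-zero."""
    overall = mean(values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    between = sum(len(vs) * (mean(vs) - overall) ** 2 for vs in by_group.values())
    within = sum((v - mean(vs)) ** 2 for vs in by_group.values() for v in vs)
    return between / within

def top_genes(expr, groups, n_keep):
    """Indices of the n_keep genes (columns of expr) with the largest BW ratios."""
    genes = list(zip(*expr))
    ratios = [bw_ratio(g, groups) for g in genes]
    return sorted(range(len(genes)), key=lambda m: ratios[m], reverse=True)[:n_keep]

# Toy data: 4 patients (rows), 2 genes (columns), 2 groups.
expr = [[1, 5], [2, 1], [3, 4], [4, 2]]
groups = [0, 0, 1, 1]
selected = top_genes(expr, groups, n_keep=1)  # gene 0 separates the groups
```

In the setting of this section, `n_keep` would be 200 and `expr` would hold the 4741 pre-filtered expression ratios for the 78 patients.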

Using the 200 selected expression ratios as observed surrogates, a finite mixture model (3.1), (3.2), (3.4) was fitted. In the fitted model, age at diagnosis (years) was chosen to be associated with the conditional probabilities, and latent prevalence was also modeled as depending on age at diagnosis. We used the standard k-means clustering approach to group patients, which resulted in [...]; the 154-gene (selected) expression profile is displayed in Figures 4 and 5.
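For readers unfamiliar with the standard k-means step, a minimal sketch of Lloyd's algorithm is given below. It illustrates only the standard procedure, not the alternate k-means method and not the code used for the analyses here; the random initialization, the iteration cap, and the toy points are assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100, seed=0):
    """Standard k-means (Lloyd's algorithm) on a list of numeric vectors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct observations as initial centers
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Two well-separated toy groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(points, k=2)
```

In the application, each point would be a patient's vector of expression ratios, and the returned labels would define the row groups of the heatmap.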

An additional independent set of primary tumors from 19 young, lymph-node-negative breast cancer patients was used to validate the above 154-gene prognosis classifier. This group consisted of 7 patients who remained free of disease for at least five years and 12 patients who developed distant metastases within five years. Tables 1 and 2 show the prediction results for the standard k-means and alternate k-means approaches. The standard k-means approach misclassified 4 of the 19 patients, whereas the alternate k-means approach misclassified 3 of 19.

6.2 Schizophrenia syndrome scale data

The data were collected from a series of projects aimed at investigating the clinical manifestations of schizophrenia and searching for the neuropsychological, environmental and genetic factors underlying schizophrenia. Details of the study design and eligibility criteria were described previously (Liu, Hwu, & Chen, 1997; Chen et al., 1998; Chang et al., 2002). The analyzed data include 164 acute-state patients with schizophrenia who were recruited within one week of index admission and 155 subsided-stage patients who were living in the community under family care.

In this study, schizophrenia symptoms were assessed by the Positive and Negative Syndrome Scale (PANSS) (Cheng, Ho, Chang, Lane, & Hwu, 1996). The PANSS has 30 items and consists of three subscales: positive (seven symptoms: P1-P7), negative (seven symptoms: N1-N7) and general psychopathology (sixteen symptoms: G1-G16). Each item was originally rated on a 7-point scale (1=absent, 7=extreme), but we reduced the 7-point scale by merging the points whose response percentages were less than 10%. This study considered external covariates including demographic variables and environmental / neuropsychological factors. Demographic variables included gender, age at recruitment, years of education, and occupation (having versus no occupation). The category of no occupation included housewives, students, unemployed and retired people. The environmental factors were related to obstetric complications, prenatal growth retardation, special personal behavior and psychological adjustment problems. The neuropsychological batteries assessed reaction time, attention, speed of information processing, and active problem solving.

Specifically, the test batteries included several standard neuropsychological instruments with demonstrated reliability and validity, and we concentrated on the Continuous Performance Test (CPT), which has been widely used to measure sustained attention deficits in psychotic disorders (Chen et al., 1998).
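The reduction of the 7-point PANSS scale described above (merging points whose response percentage is below 10%) can be illustrated as follows. The text does not state which neighbor a sparse point is merged with, so the rule used here is an assumption: levels are scanned from 1 to 7 and accumulated into a bin until the bin holds at least 10% of the responses, and any sparse tail joins the last bin. The function name and the toy ratings are hypothetical.

```python
from collections import Counter

def merge_sparse_levels(ratings, levels=range(1, 8), min_share=0.10):
    """Map original rating levels to merged levels so that every merged
    level holds at least min_share of the responses (assumed merge rule)."""
    counts = Counter(ratings)
    needed = min_share * len(ratings)
    bins, current, total = [], [], 0
    for level in levels:
        current.append(level)
        total += counts[level]
        if total >= needed:  # bin is large enough: close it
            bins.append(current)
            current, total = [], 0
    if current:  # leftover sparse tail joins the last closed bin
        if bins:
            bins[-1].extend(current)
        else:
            bins.append(current)
    return {level: new for new, group in enumerate(bins, start=1) for level in group}

# Toy item with 100 responses: levels 4 to 7 are each chosen by under 10%.
ratings = [1] * 50 + [2] * 30 + [3] * 12 + [4] * 5 + [5] * 2 + [6] * 1
mapping = merge_sparse_levels(ratings)
recoded = [mapping[r] for r in ratings]
```

Under this rule the toy item keeps levels 1 and 2 and merges levels 3 through 7 into a single third category; each PANSS item would be recoded separately in the same way.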

The analysis aims to explore the subtypes (groups) of schizophrenia patients based on the PANSS measurements. In our application, the latent class model of (3.1), (3.2), and (3.3) was applied to the 30 PANSS items. The covariates associated with the conditional probabilities were sex, age (years), years of education, and occupation (with versus without occupation), and the covariates associated with latent prevalence were age of onset (years), envir11, envir21, envir22, envir31, envir32, and dprime. We used the standard k-means clustering approach to group patients, which resulted in 4 classes of sizes 231, 31, 52, and 5. We chose the tuning parameter 30 (the initial [...] surrogates. This approach grouped patients into 4 classes of sizes 221, 41, 47, and 10. The heatmaps for the 30-item (original) and 19-item (selected) profiles are shown in Figures 6 and 7.

In general, class 1 appeared to represent a group with severe/extreme positive symptoms and moderate negative symptoms; class 2 was a group with moderate positive symptoms but mild negative symptoms; class 3 represented a group with a widespread syndrome of severe positive and negative symptoms; and class 4 was a remitted group who rarely had any symptoms.

Next, we are interested in using the PANSS ratings to predict patients' phase of chronicity of disease (acute versus subsided). The prediction group contains 10 patients, consisting of 5 acute patients and 5 subsided patients. Tables 3 and 4 show the prediction results. The standard k-means approach misclassified 3 of the 10 patients, whereas the alternate k-means approach misclassified only 1 of 10.
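To put the reported counts on a common scale, the misclassification counts from both examples can be converted to error rates. The snippet below uses only the counts stated in this section; the dictionary layout and names are illustrative choices.

```python
def error_rate(n_wrong, n_total):
    """Proportion of misclassified patients."""
    return n_wrong / n_total

# (misclassified, total) as reported for the two validation sets.
results = {
    ("breast cancer", "standard k-means"): (4, 19),
    ("breast cancer", "alternate k-means"): (3, 19),
    ("schizophrenia", "standard k-means"): (3, 10),
    ("schizophrenia", "alternate k-means"): (1, 10),
}

rates = {key: error_rate(*counts) for key, counts in results.items()}
for (dataset, method), rate in rates.items():
    print(f"{dataset:14s} {method:18s} {rate:.1%}")
```

This gives roughly 21.1% versus 15.8% for the breast cancer data and 30.0% versus 10.0% for the schizophrenia data, for the standard and alternate k-means approaches respectively.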
