Example

Schizophrenia Data

The present study draws on three projects: the Multidimensional Psychopathology Group Research Projects (MPGRP), the Multidimensional Psychopathological Study on Schizophrenia (MPSS), and the Study on Etiological Factors of Schizophrenia (SEFOS). The initial project, MPGRP, investigated the clinical manifestations of schizophrenia in a cohort of schizophrenia patients. The subsequent project, MPSS, focused on the follow-up neuropsychological evaluation of the MPGRP patients. The SEFOS project aimed to identify neurobiological, environmental, and genetic factors underlying schizophrenia. The analyzed data include 169 acute-state patients who had completed the PANSS within one week of index admission and 161 subsided-state patients who were living in the community under family care.

The major instrument applied in this study is the Positive and Negative Syndrome Scale (PANSS), an assessment of the clinical psychopathological symptoms of schizophrenia, which was used to collect the patients' symptom measurements. It has 30 items rated on a 7-point scale (1 = absent, 7 = extreme). The PANSS consists of three subscales: positive (seven symptoms: P1-P7), negative (seven symptoms: N1-N7), and general psychopathology (sixteen symptoms: G1-G16). Because the original 7-point scale leaves too many parameters to estimate, we reduced the number of levels by merging, for each item, the rating levels used by fewer than 5% of the patients.
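The merging step can be illustrated with a short Python sketch. The text does not say into which neighbor a rare level is merged, so the rule below (merge into the nearest lower level that is kept, or into the lowest kept level when there is none) is an assumption, and `collapse_rare_levels` is a hypothetical helper name.

```python
from collections import Counter


def collapse_rare_levels(ratings, threshold=0.05, levels=range(1, 8)):
    """Merge rating levels used by fewer than `threshold` of the patients.

    Assumption (not stated in the text): a rare level is merged into the
    nearest lower level that is kept; if no lower level is kept, it is
    merged into the lowest kept level. Kept levels are relabelled 1, 2, ...
    """
    n = len(ratings)
    counts = Counter(ratings)
    kept = [lv for lv in levels if counts[lv] / n >= threshold]
    if not kept:
        raise ValueError("no level reaches the threshold")
    mapping = {}
    for lv in levels:
        lower = [k for k in kept if k <= lv]
        target = lower[-1] if lower else kept[0]
        mapping[lv] = kept.index(target) + 1
    return [mapping[r] for r in ratings], mapping


# Made-up ratings for one item: levels 4 and 7 are used by under 5% of patients.
item = [1] * 50 + [2] * 30 + [3] * 17 + [4] * 2 + [7] * 1
collapsed, mapping = collapse_rare_levels(item)
```

In this toy item, levels 4 to 7 are folded into level 3, so the item is analyzed with three levels instead of seven.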

Demographic variables included gender, age, age at onset of psychotic symptoms, years of education, and occupation (having versus not having an occupation). The no-occupation category included housewives, students, and unemployed and retired people.

The environmental factors were related to obstetric complications, prenatal growth retardation, special personal behavior, and psychological adjustment problems.

There were three environmental questions: (1) whether the patient had a brain injury during development, such as premature birth, brain damage, or retarded intelligence; (2) whether the patient had unstable mood or abnormal behavioral traits that interfered with daily life, such as being angry, timid, depressed, or inactive; and (3) whether the patient had psychological adjustment problems that interfered with daily life, such as a bad relationship between the parents, getting along badly with siblings, physical disease, or unforeseen family events. All three environmental factors were rated on a 3-point scale: 0 for no event, 1 for a slight event with no obvious effect on emotional and behavioral reactions, and 2 for an event with an obvious effect on emotional and behavioral reactions.

The neuropsychological batteries assessed reaction time, attention, speed of information processing, and active problem solving. Specifically, the test batteries included several standard neuropsychological instruments with demonstrated reliability and validity, including the Continuous Performance Test (CPT), Wisconsin Card Sorting Test (WCST), Wechsler Adult Intelligence Scale-Revised (WAIS-R), Wechsler Memory Scale-Revised (WMS-R), and Trail Making Tests A and B (TMT-A and -B). Here we concentrated on CPT.

We fit an RLCA model with 30 measured indicators, each with 7 levels. The covariates associated with the conditional probabilities are sex, age (years), years of education, and occupation (with versus without an occupation). The covariates associated with the latent prevalences are age of onset (years), envir11, envir21, envir22, envir31, envir32, and dprime.

We group the objects by the k-means and divisive hierarchical approaches. The analysis reported here aims to describe the associations between risk factors and the underlying latent classes, and to examine the composition of patient subtypes across different disease states.
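As a point of reference, the k-means step can be sketched in plain Python. This is a minimal version of the standard Lloyd iteration with squared Euclidean distance. The function name, the `init` and `seed` parameters, and the toy data are illustrative assumptions, not the implementation used in this study.

```python
import random


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Lloyd's algorithm: alternate nearest-centre assignment and centre update."""
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centers = [list(c) for c in start]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centre (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Move each centre to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
labels, centers = kmeans(points, 2, init=[(1, 1), (7, 6)])
```

The result of k-means depends on the starting centres, so in practice the procedure is usually repeated from several starts and the solution with the smallest within-cluster sum of squares is kept.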

Here we introduce a useful tool for clustering, the heatmap, which rearranges the rows and columns of a data matrix to reveal structure in the data. A heatmap is a two-dimensional, rectangular, colored grid. It displays data that themselves come in the form of a rectangular matrix. The color of each rectangle is determined by the value of the corresponding entry in the matrix. The rows and columns of the matrix can be rearranged independently. Usually they are reordered so that similar rows are placed next to each other, and likewise for columns. Orderings derived from a hierarchical clustering are among the most widely used, but many other orderings are possible. If hierarchical clustering is used, it is customary to display the dendrograms as well. In many cases the resulting image has rectangular regions that are relatively homogeneous, so the graphic can help determine which rows (generally the genes) have similar expression values within which subgroups of samples (generally the columns).
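The reordering idea can be sketched as follows. The example assumes one minus the Pearson correlation as the distance and average linkage for the agglomerative clustering. Both choices and the function names are illustrative, and other distances or linkages give other orderings.

```python
import math


def one_minus_corr(u, v):
    """Distance 1 - Pearson correlation (undefined for constant profiles)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / (su * sv)


def cluster_order(rows):
    """Row order read off an average-linkage agglomerative clustering."""
    clusters = [[i] for i in range(len(rows))]

    def avg_dist(a, b):
        total = sum(one_minus_corr(rows[i], rows[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the closest pair; concatenating keeps their leaves adjacent.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters[0]


# Toy matrix: rows 0 and 2 rise together, rows 1 and 3 fall together.
rows = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 4, 6, 9], [9, 6, 4, 2]]
order = cluster_order(rows)
reordered = [rows[i] for i in order]
```

Applying the returned order to the rows, and separately to the columns, before drawing the colored grid gives the rearranged heatmap.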

Results for patients at the acute state by divisive hierarchical clustering method

The heatmap for patients at the acute state is shown in Figure 3. The column dendrogram comes from agglomerative hierarchical clustering with one minus the correlation as the distance, and the row dendrogram comes from our divisive hierarchical clustering with one minus the loss of independence as the distance. The color of each cell represents the extent of induction or repression of a given gene.

Although the heatmap does not display the class structure clearly, the dendrogram of the divisive hierarchical method on the left can be used to group the objects into four classes.

Table (37) contains the scores (mean ± standard error) of the 30 items (or 5 factors) in each class, from which the four classes can be characterized as follows. Class 1 has lower mean scores on factor 2, factor 3, and factor 4. Class 3 has higher mean scores on factor 2, factor 3, and factor 5. The scores of the four classes on factor 1 are similar.
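For readers who want to reproduce summaries of this kind, the per-class mean and standard error can be computed as in the sketch below. It assumes the standard error is the sample standard deviation divided by the square root of the class size. The function name and the scores are made up for illustration.

```python
import math
import statistics
from collections import defaultdict


def class_mean_se(scores, labels):
    """Return {class: (mean, standard error)} for one item or factor."""
    by_class = defaultdict(list)
    for score, lab in zip(scores, labels):
        by_class[lab].append(score)
    return {
        lab: (statistics.mean(v), statistics.stdev(v) / math.sqrt(len(v)))
        for lab, v in by_class.items()  # each class needs at least two patients
    }


# Hypothetical factor scores for six patients in two classes.
summary = class_mean_se([1, 2, 3, 4, 6, 8], [1, 1, 1, 2, 2, 2])
```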

Table (38) gives the odds ratios for the relationship between the classes of schizophrenia at the acute state and the demographic, environmental, and neuropsychological variables. Note that the odds ratios for each class are relative to class 4 (the reference class).
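To make the reading of an odds ratio against a reference class concrete, the sketch below computes a crude odds ratio from a 2 x 2 table. This is only an illustration with made-up counts. The odds ratios in the tables are presumably estimated within the fitted model, so they need not equal crude values of this kind.

```python
def odds_ratio(with_in_class, without_in_class, with_in_ref, without_in_ref):
    """Crude odds ratio of a binary characteristic, class versus reference."""
    odds_class = with_in_class / without_in_class
    odds_ref = with_in_ref / without_in_ref
    return odds_class / odds_ref


# Hypothetical counts: 30 of 40 patients in a class have the characteristic,
# against 20 of 40 patients in the reference class.
crude_or = odds_ratio(30, 10, 20, 20)
```

An odds ratio above 1 means the characteristic is more common, on the odds scale, in the class than in the reference class. A value below 1 means it is less common.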

Compared with patients in class 4, patients in class 1 were more likely to be female. Patients in class 2 tended to have unstable mood or abnormal behavior that obviously interfered with their life.

Results for patients at the subsided state by divisive hierarchical clustering method

The heatmap for patients at the subsided state is shown in Figure 4. We clustered the objects into three classes.

Table (39) gives the scores (mean ± standard error) of the 30 items (or 4 factors) in each class, from which the three classes can be characterized as follows. Class 1 has higher mean scores on every factor. The scores of class 2 and class 3 do not differ on any factor.

Table (40) gives the odds ratios for the relationship between the classes of subsided-state schizophrenia and the demographic, environmental, and neuropsychological variables. The odds ratios for each subtype are relative to class 3 (the reference subtype).

Compared with patients in class 3, patients in class 1 tended to have a younger age of onset and to have unstable mood or abnormal behavior that obviously interfered with their life. Patients in class 2 were more likely to have psychological adjustment problems that slightly interfered with their life.

Results for patients at the acute state by k-means clustering method

Table (41) gives the scores (mean ± standard error) of the 30 items (or 5 factors) in each class. Class 3 has higher scores on every factor, and class 2 has lower scores on factor 2, factor 3, factor 4, and factor 5. Class 1 in Table (37) may correspond to class 2 in Table (41). The results obtained with the k-means method are close to those obtained with the divisive hierarchical method.

Table (42) gives the odds ratios for the relationship between the classes of acute-state schizophrenia and the demographic, environmental, and neuropsychological variables. The odds ratios for each subtype are relative to class 4 (the reference subtype). Compared with patients in class 4, patients in class 1 and class 2 were more likely to be female. The results of the k-means clustering method are close to those of the divisive hierarchical method.

Results for patients at the subsided state by k-means clustering method

Table (43) gives the scores (mean ± standard error) of the 30 items (or 4 factors) in each class. Class 1 has higher mean scores and class 3 has lower scores on every factor.

Table (44) gives the odds ratios for the relationship between the classes of subsided-state schizophrenia and the demographic, environmental, and neuropsychological variables. The odds ratios for each subtype are relative to class 3 (the reference subtype).

Compared with patients in class 3, patients in class 1 and class 2 tended to have a younger age of onset.

Breast Cancer Data

Here we used DNA microarray analysis of primary breast tumours from 117 young patients. Of these, 78 sporadic lymph-node-negative patients under 55 years of age were selected specifically to search for a prognostic signature in their gene expression profiles. Forty-four patients remained free of disease after their initial diagnosis for an interval of at least 5 years (good prognosis group, mean follow-up of 8.7 years), and 34 patients had developed distant metastases within 5 years (poor prognosis group, mean time to metastases 2.5 years). For each gene, the dataset records the mean ratio of the intensities of the red and green channels, which reflects the extent of induction or repression of the gene, and a p-value, which indicates whether the gene's mean ratio differs significantly from 1 (no change). The covariates age (years) and metastasis of year (1 if metastases > 5 years; 0 otherwise) are also included.

Gene expression microarray experiments can generate data sets with multiple missing expression values. Many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and k-means clustering are not robust to missing data and may lose effectiveness even with a few missing values. Methods for imputing missing data are therefore needed to minimize the effect of incomplete data sets on analyses and to increase the range of data sets to which these algorithms can be applied.

Troyanskaya et al. (2001) suggest that the k-nearest neighbors (KNN) approach provides accurate and robust estimates of missing values. The KNN-based method selects genes with expression profiles similar to the gene of interest to impute missing values.

For instance, if gene A has one missing value in experiment 1, this method finds K other genes that have a value present in experiment 1 and whose expression is most similar to that of A in experiments 2 to N (where N is the total number of experiments). A weighted average of the values in experiment 1 from the K closest genes is then used as an estimate for the missing value in gene A. In the weighted average, the contribution of each gene is weighted by the similarity of its expression to that of gene A.
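The procedure described above can be sketched in plain Python. The sketch assumes a root-mean-square Euclidean distance over the experiments observed in both genes and inverse-distance weights. The text only says that genes are weighted by similarity, so both choices, the function name, and the small constant `eps` are assumptions.

```python
import math


def knn_impute(matrix, gene, experiment, k, eps=1e-12):
    """Estimate the missing value matrix[gene][experiment] (stored as None)."""
    target = matrix[gene]
    candidates = []
    for g, row in enumerate(matrix):
        if g == gene or row[experiment] is None:
            continue  # a neighbour must have a value in the experiment of interest
        # Compare the two genes on the other experiments observed in both.
        shared = [(a, b) for e, (a, b) in enumerate(zip(target, row))
                  if e != experiment and a is not None and b is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
        candidates.append((dist, row[experiment]))
    if not candidates:
        raise ValueError("no gene with an observed value in this experiment")
    nearest = sorted(candidates)[:k]
    # Closer genes get larger weights; eps avoids division by zero.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


# Toy matrix (genes in rows, experiments in columns); gene 0 lacks experiment 0.
matrix = [
    [None, 0.0, 0.0],
    [10.0, 1.0, 1.0],
    [20.0, 3.0, 3.0],
    [99.0, 50.0, 50.0],
]
estimate = knn_impute(matrix, gene=0, experiment=0, k=2)
```

With K = 2 the two closest genes are at distances 1 and 3, so their values 10 and 20 receive weights in the ratio 3 to 1 and the estimate is about 12.5.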

In brief, approximately 5,000 genes (with at least a twofold difference and a p-value of less than 0.01 in more than five tumours) were selected from the 25,000 genes.
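The selection rule can be written as a short filter. The sketch assumes that the expression values are raw intensity ratios, so that a twofold difference means a ratio of at least 2 or at most 0.5. The function and parameter names are hypothetical.

```python
def passes_filter(ratios, p_values, fold=2.0, alpha=0.01, min_tumours=5):
    """True if a gene is significantly regulated in more than `min_tumours` tumours."""
    hits = sum(
        1 for r, p in zip(ratios, p_values)
        if (r >= fold or r <= 1.0 / fold) and p < alpha
    )
    return hits > min_tumours


# Made-up values for one gene across seven tumours.
ratios = [2.5, 0.4, 3.0, 0.3, 2.0, 0.5, 1.1]
keep = passes_filter(ratios, [0.001] * 7)
drop = passes_filter(ratios, [0.001] * 5 + [0.5, 0.001])
```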

These 5,000 genes were standardized so that the observations (arrays) have mean 0 and variance 1 across variables (genes). Standardizing the data in this fashion achieves a location and scale normalization of the different arrays. In a study of normalization methods, scale adjustment was found to be desirable in some cases to prevent the expression levels in one particular array from dominating the average expression levels across arrays (Yang et al. 2001).
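A minimal sketch of this standardization, with arrays in rows and genes in columns, is given below. It uses the population variance (divisor n), which is an assumption because the text does not say which divisor was used. The function name is hypothetical.

```python
import statistics


def standardize_arrays(data):
    """Scale every array (row) to mean 0 and variance 1 across its genes."""
    out = []
    for array in data:
        m = statistics.fmean(array)
        s = statistics.pstdev(array)  # population SD; fails if the row is constant
        out.append([(x - m) / s for x in array])
    return out


# Two made-up arrays measured on three genes.
standardized = standardize_arrays([[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]])
```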

Many genes exhibit near-constant expression levels across tumor samples. We thus applied a preliminary selection of genes based on the ratio of their between-group to within-group sums of squares. For a gene j, this ratio is

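$$\mathrm{BW}(j) = \frac{\sum_i \sum_k I(y_i = k)\,(\bar{x}_{kj} - \bar{x}_{\cdot j})^2}{\sum_i \sum_k I(y_i = k)\,(x_{ij} - \bar{x}_{kj})^2} \tag{6.1}$$

where $\bar{x}_{\cdot j}$ and $\bar{x}_{kj}$ denote the average expression level of gene $j$ across all tumor samples and across samples belonging to class $k$ only, $x_{ij}$ is the expression level of gene $j$ in sample $i$, $y_i$ is the class of sample $i$, and $I(\cdot)$ is the indicator function.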

We use (6.1) to compute the BW ratio for each gene and select the 70 genes with the largest BW ratios from the 5,000 genes for our study.
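The computation of the between-group to within-group ratio and the selection of the top genes can be sketched as follows. The data layout (samples in rows, genes in columns), the function names, and the toy values are assumptions made for illustration.

```python
def bw_ratio(values, labels):
    """Between-group over within-group sum of squares for one gene."""
    overall = sum(values) / len(values)
    class_means = {}
    for k in set(labels):
        members = [x for x, lab in zip(values, labels) if lab == k]
        class_means[k] = sum(members) / len(members)
    between = sum((class_means[lab] - overall) ** 2 for lab in labels)
    within = sum((x - class_means[lab]) ** 2 for x, lab in zip(values, labels))
    return between / within  # undefined if the gene is constant within classes


def top_genes(expr, labels, n_top=70):
    """Indices of the n_top genes with the largest BW ratios."""
    n_genes = len(expr[0])
    ratios = [bw_ratio([row[j] for row in expr], labels) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: ratios[j], reverse=True)[:n_top]


# Toy data: gene 0 separates the two classes, gene 1 does not.
labels = [0, 0, 0, 1, 1, 1]
expr = [[1, 1], [2, 8], [3, 2], [7, 7], [8, 3], [9, 9]]
ratio_gene0 = bw_ratio([row[0] for row in expr], labels)
best = top_genes(expr, labels, n_top=1)
```

For gene 0 the class means are 2 and 8 around an overall mean of 5, which gives a between-group sum of squares of 54, a within-group sum of squares of 4, and a ratio of 13.5.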

For continuous data, the RLCA model can be applied by rewriting (3.2). Here the model has 70 measured indicators, the covariate associated with the conditional distributions is age (years), and the covariate associated with the latent prevalences is metastasis of year. The model becomes

$$f(y_i \mid x_i, z_i) = \sum_{j=1}^{J} \left\{ \Pr(S_i = j \mid x_i) \prod_{m=1}^{M} f(y_{im} \mid S_i = j, z_{im}) \right\},$$

where $f(y_{im} \mid S_i = j, z_{im}) \sim N(\mu_{imj}, \sigma_m^2)$, $x_i$ denotes the covariates associated with the latent prevalences, and $z_{im}$ denotes the covariates associated with the conditional distributions. The parameters $\mu_{imj}$ and $\sigma_m^2$ can be replaced by their estimates.

The model can also predict the class membership of additional data using the posterior probability of class membership.
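Given estimates of the class prevalences, the class-specific means, and the indicator variances, the posterior probability of class membership follows from Bayes' rule applied to the normal mixture above. The sketch below is a generic computation under that model, not the estimation code of this study. The argument names are hypothetical, and the covariate effects are assumed to be already absorbed into `prevalence` and `mu` for the subject at hand.

```python
import math


def posterior(y, prevalence, mu, sigma2):
    """Posterior class probabilities for one subject.

    y[m]          observed value of indicator m
    prevalence[j] Pr(S = j) for this subject (given the prevalence covariates)
    mu[j][m]      mean of indicator m in class j for this subject
    sigma2[m]     variance of indicator m
    """
    log_joint = []
    for j, eta in enumerate(prevalence):
        ll = math.log(eta)
        for m, ym in enumerate(y):
            ll += (-0.5 * math.log(2 * math.pi * sigma2[m])
                   - (ym - mu[j][m]) ** 2 / (2 * sigma2[m]))
        log_joint.append(ll)
    # Work on the log scale and subtract the maximum to avoid underflow
    # when the number of indicators is large (70 in this application).
    top = max(log_joint)
    weights = [math.exp(v - top) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example: one indicator, two equally likely classes with means -1 and 1.
post = posterior([1.0], [0.5, 0.5], [[-1.0], [1.0]], [1.0])
predicted_class = max(range(len(post)), key=lambda j: post[j]) + 1
```

A new subject is assigned to the class with the largest posterior probability, which here is class 2.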

Results for breast cancer data with divisive hierarchical clustering method

The heatmap was applied to microarray data by Eisen et al. (1998) and has become a standard visualization method for this type of data.

The heatmap for the 70-gene profile is displayed in Figure 5. The column dendrogram comes from agglomerative hierarchical clustering with one minus the correlation as the distance, and the row dendrogram comes from our divisive hierarchical clustering with one minus the loss of independence as the distance. The color of each cell represents the extent of induction or repression of a given gene. The objects can easily be grouped into two classes of 39 objects each. Notably, in the upper group only 30% of the patients were from the group with distant metastases after more than 5 years, whereas in the lower group 82% of the patients had a good prognosis. Thus we can distinguish between "good prognosis" and "poor prognosis" patients.

We fit a two-class RLCA model in which the covariate associated with the conditional probabilities is age (years) and the covariate associated with the latent prevalences is metastasis after more than 5 years (1 if more than 5 years; 0 otherwise).

Table (45) gives the odds ratios for the relationship between the classes of tumours and the two covariates, age and metastasis after more than 5 years. The odds ratios for each class are relative to class 2. Compared with patients in class 2, patients in class 1 were more likely to have metastasis after more than 5 years.

To validate our method, an additional independent set of primary tumours from 19 young, lymph-node-negative breast cancer patients was selected. This group consisted of 7 patients who remained metastasis free for at least five years and 12 patients who developed distant metastases within five years. The disease outcome was predicted from the posterior probabilities of class membership in Table (46), which resulted in 3 incorrect classifications out of 19.

Results for breast cancer data with k-means clustering method

Table (47) gives the odds ratios for the relationship between the classes of tumours and the two covariates, age and metastasis after more than 5 years, and Table (48) displays the performance of the predictions of class membership. These results are consistent with those of the divisive hierarchical method because, when the objects are grouped into two classes, the k-means method is a special case of the divisive hierarchical clustering method.

