In this section, we review prior work on gene expression data analysis, including clustering, classification, and gene selection.
Cluster analysis groups a collection of objects into subsets, or “clusters,” such that objects within the same cluster are similar to each other and objects in different clusters are clearly distinct. This property makes clustering useful for displaying patterns in gene expression data. The information gained depends on which items (subjects or genes) are clustered. Clustering subjects according to their gene expression levels is an unsupervised classification method that is also helpful for class discovery (Golub et al., 1999). Clustering genes, on the other hand, reveals patterns in the gene expression levels. Alon et al. (1999) applied a two-way clustering method to colon cancer data; the clustering revealed broad coherent patterns of genes whose expression levels are correlated, suggesting a high degree of organization underlying gene expression in these tissues. Many researchers have sought better clustering algorithms. For example, Tseng and Wong (2003) proposed tight clustering, a resampling-based approach that identifies stable and tight patterns in data by using K-means clustering as an intermediate clustering engine.
In contrast to clustering subjects, many researchers have used supervised learning methods to deal with the classification problem. The basic idea is to use a training data set to build a decision function $D(\mathbf{x})$. A new observation (test data) $\mathbf{x}$ can then be classified according to the sign of $D(\mathbf{x})$, where, in the linear case,
$$D(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b, \tag{1}$$
which is a weighted sum of the gene expression levels plus a bias. A data set is said to be “linearly separable” if the subjects can be separated into two classes by a linear decision function.
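To make the sign rule concrete, the following minimal sketch classifies one new subject with a linear decision function. The weights, the bias, the expression values, and the convention of assigning a tie (a value of exactly zero) to class +1 are all assumptions chosen here for illustration.

```python
import numpy as np

def decision_function(x, w, b):
    """Linear decision function: weighted sum of expression levels plus bias."""
    return float(np.dot(w, x) + b)

def classify(x, w, b):
    """Return +1 if the decision value is non-negative, otherwise -1.
    ASSUMPTION: a decision value of exactly 0 is assigned to class +1."""
    return 1 if decision_function(x, w, b) >= 0 else -1

# Hypothetical weights for three genes and a bias term (illustrative values only).
w = np.array([0.8, -0.5, 0.1])
b = -0.2

# Expression levels of a new subject (made-up numbers).
x_new = np.array([1.0, 0.4, 2.0])
label = classify(x_new, w, b)
```

In practice the weights and the bias would be estimated from the training data by the chosen classifier rather than set by hand.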
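Golub et al. (1999) created a class predictor based on weighted votes of a set of informative genes for the well-known leukemia data. The informative genes are selected by the following ranking criterion:
$$w_i = \frac{\mu_i(+) - \mu_i(-)}{\sigma_i(+) + \sigma_i(-)}, \tag{2}$$
where $\mu_i(\pm)$ and $\sigma_i(\pm)$ are the mean and standard deviation of the expression levels of gene $i$ over the subjects in class (+) or class (−). Large positive values of $w_i$ indicate strong correlation with class (+), whereas large negative values indicate strong correlation with class (−). Originally, Golub et al. selected an equal number of genes from the positive and negative values of $w_i$. This gene selection method is a filter method. Other ranking criteria have also been used. Furey et al. (2000) used the absolute value of (2). Pavlidis et al. (2000) used
$$w_i = \frac{\left(\mu_i(+) - \mu_i(-)\right)^2}{\sigma_i(+)^2 + \sigma_i(-)^2}$$
as the ranking criterion, which is similar to Fisher’s discriminant criterion. Dudoit et al. (2002) performed a preliminary selection of genes based on the ratio of their between-group to within-group sums of squares in order to compare several discrimination methods, including Fisher linear discriminant analysis, maximum likelihood discriminant rules, nearest-neighbor classifiers, classification trees, and aggregating classifiers (bagging and boosting). For each gene $i$, this ratio is
$$\frac{BSS(i)}{WSS(i)} = \frac{\sum_{j}\sum_{k} I(y_j = k)\,(\bar{x}_{ki} - \bar{x}_{\cdot i})^2}{\sum_{j}\sum_{k} I(y_j = k)\,(x_{ji} - \bar{x}_{ki})^2},$$
where $x_{ji}$ is the expression level of gene $i$ in subject $j$, $y_j$ is the class label of subject $j$, $I(\cdot)$ is the indicator function, and $\bar{x}_{\cdot i}$ and $\bar{x}_{ki}$ denote the average expression levels of gene $i$ across all subjects and across the subjects belonging to class $k$ only.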
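As an illustration of the filter criteria above, the following sketch computes the signal-to-noise score of Golub et al., the Fisher-like score of Pavlidis et al., and the between-group to within-group sum-of-squares ratio of Dudoit et al. on a small made-up expression matrix. The function names, the toy data, and the use of the sample standard deviation (`ddof=1`) are assumptions made here for illustration and are not taken from the cited papers.

```python
import numpy as np

def golub_score(X, y):
    """Per-gene (mean+ - mean-) / (sd+ + sd-); rows are subjects, columns are genes."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (
        pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1))

def fisher_like_score(X, y):
    """Per-gene (mean+ - mean-)^2 / (var+ + var-)."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(axis=0) - neg.mean(axis=0)) ** 2 / (
        pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1))

def bss_wss_ratio(X, y):
    """Per-gene ratio of between-group to within-group sums of squares."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        class_mean = Xk.mean(axis=0)
        bss += Xk.shape[0] * (class_mean - overall) ** 2
        wss += ((Xk - class_mean) ** 2).sum(axis=0)
    return bss / wss

# Toy data: 6 subjects x 3 genes; only gene 0 separates the two classes.
X = np.array([[5.0, 1.0, 3.0],
              [6.0, 2.0, 3.5],
              [7.0, 1.5, 2.5],
              [1.0, 1.2, 3.2],
              [2.0, 1.8, 2.8],
              [3.0, 1.4, 3.1]])
y = np.array([1, 1, 1, -1, -1, -1])

snr = golub_score(X, y)
fisher = fisher_like_score(X, y)
ratio = bss_wss_ratio(X, y)
top_gene = int(np.argmax(np.abs(snr)))  # gene ranked first by |signal-to-noise|
```

All three criteria score each gene independently of the others, which is what makes them filter methods.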
Guyon et al. (2002) proposed a gene selection scheme called Recursive Feature Elimination (RFE), which is a typical wrapper method. They used Support Vector Machines (SVMs) as the classifier and, in the linear case, took the squared weights of the genes in the decision function constructed by the classifier as the ranking criterion. The intuition behind this ranking criterion is that features with larger weights in the decision function may be more informative. The procedure eliminates genes one at a time by repeating the following steps in each iteration:
1. Train the classifier.
2. Compute the ranking criterion for all features.
3. Remove the feature with the smallest ranking criterion.
Leukemia data and colon cancer data were used in Guyon et al. (2002) to demonstrate that genes selected by RFE yield better classification performance and are biologically relevant to cancer.
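The three steps of RFE can be sketched as follows. To keep the example self-contained, the linear SVM of Guyon et al. (2002) is replaced here by a ridge-regularized least-squares classifier; this substitution, the function names, the regularization value, and the simulated data are assumptions made only for illustration.

```python
import numpy as np

def train_linear(X, y, lam=1.0):
    """Fit weights by ridge-regularized least squares on centered data.
    ASSUMPTION: stand-in for the linear SVM used in the original RFE scheme."""
    Xc = X - X.mean(axis=0)
    A = Xc.T @ Xc + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, Xc.T @ (y - y.mean()))

def rfe(X, y, n_keep, lam=1.0):
    """Eliminate one gene per iteration until n_keep genes remain.
    Returns (indices of surviving genes, indices in order of elimination)."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        w = train_linear(X[:, remaining], y, lam)   # 1. train the classifier
        scores = w ** 2                             # 2. ranking criterion: squared weights
        worst = int(np.argmin(scores))              # 3. smallest criterion is removed
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated

# Simulated data: 200 subjects x 6 genes; the class depends on genes 0 and 3 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = np.where(X[:, 0] - X[:, 3] > 0, 1.0, -1.0)

selected, eliminated = rfe(X, y, n_keep=2)
```

Because the classifier is retrained after every removal, the ranking of the remaining genes can change from one iteration to the next, which is what distinguishes this wrapper approach from a single-pass filter ranking.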
Filter methods select informative genes by evaluating the discriminability of each gene individually, which may result in a set of highly correlated genes: intuitively, a ranking criterion assigns similar values to highly correlated genes. In view of this, Jaeger et al. (2003) used the fuzzy K-means clustering algorithm to group similar genes, thereby avoiding redundancy, and selected discriminative genes from each cluster according to five different statistics. The main idea is that a cluster might represent a pathway. They used a fuzzy clustering algorithm because it assigns each gene a membership probability for each of the clusters and may therefore capture the fact that some genes are involved in several pathways. The size and quality of a cluster play a part in deciding how many genes are selected from it. A very tight and dense cluster means that its members are very similar, whereas a widely dispersed cluster has more heterogeneous members. To capture the greatest possible variety of genes, it is therefore favorable to take more genes from a cluster of poor quality than from a cluster of good quality. To determine cluster quality under the fuzzy clustering algorithm, they used the membership probabilities of the genes. A gene belongs to the cluster for which it has the highest membership probability, and the quality of a cluster is assessed by the average membership probability of its elements. A high cluster quality means low dispersion; the closer the quality is to zero, the more scattered the cluster. To handle clusters that are totally unrelated to the discrimination, they also implemented “masked out clustering,” which masks out and excludes clusters whose average test-statistic p-value is poor.
They varied the number of clusters between 1 and 30 and the number of selected features between 2 and 100. Finally, the ROC (receiver operating characteristic) score, i.e., the area under the ROC curve, was used to assess performance.
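The cluster-quality measure described above can be illustrated with a short sketch. It assumes that a fuzzy clustering step has already produced a membership matrix; the matrix values and the function name are made up for illustration, and the rule that converts quality into the number of genes taken from each cluster is not reproduced here because it is not specified in this review.

```python
import numpy as np

def cluster_quality(U):
    """U[g, c] is the membership probability of gene g in cluster c (rows sum to 1).
    Returns the hard assignment of each gene and the quality of each cluster,
    defined as the mean membership probability of the genes assigned to it."""
    assign = U.argmax(axis=1)                 # gene goes to its most probable cluster
    quality = np.full(U.shape[1], np.nan)     # NaN marks a cluster with no members
    for c in range(U.shape[1]):
        members = assign == c
        if members.any():
            quality[c] = U[members, c].mean()
    return assign, quality

# Hypothetical memberships of 4 genes in 2 clusters.
U = np.array([[0.90, 0.10],
              [0.80, 0.20],
              [0.40, 0.60],
              [0.45, 0.55]])

assign, quality = cluster_quality(U)
order = np.argsort(quality)   # clusters listed from lowest to highest quality
```

Here cluster 1 has the lower quality, so under the idea described above it would contribute relatively more genes than the tighter cluster 0.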
Ding and Peng (2003) also proposed a minimum redundancy – maximum relevance (MRMR) method, which selects a feature set by minimizing redundancy within the set and maximizing relevance to the target classification problem. They used two criteria to represent the redundancy and the relevance of a feature set, respectively, and the MRMR criterion function is a combination of the two. For example, in the two-class classification problem with continuous variables, the Pearson correlation coefficient and the t-statistic can be chosen as the scores of minimum redundancy and maximum relevance, respectively. Hence, for the feature set $S$, the minimum redundancy condition can be written as
$$\min W_c(S), \qquad W_c(S) = \frac{1}{|S|^2}\sum_{i,j \in S} |c(i,j)|,$$
where $c(i,j)$ is the Pearson correlation coefficient of feature $i$ and feature $j$. The maximum relevance condition can be written as
$$\max V_t(S), \qquad V_t(S) = \frac{1}{|S|}\sum_{i \in S} |t(i)|,$$
where $t(i)$ is the t-statistic of feature $i$ between the two classes. The MRMR criterion combines the two conditions either as a difference,
$$\max_S \left[ V_t(S) - W_c(S) \right],$$
or as a quotient,
$$\max_S \left[ V_t(S) / W_c(S) \right].$$
Euclidean distance is another score of minimum redundancy for continuous variables besides the Pearson correlation coefficient. For the multi-class classification problem, they used the F-statistic as the score of maximum relevance.
They also proposed two MRMR optimization criterion functions for categorical (discrete) variables in a similar way.
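A minimal sketch of MRMR selection for the two-class, continuous case is given below, with the absolute t-statistic as relevance and the absolute Pearson correlation as redundancy. It uses a greedy search that adds one feature at a time, which is an assumption made here to keep the example short; the pooled-variance t-statistic, the function names, and the toy data are likewise illustrative choices rather than details taken from Ding and Peng (2003).

```python
import numpy as np

def abs_t_statistic(X, y):
    """Absolute two-sample t-statistic per feature (pooled variance), y in {+1, -1}."""
    pos, neg = X[y == 1], X[y == -1]
    n1, n2 = len(pos), len(neg)
    pooled = ((n1 - 1) * pos.var(axis=0, ddof=1) +
              (n2 - 1) * neg.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / np.sqrt(pooled * (1 / n1 + 1 / n2))

def mrmr(X, y, n_select, scheme="quotient"):
    """Greedy MRMR: start from the most relevant feature, then repeatedly add the
    candidate with the best relevance/redundancy trade-off (difference or quotient)."""
    relevance = abs_t_statistic(X, y)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        # Mean |correlation| of each candidate with the already selected features.
        redundancy = corr[np.ix_(candidates, selected)].mean(axis=1)
        if scheme == "difference":
            score = relevance[candidates] - redundancy
        else:
            score = relevance[candidates] / np.maximum(redundancy, 1e-12)
        selected.append(candidates[int(np.argmax(score))])
    return selected

# Toy data: gene 1 is nearly a copy of gene 0; gene 2 is weaker but less redundant.
X = np.array([[5.0, 4.8, 4.0], [6.0, 6.3, 3.0], [7.0, 6.7, 2.0], [8.0, 8.2, 1.0],
              [1.0, 0.8, 2.0], [2.0, 2.3, 1.0], [3.0, 2.7, 0.0], [4.0, 4.2, -1.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

chosen = mrmr(X, y, n_select=2, scheme="quotient")
```

On this toy data the quotient form selects genes 0 and 2 and skips gene 1, which carries almost the same information as gene 0.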
In this study, we follow the idea of Jaeger et al. (2003), but before clustering we filter out genes with little or no effect on classification to avoid selecting irrelevant genes. After a list of candidate genes has been selected from each cluster, RFE is used to determine the final gene set of the desired size.