1.2.1 Gene Expression Classification Problems

Microarray experiments produce large numbers of profiles, each containing expression levels for thousands of genes, and we want a global overview of the genes involved in these experiments [6]. Gene expression classification has been used to determine the function of unknown genes [7], to examine expression programs for different systems in the cell [8], and to identify sets of genes that are specifically involved in a certain type of cancer or other disease [9]. Another major purpose of gene expression classification is effective data organization and visualization. It is thus not surprising that early work on gene expression analysis focused on this level, and several classification algorithms have been suggested for gene expression data [10, 11].

The practical applications of microarray gene expression profiles include the management of cancer and infectious diseases. Many machine learning techniques, such as the support vector machine (SVM), neural networks (NN), the k-nearest neighbor rule (k-NN), and logistic regression, have been used in gene expression data classification [12, 13]. However, gene expression classification remains difficult because of the following three features of microarray data:

1) high dimensionality: there are thousands of genes (or features) in the microarray experiment;

2) few samples: compared with the number of genes, the number of samples is relatively small, usually fewer than one hundred;

3) given thousands of genes, only a small number of them show strong correlation with a certain phenotype [14].
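The three features listed above can be made concrete with a small filter-style gene ranking. The following is a minimal sketch under stated assumptions, not a method taken from the cited works: the data set is synthetic (40 samples, 500 genes, of which only genes 0 to 4 differ between the two classes), and the helper names `welch_t` and `rank_genes` are hypothetical. Each gene is scored independently with Welch's t-statistic, and the genes are sorted by absolute score.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def rank_genes(X, y):
    """Return gene indices sorted by |t|, largest first.

    X is a list of samples (rows) by genes (columns); y holds 0/1 class labels.
    """
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        class0 = [row[g] for row, label in zip(X, y) if label == 0]
        class1 = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append(abs(welch_t(class0, class1)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)


# Synthetic data (an assumption for illustration): many genes, few samples,
# and only a handful of genes whose mean shifts with the class label.
rng = random.Random(0)
n_samples, n_genes, n_informative = 40, 500, 5
y = [i % 2 for i in range(n_samples)]
X = [
    [rng.gauss(4.0 * label if g < n_informative else 0.0, 1.0) for g in range(n_genes)]
    for label in y
]

top = rank_genes(X, y)[:n_informative]
print(sorted(top))
```

A univariate filter of this kind ignores interactions between genes, which is one reason the wrapper and hybrid approaches discussed in this section evaluate gene subsets together with a classifier.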

Statnikov et al. investigated various classifiers that can handle data sets with multiple classes [12]. The results indicate that the multicategory SVM is the most effective classifier for tumor classification in terms of classification accuracy when large numbers of genes are used. However, given thousands of genes, only a small number of them show strong correlation with a certain phenotype [14]. Unfortunately, it is intractable to identify the optimal subset from thousands of genes while taking both classification accuracy and linguistic interpretability into account.

Liu et al. proposed a feature selection method that combines top-ranked, test-statistic, and principal component analysis in conjunction with an ensemble NN to design classifiers [15]. Zhou and Mao suggested a filter-like evaluation criterion, called the LS Bound measure, derived from the leave-one-out procedure of least squares support vector machines (LS-SVMs), which provides gene subsets leading to more accurate classification [16]. Liu et al. combined an entropy-based feature (gene) selection method using simulated annealing with a k-NN classifier for cancer classification [17].

To advance the classification performance using a small number of genes, it is better to take both gene selection and classifier design into account simultaneously. Li et al. proposed a hybrid method of genetic algorithm (GA)-based gene selection and a k-NN classifier to assess the importance of genes for classification [18]. Ooi and Tan proposed a GA/MLH (maximal likelihood)-based method for the multicategory prediction of gene expression data [19].
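The idea of coupling gene selection with classifier design can be sketched as a wrapper search in which a GA evolves small gene subsets and each subset is scored by the leave-one-out accuracy of a k-NN classifier. The code below is a simplified illustration under assumptions and is not the algorithm of [18] or [19]: the operators (truncation selection, recombination by sampling from the union of two parents, single-gene mutation), all parameter values, the synthetic data, and the function names `knn_loo_accuracy` and `ga_select` are hypothetical choices.

```python
import random


def knn_loo_accuracy(X, y, genes, k=3):
    """Leave-one-out accuracy of k-NN using only the selected genes."""
    correct = 0
    for i in range(len(X)):
        neighbors = []
        for j in range(len(X)):
            if j != i:
                dist = sum((X[i][g] - X[j][g]) ** 2 for g in genes)
                neighbors.append((dist, y[j]))
        neighbors.sort()
        votes = [label for _, label in neighbors[:k]]
        predicted = max(set(votes), key=votes.count)  # majority vote
        correct += predicted == y[i]
    return correct / len(X)


def ga_select(X, y, subset_size=3, pop_size=20, generations=15, seed=0):
    """Evolve gene subsets; return the best subset and the best fitness per generation."""
    rng = random.Random(seed)
    n_genes = len(X[0])

    def fitness(subset):
        return knn_loo_accuracy(X, y, subset)

    pop = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        elite = ranked[: pop_size // 2]  # truncation selection with elitism
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            pool = list(set(p1) | set(p2))
            child = rng.sample(pool, subset_size)  # recombine the parents' genes
            if rng.random() < 0.3:  # mutation: swap in one unused gene
                unused = [g for g in range(n_genes) if g not in child]
                child[rng.randrange(subset_size)] = rng.choice(unused)
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness), history


# Synthetic two-class data (assumption): 20 samples, 30 genes, genes 0 and 1 informative.
data_rng = random.Random(1)
y = [i % 2 for i in range(20)]
X = [[data_rng.gauss(3.0 * label if g < 2 else 0.0, 1.0) for g in range(30)] for label in y]

best, history = ga_select(X, y)
print(sorted(best), history[-1])
```

Because the elite half of each generation is carried over unchanged, the best fitness recorded in `history` never decreases. On real microarray data the fitness evaluation dominates the cost, since every candidate subset requires a full leave-one-out pass.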

An accurate classifier with linguistic interpretability is beneficial to microarray data analysis. However, the learning results of the above-mentioned classifiers cannot be summarized into human-interpretable forms for biologists and biomedical scientists [13]. Li et al. used a tree structure to classify microarray samples [20]. Hvidsten et al. proposed learning rule-based models of biological processes from gene expression time profiles using gene ontology [21]. Vinterbo et al. presented a rule-induction and filtering strategy to obtain an accurate, small, and interpretable fuzzy classifier using a grid partition of the feature space, and compared it with a logistic regression classifier [13]. However, the grid partition method often results in too many fuzzy rules for humans to handle, and the adopted rule-filtering strategies often cause a loss of accuracy.

1.2.2 Genetic Network Inference Problems

The goal of constructing genetic network models is to reveal the regulation rules behind the gene expression data. The genetic network may be used to guide further biological experiments that uncover more delicate and substantial functions in molecular biology, biochemistry, bioengineering, and pharmaceutics. Traditional biological experiments mainly concentrate on small-scale or local reactions among parts of a complex biological system. For large-scale genetic networks, a computationally efficient inference method is desirable.

The mathematical models most often proposed to describe biochemical networks include [22] the Boolean network model [23], the Bayesian network [24, 25], and the differential equation or S-system model [26]. In Boolean network models, each gene expression level takes one of two states, true or false. These models have the advantage that they can be solved with less computing effort. Their drawback is that they cannot quantify the interaction intensity between genes and are not adequate for analyzing cyclic network structures such as feedback regulatory loops. The Bayesian network model, which can deal with linear, non-linear, and combinatorial problems, has also been used to infer genetic networks. However, like Boolean networks, it suffers from the same dilemma and is applicable only to acyclic structures [22, 24]. To cope with cyclic networks, some authors adopted an adapted dynamic Bayesian network [27, 28].
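A Boolean network can be illustrated with a few lines of code. The sketch below is hypothetical and assumes a synchronous update scheme and a three-gene feed-forward (acyclic) structure chosen only for illustration: gene A is a constant input, B copies A, and C is on when A is on and B is off. Because every state is either true or false, the model shows which genes switch but carries no information about interaction strength.

```python
def step(state, rules):
    """Synchronous update: every rule reads the previous state of all genes."""
    return tuple(rule(state) for rule in rules)


def trajectory(state, rules, n_steps):
    """Return the initial state followed by n_steps successive states."""
    states = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        states.append(state)
    return states


# Hypothetical regulation rules for genes (A, B, C).
rules = (
    lambda s: s[0],               # A: constant input, keeps its value
    lambda s: s[0],               # B: activated by A
    lambda s: s[0] and not s[1],  # C: activated by A, repressed by B
)

for state in trajectory((True, False, False), rules, 3):
    print(state)
```

Starting from A on and B, C off, gene C switches on for one step and is then shut off by B, after which the network stays in a fixed state.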

Another frequently used approach is to analyze gene expression with differential equation models. The most popular is the S-system model, which is considered suitable for characterizing biochemical network systems and capable of capturing the dynamics of regulatory systems [26, 29, 30, 31, 32, 33, 34, 35]. The S-system model is a set of non-linear differential equations of the following form:

$$\frac{dX_i(t)}{dt} = \alpha_i \prod_{j=1}^{N} X_j(t)^{g_{ij}} - \beta_i \prod_{j=1}^{N} X_j(t)^{h_{ij}}, \qquad i = 1, \ldots, N,$$

where $X_i(t)$ represents the expression level of gene $i$ at time $t$ and $N$ is the number of genes in a genetic network. $\alpha_i$ and $\beta_i$ are rate constants which indicate the direction of mass flow and must be positive. $g_{ij}$ and $h_{ij}$ are kinetic orders which reflect the intensity of interaction from gene $j$ to gene $i$. To infer an S-system model, it is necessary to estimate all $2N(N + 1)$ S-system parameters ($\alpha_i$, $\beta_i$, $g_{ij}$, $h_{ij}$) from experimental time-series data of gene expression. Essentially, this reverse engineering problem is a large-scale parameter optimization problem (LPOP), which is time-consuming and intractable.
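To make the model concrete, the following sketch evaluates and integrates an S-system, assuming the standard form in which the rate of change of each gene is a power-law production term (rate constant αi, kinetic orders gij) minus a power-law degradation term (rate constant βi, kinetic orders hij). The two-gene parameter values, the fixed-step fourth-order Runge-Kutta integrator, and the function names are assumptions made for illustration; they are not taken from the cited works.

```python
def s_system_rhs(x, alpha, beta, g, h):
    """Right-hand side dX_i/dt of an S-system for expression levels x (all positive)."""
    n = len(x)
    dx = []
    for i in range(n):
        production, degradation = 1.0, 1.0
        for j in range(n):
            production *= x[j] ** g[i][j]
            degradation *= x[j] ** h[i][j]
        dx.append(alpha[i] * production - beta[i] * degradation)
    return dx


def rk4_step(x, dt, alpha, beta, g, h):
    """One classical fourth-order Runge-Kutta step of size dt."""
    def f(state):
        return s_system_rhs(state, alpha, beta, g, h)

    k1 = f(x)
    k2 = f([xi + 0.5 * dt * k for xi, k in zip(x, k1)])
    k3 = f([xi + 0.5 * dt * k for xi, k in zip(x, k2)])
    k4 = f([xi + dt * k for xi, k in zip(x, k3)])
    return [
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


def simulate(x0, dt, n_steps, alpha, beta, g, h):
    """Return the list of states visited, starting from x0."""
    states = [list(x0)]
    for _ in range(n_steps):
        states.append(rk4_step(states[-1], dt, alpha, beta, g, h))
    return states


def n_parameters(n):
    """Number of S-system parameters for n genes: 2n rate constants plus 2n^2 kinetic orders."""
    return 2 * n * (n + 1)


# Hypothetical two-gene system: dX1/dt = 2 - X1, dX2/dt = X1 - X2.
alpha, beta = [2.0, 1.0], [1.0, 1.0]
g = [[0.0, 0.0], [1.0, 0.0]]
h = [[1.0, 0.0], [0.0, 1.0]]

states = simulate([1.0, 1.0], 0.01, 2000, alpha, beta, g, h)
print(states[-1], n_parameters(2), n_parameters(10))
```

The parameter count grows quadratically with the number of genes, which is why fitting all parameters at once becomes a large-scale optimization problem even for networks of moderate size.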

Genetic algorithms (GAs) [36] play an important role in solving the optimization problem of dynamic modeling of genetic networks using the S-system model [29, 30, 31, 33].

Kikuchi et al. used a GA with simplex crossover (SPXGA) to improve the optimization ability for dynamic modeling of genetic networks with N = 2 to 5 genes [29]. SPXGA successfully inferred the dynamics of a small genetic network using only time-series data of gene expression. When dealing with a more complicated structure with a larger number of genes (e.g., N = 10), it is hard to obtain a satisfactory solution in a limited amount of computation time. To infer large-scale genetic network models, Maki et al. proposed an efficient problem decomposition strategy that divides the inference problem into N separate small subproblems [32]. To reduce the search time of the inference problem, Voit and Almeida proposed transforming the problem into several sets of decoupled algebraic equations, which can be processed efficiently in parallel or sequentially [26]. Kimura et al. used a cooperative coevolutionary algorithm with the problem decomposition strategy to efficiently infer large-scale S-system models from noisy time-series data [31]. However, the existing efficient evolutionary algorithms require parallel computing on a PC cluster to obtain satisfactory solutions efficiently [29, 30, 31].
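The decomposition and decoupling ideas can be illustrated as follows. If the slopes dXi/dt are estimated directly from the time series, the equation for gene i becomes an algebraic relation that involves only its own 2(N + 1) parameters, so each gene can be fitted separately. The sketch below is a simplified illustration of that idea and not the procedure of [26], [31], or [32]: the central-difference slope estimate, the mean squared residual, the noise-free two-gene test data (the closed-form solution of dX1/dt = 2 - X1, dX2/dt = X1 - X2 with X1(0) = X2(0) = 1), and the function names are all assumptions.

```python
import math


def central_slopes(series, dt):
    """Estimate dX/dt at the interior time points by central differences."""
    n_genes = len(series[0])
    return [
        [(series[k + 1][i] - series[k - 1][i]) / (2.0 * dt) for i in range(n_genes)]
        for k in range(1, len(series) - 1)
    ]


def subproblem_error(i, series, slopes, alpha_i, beta_i, g_i, h_i):
    """Mean squared residual of the decoupled equation for gene i.

    Only the parameters of gene i appear; the levels of the other genes
    are read from the observed time series instead of being simulated.
    """
    total = 0.0
    for x, slope in zip(series[1:-1], slopes):
        production = alpha_i * math.prod(xj ** gij for xj, gij in zip(x, g_i))
        degradation = beta_i * math.prod(xj ** hij for xj, hij in zip(x, h_i))
        total += (slope[i] - (production - degradation)) ** 2
    return total / len(slopes)


# Noise-free data from the hypothetical two-gene system described above.
dt = 0.01
times = [k * dt for k in range(501)]
series = [[2.0 - math.exp(-t), 2.0 - math.exp(-t) - t * math.exp(-t)] for t in times]
slopes = central_slopes(series, dt)

# Subproblem for gene 2: compare the true parameters with a wrong kinetic order.
error_true = subproblem_error(1, series, slopes, 1.0, 1.0, [1.0, 0.0], [0.0, 1.0])
error_wrong = subproblem_error(1, series, slopes, 1.0, 1.0, [0.0, 0.0], [0.0, 1.0])
print(error_true < error_wrong)
```

In an actual inference method an optimizer would minimize such a residual over the parameters of each gene; with noisy measurements the slope estimates degrade quickly, so smoothing of the time series is usually required before differentiation.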
