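Conclusions for iGEC - Optimizing Microarray Data Analysis Using Intelligent Genetic Algorithms - Design of a Linguistically Interpretable Gene Expression Classifier and Reconstruction of Genetic Network Models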

Microarray data analysis and gene expression classification are important research topics in bioinformatics, and how to design an accurate, compact, and linguistically interpretable classifier is the major concern of this study. We proposed an interpretable gene expression classifier, named iGEC, for microarray data analysis. The design of iGEC covers almost all aspects of designing compact fuzzy rule-based classification systems: gene selection, rule selection, membership function tuning, consequent class determination, and certainty grade tuning. Consequently, an efficient optimization algorithm, IGA, is used to solve the resultant optimization problem with a large number of parameters.

The superiority of the proposed iGEC was evaluated by computer simulation on eight gene expression data sets. The experimental results reveal that, compared with the existing fuzzy classifier, the proposed method obtains interpretable classifiers with an accurate and compact fuzzy rule base. iGEC is an efficient tool for the analysis of gene expression profiles. Furthermore, the proposed iGEC can be extended to an interpretable scoring fuzzy classifier, named iSFC, which can effectively quantify the certainty grades of samples belonging to each class.

Chapter 5 Inference of Genetic Network

In this thesis, we propose an intelligent two-stage evolutionary algorithm (iTEA) to efficiently infer the S-system models of large-scale genetic networks from small-noise gene expression profiles using a single-processor PC. To cope with the curse of dimensionality, the proposed algorithm consists of two stages, each of which uses a divide-and-conquer strategy.

The optimization problem is first decomposed into N subproblems, each having 2(N + 1) parameters. At the first stage, each subproblem is solved using the novel intelligent genetic algorithm (IGA), which is a specific variant of the intelligent evolutionary algorithm [38].

The intelligent crossover of IGA applies orthogonal experimental design (OED) [60, 61, 62] to speed up the search by using a systematic reasoning method instead of the conventional generate-and-go method of GA. At the second stage, the N solutions obtained for the N subproblems are combined and refined using an OED-based simulated annealing algorithm (OSA) [63] for handling noisy gene expression profiles. The effectiveness of iTEA is evaluated using simulated expression patterns with and without noise. It will be shown that: 1) IGA is efficient enough to solve the subproblems; 2) IGA is significantly superior to the existing method SPXGA [29] in solving the subproblems; and 3) iTEA performs well in inferring S-system models of genetic networks from small-noise gene expression profiles.
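The idea of replacing generate-and-go trials with systematic reasoning can be illustrated with a generic OED-style recombination. The sketch below is not the thesis's IGA crossover: it assumes each gene position is a two-level factor (take the value from parent A or from parent B), uses a standard two-level orthogonal array, and picks the better level of each factor from its main effect. The names orthogonal_array and oa_crossover and the minimized cost function are hypothetical.

```python
def orthogonal_array(n_rows):
    """Two-level orthogonal array L_n(2^(n-1)); n_rows must be a power of two.

    Entry (r, c) is the parity of the bits shared by r and c (c = 1..n-1).
    """
    return [[bin(r & c).count("1") % 2 for c in range(1, n_rows)]
            for r in range(n_rows)]


def oa_crossover(parent_a, parent_b, cost):
    """Build one child by factor analysis over an orthogonal array (minimization).

    Factor j has level 0 (take parent_a[j]) or level 1 (take parent_b[j]).
    """
    n_factors = len(parent_a)
    n_rows = 1
    while n_rows - 1 < n_factors:      # smallest array with enough columns
        n_rows *= 2
    parents = (parent_a, parent_b)
    # effect[j][k] accumulates the cost of all trials with factor j at level k.
    effect = [[0.0, 0.0] for _ in range(n_factors)]
    for row in orthogonal_array(n_rows):
        trial = [parents[row[j]][j] for j in range(n_factors)]
        y = cost(trial)
        for j in range(n_factors):
            effect[j][row[j]] += y
    # For each factor keep the level with the smaller accumulated cost.
    return [parents[0 if effect[j][0] <= effect[j][1] else 1][j]
            for j in range(n_factors)]
```

With five factors the sketch evaluates 8 combinations instead of the 2^5 = 32 possible ones. The main-effect ranking is exact only when the cost is additive across factors; with interacting parameters it is a heuristic.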

5.1 The Investigated Problem

5.1.1 Problem Statement

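Generally, the genetic network inference problem using an S-system model is formulated as a parameter optimization problem with 2N(N + 1) S-system parameters (α_i, β_i, g_ij, h_ij) and the following objective function [29, 30, 31]:

$$ f = \sum_{i=1}^{N} \sum_{t=1}^{T} \left( \frac{X_{cal,i,t} - X_{exp,i,t}}{X_{exp,i,t}} \right)^2 \tag{5.1} $$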

where X_exp,i,t is the experimentally observed expression level of gene i at time t, X_cal,i,t is the numerically calculated expression level, N is the number of genes in the network, and T is the number of sampling points of the observed data. Once all S-system parameters are estimated, X_cal,i,t can be derived by using Eq. 2.2 and the given initial level X_exp,i,0.
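As a minimal sketch of how such an objective can be evaluated, the Python code below integrates an S-system and sums the squared relative errors between X_cal,i,t and X_exp,i,t over all genes and sampling points. It assumes that Eq. 2.2 is the standard S-system form dX_i/dt = α_i Π_j X_j^g_ij − β_i Π_j X_j^h_ij, that the objective is the sum of squared relative errors, and that samples are equally spaced by dt. The function names, the Runge-Kutta integrator, and the substeps parameter are illustrative choices, not taken from the thesis.

```python
import math


def s_system_rhs(x, alpha, beta, g, h):
    """Assumed S-system form: alpha_i*prod_j X_j^g_ij - beta_i*prod_j X_j^h_ij."""
    n = len(x)
    return [
        alpha[i] * math.prod(x[j] ** g[i][j] for j in range(n))
        - beta[i] * math.prod(x[j] ** h[i][j] for j in range(n))
        for i in range(n)
    ]


def rk4_step(f, x, step):
    """One classical fourth-order Runge-Kutta step for dx/dt = f(x)."""
    k1 = f(x)
    k2 = f([xi + 0.5 * step * k for xi, k in zip(x, k1)])
    k3 = f([xi + 0.5 * step * k for xi, k in zip(x, k2)])
    k4 = f([xi + step * k for xi, k in zip(x, k3)])
    return [xi + step / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def simulate(x0, alpha, beta, g, h, n_points, dt, substeps=10):
    """Return expression levels at t = 0, dt, ..., n_points*dt starting from x0."""
    f = lambda x: s_system_rhs(x, alpha, beta, g, h)
    x = list(x0)
    trajectory = [list(x0)]
    for _ in range(n_points):
        for _ in range(substeps):
            x = rk4_step(f, x, dt / substeps)
        trajectory.append(x)
    return trajectory


def objective(x_exp, alpha, beta, g, h, dt):
    """Sum of squared relative errors; x_exp[t][i] is gene i at sampling point t.

    x_exp[0] holds the initial levels, so the sum runs over t = 1..T.
    """
    n_points = len(x_exp) - 1
    x_cal = simulate(x_exp[0], alpha, beta, g, h, n_points, dt)
    return sum(((x_cal[t][i] - x_exp[t][i]) / x_exp[t][i]) ** 2
               for t in range(1, n_points + 1)
               for i in range(len(x_exp[0])))
```

Expression levels must stay positive for the fractional powers to be defined, which is why the relative error divides by X_exp,i,t without a guard in this sketch.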

Since an S-system model has many degrees of freedom, multiple sets of time-series data are generally used to enhance the probability of finding correct solutions.

Because experiments are costly, sufficient time-series data are generally difficult to obtain. Owing to the high degree of freedom, inference of the S-system model often yields multiple optimal solutions that fit the observed time-series data equally well [26, 29, 30, 31, 32, 33, 35]. The investigated problem is difficult because of its high degree of freedom, high dimensionality, multimodality, strong interaction among the parameters of the S-system model, and measurement noise. Therefore, it is hard to obtain a correct network structure with accurate parameter values. Generally, additional data or biological knowledge is needed to improve solution quality [26].

5.1.2 Useful Techniques

Two useful techniques for optimizing the objective function in Eq. 5.1 are introduced below. One is the problem decomposition strategy for large-scale genetic networks [32], and the other is the incorporation of a priori knowledge to reduce the computation cost [30, 31, 35].

Problem decomposition

Large-scale S-system inference problems are difficult to solve directly. Maki et al. [32] proposed an efficient strategy that divides the inference problem into N separate small subproblems. Each subproblem corresponds to one gene. The objective function of the i-th subproblem for gene i is as follows:

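$$ \text{minimize } f_i = \sum_{t=1}^{T} \left( \frac{X_{cal,i,t} - X_{exp,i,t}}{X_{exp,i,t}} \right)^2 \tag{5.2} $$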

For noise-free or small-noise gene expression profiles, the expression level X_cal,i,t of gene i at time t can be numerically calculated by using Eq. 2.2. Otherwise, the following modified differential equations are used for large-noise gene expression profiles [30, 31]:

$$ \frac{dX_i(t)}{dt} = \alpha_i \prod_{j=1}^{N} Y_j(t)^{g_{ij}} - \beta_i \prod_{j=1}^{N} Y_j(t)^{h_{ij}}, \qquad Y_j(t) = \begin{cases} X_i(t), & j = i \\ \hat{X}_j(t), & \text{otherwise} \end{cases} \tag{5.3} $$

where ˆX_j(t) is the expression level of gene j estimated directly from the observed time-series data. The estimated gene expression level X_cal,i,t for the i-th subproblem can then be obtained using Eq. 2.2 or Eq. 5.3, depending on the size of the measurement noise. However, obtaining an accurate ˆX_j is essential. To overcome the disadvantage of problem decomposition when dealing with data that contain large measurement noise [30], Kimura et al. [31] used a cooperative coevolutionary algorithm to solve the subproblems simultaneously, deriving ˆX_j from the best individuals of the subproblems, each of which is given as a solution of Eq. 5.3. It was shown empirically that the method slightly enhanced the probability of finding the correct interactions of a network using a PC cluster [31].
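The decomposition can be sketched as follows: only gene i is integrated, while the levels of all other genes are replaced by estimates taken directly from the observed profiles. The code assumes the decoupled form dX_i/dt = α_i Π_j Y_j^g_ij − β_i Π_j Y_j^h_ij with Y_i = X_i and Y_j = ˆX_j for j ≠ i, and it obtains ˆX_j by linear interpolation of equally spaced observations. The interpolation scheme and all function names are assumptions made for illustration; the cited works may estimate ˆX_j differently.

```python
import math


def interpolate(series, t, dt):
    """Linearly interpolate a profile sampled every dt (needs at least two samples)."""
    pos = min(max(t / dt, 0.0), len(series) - 1)
    lo = min(int(pos), len(series) - 2)
    frac = pos - lo
    return series[lo] * (1.0 - frac) + series[lo + 1] * frac


def simulate_gene(i, x_obs, alpha_i, beta_i, g_i, h_i, dt, substeps=10):
    """Integrate gene i alone; other genes follow their interpolated observations.

    x_obs[t][j] is the observed level of gene j at sampling point t.
    The subproblem has 2(N + 1) parameters: alpha_i, beta_i, g_i[0..N-1], h_i[0..N-1].
    """
    n = len(x_obs[0])
    columns = [[row[j] for row in x_obs] for j in range(n)]

    def rate(t, xi):
        y = [xi if j == i else interpolate(columns[j], t, dt) for j in range(n)]
        production = alpha_i * math.prod(y[j] ** g_i[j] for j in range(n))
        degradation = beta_i * math.prod(y[j] ** h_i[j] for j in range(n))
        return production - degradation

    step = dt / substeps
    xi = x_obs[0][i]
    levels = [xi]
    for k in range(len(x_obs) - 1):
        for s in range(substeps):
            t = (k * substeps + s) * step
            # Classical fourth-order Runge-Kutta step for the single gene.
            k1 = rate(t, xi)
            k2 = rate(t + step / 2, xi + step / 2 * k1)
            k3 = rate(t + step / 2, xi + step / 2 * k2)
            k4 = rate(t + step, xi + step * k3)
            xi += step / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        levels.append(xi)
    return levels


def subproblem_objective(i, x_obs, alpha_i, beta_i, g_i, h_i, dt):
    """Sum of squared relative errors of gene i over sampling points t = 1..T."""
    x_cal = simulate_gene(i, x_obs, alpha_i, beta_i, g_i, h_i, dt)
    return sum(((x_cal[t] - x_obs[t][i]) / x_obs[t][i]) ** 2
               for t in range(1, len(x_obs)))
```

Because each subproblem touches only the parameters of one gene, the N subproblems can be optimized independently, at the price of depending on how well the observed profiles of the other genes approximate their true levels.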

Adding a penalty term

In the S-system model, if there is no interaction between two genes i and j, the S-system parameters corresponding to the interaction, g_ij and h_ij, are zero. Because the connectivity of genetic networks is known to be sparse [79], a penalty term is added to the fitness function to reduce the search space and improve the accuracy of the inferred genetic network model [30, 31]:

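$$ \text{minimize } f_i = \sum_{t=1}^{T} \left( \frac{X_{cal,i,t} - X_{exp,i,t}}{X_{exp,i,t}} \right)^2 + c \sum_{j=1}^{N-I} \left( |G_{ij}| + |H_{ij}| \right) \tag{5.4} $$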

where c is the penalty weight and I is the maximum indegree, i.e., the maximal number of genes that directly affect gene i. G_ij and H_ij are given by rearranging g_ij and h_ij, respectively, in ascending order of their absolute values. The penalty term forces most of the kinetic orders (g_ij, h_ij) down to zero. If the number of genes that directly affect gene i is smaller than I, this term imposes no penalty. In such a case, the optimal solutions to the fitness functions in Eq. 5.2 and Eq. 5.4 are identical. To reduce the computation cost, the structure skeletalizing technique [35] was applied. This technique assigns a value of zero to the kinetic orders whose absolute values are less than a given threshold δ_s. In this study, δ_s = 3 × 10⁻².
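The penalty term and the skeletalizing step can be sketched in a few lines. The code assumes the penalty has the form c Σ_{j=1}^{N−I} (|G_ij| + |H_ij|), i.e., only the N − I smallest kinetic orders in absolute value are penalized, which matches the description that up to I regulators incur no penalty. The function names are hypothetical; the default threshold is the δ_s = 3 × 10⁻² stated in the text.

```python
def penalty(g_i, h_i, max_indegree, c):
    """Penalize the N - I smallest |g_ij| and |h_ij| of gene i (assumed form)."""
    n = len(g_i)
    G = sorted(abs(v) for v in g_i)   # ascending absolute values
    H = sorted(abs(v) for v in h_i)
    return c * sum(G[j] + H[j] for j in range(n - max_indegree))


def skeletalize(orders, threshold=3e-2):
    """Set kinetic orders with absolute value below the threshold to zero."""
    return [0.0 if abs(v) < threshold else v for v in orders]
```

For example, with g_i = [0.0, 0.5, 0.0, -0.8, 0.01], a maximum indegree of 2, and five genes, only the three smallest magnitudes are penalized, so the two strong interactions 0.5 and -0.8 are left free. Applying skeletalize first removes the residual 0.01 as well.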

5.2 The Proposed Intelligent Two-stage
