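METHODS - Large-Scale Analysis of Molecular Interactions and Biochemical Pathways in Biological Systems, Sub-project 2: Intelligent Optimization Methods for the Reconstruction and Analysis of Gene Networks (III)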

These proposed models describe networks through their inner parameters. As the scale of a network increases, the number of parameters that must be optimized also increases significantly. An effective computational gene network model should reproduce the observed dynamic gene expression and reflect the regulatory relationships between genes. For this reason, gene networks are often reconstructed from observed gene expression profiles, and describing a network appropriately requires optimizing the inner parameters of the model against the observed data. Because of limited resources and time, the available data are too few to prevent model uncertainty. Optimization methods are therefore proposed to address model uncertainty when gene networks are modeled from insufficient and noisy data.

3.1 Efficient and robust method: GRNet

Network component analysis (NCA) is a useful analysis tool for gene regulatory networks (GRNs). The main concept of NCA is shown in Figure 1.2.

The expression level of a gene is determined by the transcription factor (TF) activities and the control strength (CS) of each TF on that gene. The linear model of NCA is shown in equation (2):

$$E = CS \cdot TFA + \Gamma \tag{2}$$

where E is the matrix of gene expression profiles measured by DNA microarray, CS holds the regulation strengths between TFs and genes, TFA represents the activities of the associated transcription factors, and Γ is the residual. NCA decomposes the data matrix E into CS and TFA by minimizing Γ under the constraint that the network structure, i.e., the non-zero pattern of the matrix CS, is conserved [1]. E is the known M × T matrix that contains M genes and T time points under a specific condition. CS is the unknown M × N matrix for N TFs, so TFA is the unknown N × T matrix. Both are estimated from the expression data and the known TF-gene connectivity. A mathematical model of NCA is defined as equation (3).

We aim to optimize both the signs and the magnitudes of the TF activities and CS while accounting for noise in the gene expression data. The initial values of CS encode known gene-TF regulations from the public literature and databases, and take values in {1, -1, 0} to represent up-regulation, down-regulation, and no regulation, respectively.
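As a minimal sketch of the linear model in equation (2) and of the sign-coded initialization of CS, the following Python snippet builds an initial CS from a small, made-up connectivity table and evaluates CS · TFA while respecting the non-zero pattern. The matrix sizes, the numeric values, and the names `structure_of` and `predict_expression` are illustrative assumptions, not part of GRNet.

```python
# Hypothetical prior knowledge for M = 3 genes and N = 2 TFs:
# 1 = up-regulation, -1 = down-regulation, 0 = no regulation.
cs_init = [
    [1.0, 0.0],
    [-1.0, 1.0],
    [0.0, -1.0],
]

# Made-up TF activities for N = 2 TFs over T = 4 time points.
tfa = [
    [0.5, 1.0, 1.5, 2.0],
    [1.0, 1.0, 0.5, 0.0],
]


def structure_of(cs):
    """Non-zero pattern of CS; NCA keeps this pattern fixed while fitting."""
    return [[value != 0 for value in row] for row in cs]


def predict_expression(cs, tfa, structure):
    """Return CS . TFA, ignoring CS entries outside the known structure."""
    n_tf, n_time = len(tfa), len(tfa[0])
    return [
        [
            sum(cs[i][j] * tfa[j][t] for j in range(n_tf) if structure[i][j])
            for t in range(n_time)
        ]
        for i in range(len(cs))
    ]


structure = structure_of(cs_init)
e_hat = predict_expression(cs_init, tfa, structure)  # M x T estimate of E
```

Here `e_hat` plays the role of the estimated expression matrix; in an actual fit, the non-zero entries of CS and all entries of TFA would be adjusted by the optimizer.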

The model reconstruction is formulated as an optimization problem in which the least square error (LSE) between the known and the estimated expression data is the objective function to be minimized. LSE is defined in equation (4). The high performance of GRNet arises mainly from the orthogonal simulated annealing algorithm [36] used to solve the large-scale optimization problem.

$$LSE = \left([E] - [cs]\,[tfa]\right)^2 \tag{4}$$

Beyond achieving a lower LSE, the regulations between TFs and genes should be examined to better understand the estimated GRN. An evaluation function is defined in equation (5). We compare the signs of the final [cs] with those of [CS] from prior knowledge, which indicate whether a TF induces or represses a specific gene. This information helps to identify the precise TF-gene network structure where the information from the literature or published databases is unknown or contradicted.
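The objective in equation (4) and the sign comparison against prior knowledge can be sketched as follows. This is a minimal illustration that assumes the LSE is the sum of squared entries of the residual matrix E − cs · tfa. The function `sign_agreement` is a hypothetical stand-in that reports the fraction of known regulations whose sign is preserved; it is not the evaluation function of equation (5).

```python
def lse(e, cs, tfa):
    """Sum of squared residuals between E and the product cs . tfa."""
    total = 0.0
    for i, row in enumerate(e):
        for t, observed in enumerate(row):
            estimated = sum(cs[i][j] * tfa[j][t] for j in range(len(tfa)))
            total += (observed - estimated) ** 2
    return total


def sign(x):
    """Return 1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_agreement(cs_final, cs_prior):
    """Fraction of known (non-zero) prior regulations whose sign is kept."""
    known = [
        (final, prior)
        for row_final, row_prior in zip(cs_final, cs_prior)
        for final, prior in zip(row_final, row_prior)
        if prior != 0
    ]
    if not known:
        return 1.0
    kept = sum(1 for final, prior in known if sign(final) == sign(prior))
    return kept / len(known)


# Toy data: 2 genes, 1 TF, 2 time points (all values are made up).
e = [[1.0, 2.0], [-1.0, -2.0]]
tfa = [[1.0, 2.0]]
cs_prior = [[1.0], [1.0]]    # prior knowledge: both genes are induced
cs_final = [[1.0], [-1.0]]   # fitted strengths: the second gene flips sign

print(lse(e, cs_final, tfa))               # 0.0, the fit reproduces E
print(sign_agreement(cs_final, cs_prior))  # 0.5, one of two signs is kept
```

In this toy case the fitted [cs] explains the data perfectly but contradicts the prior sign for one gene, which is the kind of disagreement the sign check is meant to surface.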

... standalone TF. TFs acting in combination on a promoter are also an important form of regulation that controls gene expression. Hence, Temporal GRNet (tGRNet) is proposed to obtain more straightforward GRNs. The idea of tGRNet is that the strength of [CS] should be determined over time. The control strength for each gene can be summarized from TFs, TFs in combination, and even feedback from genes at the next time points. This model is extended from NCA and is defined as temporal NCA (tNCA). tGRNet implements tNCA to validate its practicability for GRNs.

... pseudo expression data, N(0, σ²) is a normally distributed random number function with zero mean and variance σ². Here, σ is assigned as $X_{obs,i,t} \times k\%$.

... performance of using multiple data sets is better than that of using a single data set in terms of solution quality, and 3) the effectiveness of iTEAP using a single data set is close to that of iTEA using two real data sets. The obtained model can be validated by biological experiments and known knowledge.
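The noise model mentioned above, with σ set to k% of each observed value, can be sketched as follows. The snippet assumes that the noise is added to the observed value, which the surviving text does not state explicitly; the function name `make_pseudo_data` and the sample values are hypothetical.

```python
import random


def make_pseudo_data(x_obs, k_percent, rng):
    """Perturb each observed value with zero-mean Gaussian noise.

    Assumption: the noise is additive, with sigma = |x_obs[i][t]| * k / 100.
    """
    return [
        [x + rng.gauss(0.0, abs(x) * k_percent / 100.0) for x in row]
        for row in x_obs
    ]


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
x_obs = [[1.0, 2.0, 4.0], [0.5, 0.25, 0.0]]
x_pseudo = make_pseudo_data(x_obs, k_percent=5, rng=rng)
```

Because σ scales with the observed value, an entry that is exactly zero receives no noise under this scheme.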

The goal of iAEA, an improved version of iTEA and iTEAP, in GNP is to cope with the infinitely many solutions of the S-system model and to establish large-scale GRNs efficiently by incorporating domain knowledge of gene regulation into the proposed evolutionary computation method. The intelligent genetic algorithm is used with a novel chromosome encoding. Because the connectivity of genetic networks is known to be sparse, this domain knowledge is built into the chromosome encoding. Let I be the maximum in-degree, i.e., the maximal number of genes that directly affect a gene. iAEA uses a hybrid encoding method in which a chromosome consists of regulation strengths, the indices of the regulating genes, and binary control parameters.

3.2.1 Chromosome encoding method

The chromosome representation for each gene i, shown in Figure 3.1, consists of three parts: 1) rate constants, 2) kinetic orders, and 3) control parameters. $\alpha_i$ and $\beta_i$ are rate constants that indicate the direction of mass flow. $g_{i,L_{ij}}$ and $h_{i,L_{ij}}$ are kinetic orders that reflect the intensity of the interaction from gene $L_{ij}$ to gene i, where $L_{ij} \in \{1, 2, \ldots, N\}$ and $j = 1, \ldots, I$. $M^g_{ij}$ is a mask parameter of a positive kinetic order: the value 1 means that the edge from gene $L_{ij}$ to gene i is connected in the structure of the gene regulatory network, and the value 0 means that the edge is disconnected. Similarly, $M^h_{ij}$ is a mask parameter of a negative kinetic order. $M^g_{ij}$ and $M^h_{ij}$ belong to {0, 1}. The two sets $\{\alpha_i, \beta_i, g_{i,L_{i1}}, \ldots, g_{i,L_{iI}}, h_{i,L_{i1}}, \ldots, h_{i,L_{iI}}\}$ and $\{L_{i1}, \ldots, L_{iI}, M^g_{i1}, \ldots, M^g_{iI}, M^h_{i1}, \ldots, M^h_{iI}\}$ of S-system parameters are real and integer values, respectively. There are 2×I+2 real variables and 3×I integer variables in our genetic algorithm.

Figure 3.1: Chromosome representation.
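The encoding can be sketched as a small data structure. The snippet below is an illustration only: it assumes the standard S-system form, in which the rate of change of gene i is a production term (rate constant α, kinetic orders g) minus a degradation term (rate constant β, kinetic orders h), and it applies the masks by skipping disconnected edges. The class name `GeneChromosome`, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneChromosome:
    """Hybrid encoding for one gene i with maximum in-degree I."""
    alpha: float           # rate constant of the production term
    beta: float            # rate constant of the degradation term
    g: List[float]         # I kinetic orders g_{i,L_ij}
    h: List[float]         # I kinetic orders h_{i,L_ij}
    regulators: List[int]  # I gene indices L_ij, each in 1..N
    mask_g: List[int]      # I binary masks M^g_ij
    mask_h: List[int]      # I binary masks M^h_ij

    def n_real(self) -> int:
        """Number of real-valued variables (2*I + 2)."""
        return 2 + len(self.g) + len(self.h)

    def n_integer(self) -> int:
        """Number of integer-valued variables (3*I)."""
        return len(self.regulators) + len(self.mask_g) + len(self.mask_h)

    def derivative(self, x: List[float]) -> float:
        """dX_i/dt of the S-system; x[k - 1] is the expression of gene k."""
        production = self.alpha
        degradation = self.beta
        for j, source in enumerate(self.regulators):
            level = x[source - 1]
            if self.mask_g[j]:  # edge active in the production term
                production *= level ** self.g[j]
            if self.mask_h[j]:  # edge active in the degradation term
                degradation *= level ** self.h[j]
        return production - degradation


# Made-up example with I = 2 regulators in a network of N = 3 genes.
chrom = GeneChromosome(
    alpha=2.0, beta=1.0,
    g=[1.0, 0.5], h=[2.0, 1.0],
    regulators=[2, 3],
    mask_g=[1, 0], mask_h=[0, 1],
)
rate = chrom.derivative([1.0, 3.0, 4.0])  # 2.0 * 3.0 - 1.0 * 4.0
```

With I = 2 the chromosome holds 6 real and 6 integer variables, matching the counts 2×I+2 and 3×I given in the text.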

3.2.2 iAEA Algorithm

The algorithm for solving subproblems is given as follows:

Step 1: Randomly set the connected state of gene regulation in each subproblem.

Step 2: Initialization: Randomly generate an initial population of Npop feasible individuals, each with 2×(I+1) real-valued parameters and 3×I integer-valued parameters.

Step 3: Evaluation: Evaluate the fitness values of all individuals.

Step 4: Selection: Use simple ranking selection, which replaces the worst Ps×Npop individuals with the best Ps×Npop individuals to form a new population, where Ps is the selection probability. Let Ibest be the best individual in the population.

Step 5: Crossover: Randomly select Pc×Npop individuals, including Ibest, where Pc is the crossover probability. Perform perturbation intelligent crossover operations on all selected pairs of parents.

Step 6: Mutation: Apply two different mutation operators to the real-valued and integer-valued parameters of the population with mutation probability Pm. To prevent the best fitness value from deteriorating, mutation is not applied to the best individual.

Step 7: Repeat Steps 2 to 6 50 times.

Step 8: From the 50 runs, select those in which the subproblem is solved well enough according to its fitness, and count the occurrences of gij and hij.

Step 9: Termination test: If the required fitness is achieved within these 50 runs, stop the algorithm. Otherwise, according to the statistical result from Step 8, set the connected state of each gene as fixed or unfixed, then go to Step 2.
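Two of the steps above can be sketched in a few lines of Python: the ranking selection of Step 4 and the edge statistics of Steps 8 and 9. This is a sketch under stated assumptions, not the iAEA implementation: it assumes that a lower fitness value is better, represents each successful run as a set of connected (target, source) edges, and uses a hypothetical frequency threshold of 0.8 to decide which edges become fixed. The perturbation intelligent crossover of Step 5 is not shown.

```python
from collections import Counter


def ranking_selection(population, fitness, ps):
    """Step 4: replace the worst ps*Npop individuals with copies of the best.

    Assumes that a lower fitness value is better (an error to be minimized).
    """
    n_replace = int(ps * len(population))
    ranked = sorted(population, key=fitness)
    if n_replace == 0:
        return ranked
    survivors = ranked[:len(ranked) - n_replace]
    return survivors + [list(ind) for ind in ranked[:n_replace]]


def edge_usage(successful_runs):
    """Step 8: count how often each (target, source) edge is connected."""
    counts = Counter()
    for edges in successful_runs:
        counts.update(edges)
    return counts


def fix_edges(counts, n_runs, threshold=0.8):
    """Step 9 (assumed rule): fix edges found in at least `threshold` of runs."""
    return {edge for edge, count in counts.items() if count / n_runs >= threshold}


# Toy population of one-parameter individuals; fitness is the value itself.
population = [[3.0], [1.0], [4.0], [2.0]]
selected = ranking_selection(population, fitness=lambda ind: ind[0], ps=0.25)

# Toy statistics over five successful runs.
runs = [{(1, 2), (1, 3)}, {(1, 2)}, {(1, 2), (2, 3)}, {(1, 2)}, {(1, 2), (1, 3)}]
fixed = fix_edges(edge_usage(runs), n_runs=len(runs))
```

In the toy data, only the edge (1, 2) appears in every run, so it is the only one fixed before the next round; the remaining edges stay free for the optimizer to switch on or off.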

3.3 Integration of gene network platform

The architecture of GNP is shown in Figure 3.2. The optimization core was completed in the first year, based on the proposed methods for reconstructing gene networks with or without transcription factors.

The two tasks in this year are integration of the biomolecular interaction data warehouse built in sub-project 3 and

The main components of GNP are described as follows:

• Optimization core: This unit serves as the computational core based on the proposed methods. Different computational groups are used to share information and balance the computation load. Parallel computation is introduced to optimize gene networks based on our previous work, “Developing a parallel intelligent optimization system based on evolutionary algorithm for genetic network modeling” (NSC-95-2221-E-009-116). Beyond those achievements, this project also considers transcription factors in dedicated GRN applications, as well as insufficient data and experimental noise in modeling.

• GNP portal site: A portal server that provides a user-friendly interface for biologists.

• Platform controller: The controller determines how the system works: it either queries the data warehouse or activates the optimization core to infer a gene network, based on the information available for the specific gene network.

• Biomolecular Interaction Data Warehouse: In sub-project 3, a data warehouse of gene networks was built and is kept up to date with the latest literature and databases.

GNP allows biologists to select genes of interest and reconstruct their gene network before performing experiments. In Figure 3.2, the grey arrows indicate the workflow between the components of GNP, and the numbers give the execution order when a request comes in. If no further prior knowledge is provided, the GNP portal lets users select the species they want. Additional options for TFs and constraints on gene-gene or gene-TF interactions can also be entered. The GNP controller receives the command to reconstruct the gene network and queries the biomolecular interaction data warehouse to check whether the stored data are sufficient (steps 3 and 4); if so, it returns the corresponding gene network information to the user. Otherwise, the GNP controller forwards the conditions to the optimization core, which optimizes the specific gene network (step 7), and asks the user to wait for the computation results. The computational cost depends on the scale of the gene network, and finding a nearly optimal solution may take a very long time. The GNP controller manages the job scheduler and the resource allocation between computational groups. After the optimization is done, the optimization core writes the best results to the data warehouse (steps 8 and 9) and notifies the biologists by email that the query results are available (step 10).
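The request flow described above can be illustrated with a toy controller: answer from the data warehouse when a network is already stored, otherwise queue a job for the optimization core and store the result when it finishes. Everything in this sketch is hypothetical, including the class name `GNPController`, its methods, the dictionary used as a warehouse, and the placeholder optimizer; it only mirrors the order of operations, not the actual GNP implementation.

```python
from collections import deque


class GNPController:
    """Toy controller: answer from the warehouse or queue an optimization job."""

    def __init__(self, warehouse):
        self.warehouse = warehouse  # maps a frozenset of genes to a network
        self.job_queue = deque()    # pending jobs for the optimization core

    def handle_request(self, genes, email):
        """Return a stored network if available, otherwise queue the request."""
        key = frozenset(genes)
        network = self.warehouse.get(key)
        if network is not None:
            return {"status": "done", "network": network}
        self.job_queue.append((key, email))
        return {"status": "queued"}

    def complete_job(self, optimize):
        """Run the oldest queued job, store its result, return whom to notify."""
        key, email = self.job_queue.popleft()
        self.warehouse[key] = optimize(key)
        return email


warehouse = {frozenset({"geneA", "geneB"}): [("geneA", "geneB")]}
controller = GNPController(warehouse)

first = controller.handle_request(["geneA", "geneB"], "user@example.org")
second = controller.handle_request(["geneC", "geneD"], "user@example.org")

# Placeholder optimizer: returns the sorted gene names instead of a real network.
notified = controller.complete_job(lambda key: sorted(key))
third = controller.handle_request(["geneC", "geneD"], "user@example.org")
```

The second request is queued because nothing is stored for those genes; once the job completes and the warehouse is updated, the same request is answered directly.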

Because the exchange of information between GNP components may involve large amounts of data and high computational costs, two queue systems are used for the optimization core: divide and conquer, and parallel computing.

We use a distributed architecture for GNP, which offers the following benefits:

• It is easy to extend if the computational power is insufficient. A cluster for heavy computational requirements has been available for years.

• The GNP controller dispatches tasks to the optimization core and updates the data warehouse automatically. In addition to collecting data from other databases and the literature, the data warehouse for gene networks is updated more frequently as more biologists use GNP.

Figure 3.2: The architecture of GNP for reconstructing gene networks. The yellow, green, and blue areas represent the parts addressed in the first to third years of our sub-project. The red area is handled by another project.
