Chapter 2 Materials and Methods
surface area of atom type i. We use a probe radius of 1.4 Å, approximating a water molecule, to calculate the atomic solvent-accessible surface area. We divide all atoms into six atom types, the ones used by Eisenberg et al. [50, 51]: C, non-charged O, non-charged N, S, charged O, and charged N. Table 1 gives a detailed description of each energy term.
2.2 Optimization
The core idea of our evolutionary approach was to design multiple operators that cooperate through the family competition model [46, 47], which is similar to a local search procedure. The Gaussian and Cauchy mutations, both continuous genetic operators, search for the weights of the energy terms.
2.2.1 Representation and Initiation
Our optimization method works as follows. It randomly generates a starting population of N solutions, each a set of weights for the energy function. Each solution is represented as a set of three n-dimensional vectors (x^i, σ^i, ψ^i), where n is the number of energy terms of the energy function, i = 1, …, N, and N is the population size. The vector x holds the adjustable variables to be optimized, namely the weights of the energy terms. σ and ψ are the step-size vectors of the decreasing-based Gaussian mutation and the self-adaptive Cauchy mutation, respectively.
In other words, each solution x is associated with some parameters for step-size control. The initial step size σ is 0.8 and ψ is 0.2.
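As a rough illustration of this representation, the following Python sketch builds a starting population. The names `Solution` and `init_population` and the sampling range for the initial weights are assumptions made for the example; the text specifies only the population size N, the number of energy terms n, and the initial step sizes 0.8 and 0.2.

```python
import random
from dataclasses import dataclass


@dataclass
class Solution:
    x: list      # weights of the n energy terms (the adjustable variables)
    sigma: list  # step sizes for the decreasing-based Gaussian mutation
    psi: list    # step sizes for the self-adaptive Cauchy mutation


def init_population(pop_size, n_terms, low=0.0, high=1.0, seed=None):
    """Randomly generate pop_size solutions.

    The range [low, high] for the initial weights is an assumption;
    the initial step sizes 0.8 and 0.2 are the values given in the text.
    """
    rng = random.Random(seed)
    return [
        Solution(
            x=[rng.uniform(low, high) for _ in range(n_terms)],
            sigma=[0.8] * n_terms,
            psi=[0.2] * n_terms,
        )
        for _ in range(pop_size)
    ]


# Example: 5 solutions for an energy function with 8 terms.
population = init_population(pop_size=5, n_terms=8, seed=0)
```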
2.2.2 Family Competition Evolutionary Algorithm (FCEA)
After initializing the solutions, the algorithm enters the main evolutionary loop, which consists of two stages in every iteration: decreasing-based Gaussian mutation and self-adaptive Cauchy mutation. Each stage generates a new quasi-population (with N solutions) that serves as the parent population of the next stage. Both stages apply a general procedure, “FC_adaptive”, and differ only in the working population and the mutation operator. The FC_adaptive procedure takes two parameters, the working population (P, with N solutions) and the mutation operator (M), and generates a new quasi-population.
The main work of FC_adaptive is to produce offspring and then conduct the family competition. Each individual in the population sequentially becomes the “family father.” With a probability pc, this family father and another solution randomly chosen from the rest of the parent population are used as parents for a recombination operation. Then the new offspring, or the family father if recombination is not conducted, is operated on by differential evolution to generate a quasi-offspring. Finally, the working mutation operates on the quasi-offspring to generate a new offspring. For each family father, this procedure is repeated L times, where L is called the family competition length. Among these L offspring and the family father, only the one with the lowest scoring function value survives. Since we create L children from one “family father” and perform a selection, this is a family competition strategy. This method avoids premature convergence of the population while keeping the spirit of local searches. Finally, the FC_adaptive procedure generates N solutions because it forces each solution of the working population to have one final offspring.
In the following, the genetic operators are briefly described. We use a = (x^a, σ^a, ψ^a) to represent the “family father” and b = (x^b, σ^b, ψ^b) as another parent. The offspring of each operation is represented as c = (x^c, σ^c, ψ^c). The symbol x_i^s denotes the ith adjustable optimization variable of a solution s, ∀i ∈ {1, …, n}.
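The family competition described above can be sketched in Python as follows. This is a simplified illustration, not the thesis implementation: solutions are plain lists of floats, the operators are passed in as functions, the differential-evolution step is omitted because the text does not specify it, and the default values `p_c=0.2` and `length=2` are placeholders chosen for the example.

```python
import random


def fc_adaptive(population, mutate, recombine, score, p_c=0.2, length=2, rng=None):
    """One FC_adaptive pass: every solution yields exactly one survivor.

    mutate(solution, rng), recombine(a, b, rng) and score(solution) are
    caller-supplied; p_c and length are illustrative defaults.
    """
    rng = rng or random.Random()
    next_population = []
    for idx, father in enumerate(population):
        others = population[:idx] + population[idx + 1:]
        family = [father]
        for _ in range(length):  # repeated L times per family father
            if others and rng.random() < p_c:
                child = recombine(father, rng.choice(others), rng)
            else:
                child = father  # recombination not conducted
            # (a differential-evolution step would act on `child` here)
            family.append(mutate(child, rng))
        # Family competition: the lowest score among father and offspring survives.
        next_population.append(min(family, key=score))
    return next_population


# Toy usage: minimise the sum of squares of three variables.
def toy_score(x):
    return sum(v * v for v in x)


def toy_mutate(x, rng):
    return [v + rng.gauss(0.0, 0.1) for v in x]


def toy_recombine(a, b, rng):
    return [rng.choice(pair) for pair in zip(a, b)]


rng = random.Random(1)
parents = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(4)]
children = fc_adaptive(parents, toy_mutate, toy_recombine, toy_score, rng=rng)
```

Because the family father takes part in the competition, a survivor never scores worse than the solution it replaces, which is the local-search character the text refers to.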
2.2.3 Recombination and Mutation Operators
We implemented modified discrete recombination and intermediate recombination. A recombination operator takes the “family father” (a) and another solution (b) randomly selected from the working population. The former generates a child as follows:
$$ x_i^c = \begin{cases} x_i^a & \text{with probability } 0.8, \\ x_i^b & \text{with probability } 0.2. \end{cases} $$
The generated child inherits genes from the “family father” with the higher probability, 0.8. Intermediate recombination works as
$$ w_i^c = \frac{w_i^a + w_i^b}{2}, $$
where w is σ or ψ, depending on the mutation operator applied in the FC_adaptive procedure. Intermediate recombination operates only on the step-size vectors, and modified discrete recombination is used for the adjustable vectors (x).
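A minimal Python sketch of the two recombination operators is given below. It assumes that modified discrete recombination picks each gene from the family father with probability 0.8 and from the other parent otherwise, and that intermediate recombination takes the arithmetic mean of the two parents' step sizes. The function names are hypothetical.

```python
import random


def discrete_recombination(x_a, x_b, rng, p_father=0.8):
    """Each gene comes from the family father with probability p_father."""
    return [a if rng.random() < p_father else b for a, b in zip(x_a, x_b)]


def intermediate_recombination(w_a, w_b):
    """Component-wise mean of two step-size vectors (sigma or psi)."""
    return [(a + b) / 2.0 for a, b in zip(w_a, w_b)]


rng = random.Random(0)
child_x = discrete_recombination([1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0], rng)
child_w = intermediate_recombination([0.8, 0.2], [0.4, 0.6])
```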
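After the recombination, a mutation operator, the main operator of our evolutionary approach, is applied to mutate the adjustable variables (x). Gaussian and Cauchy mutations are accomplished by first mutating the step size (w) and then mutating the adjustable variable x:
$$ w_i^c = w_i^a \, A(\cdot), $$
$$ x_i^c = x_i^a + w_i^c \, D(\cdot), $$
where w_i and x_i are the ith components of w and x, respectively. A(·) is evaluated as exp[τ′N(0,1) + τN_i(0,1)] if the mutation is a self-adaptive mutation, where N(0,1) is the standard normal distribution and N_i(0,1) is a new value with distribution N(0,1) that must be regenerated for each index i. When the mutation is a decreasing-based mutation, A(·) is defined as a fixed decreasing rate γ = 0.95. D(·) is evaluated as N(0,1) or C(1) if the mutation is, respectively, Gaussian or Cauchy. For example, the self-adaptive Cauchy mutation is defined as
$$ \psi_i^c = \psi_i^a \exp[\tau' N(0,1) + \tau N_i(0,1)], $$
$$ x_i^c = x_i^a + \psi_i^c \, C_i(t). $$
The parameters τ and τ′ are set according to the suggestion of evolution strategies. A random variable is said to have the Cauchy distribution [C(t)] if it has the density function f(y; t) = (t/π)/(t² + y²), −∞ < y < ∞. In this thesis, t is set to 1.
Our decreasing-based Gaussian mutation uses the step-size vector σ with a fixed decreasing rate γ = 0.95 and works as
$$ \sigma_i^c = \gamma \, \sigma_i^a, $$
$$ x_i^c = x_i^a + \sigma_i^c \, N_i(0,1). $$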
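The two mutation operators can be sketched in Python as follows. Both first update the step size and then perturb x with it. Several details here are assumptions, not values taken from this chapter: the log-normal step-size update used for the self-adaptive case is the standard evolution-strategies rule, the two learning rates are commonly used defaults, and Cauchy samples are drawn by inverting the Cauchy cumulative distribution function.

```python
import math
import random

GAMMA = 0.95  # fixed decreasing rate given in the text


def cauchy(rng, t=1.0):
    """Draw from the Cauchy distribution C(t) by inverse-CDF sampling."""
    return t * math.tan(math.pi * (rng.random() - 0.5))


def gaussian_mutation(x, sigma, rng, gamma=GAMMA):
    """Decreasing-based Gaussian mutation: shrink sigma, then perturb x."""
    new_sigma = [gamma * s for s in sigma]
    new_x = [xi + s * rng.gauss(0.0, 1.0) for xi, s in zip(x, new_sigma)]
    return new_x, new_sigma


def cauchy_mutation(x, psi, rng, t=1.0):
    """Self-adaptive Cauchy mutation with a log-normal step-size update."""
    n = len(x)
    tau_global = 1.0 / math.sqrt(2.0 * n)            # assumed learning rate
    tau_local = 1.0 / math.sqrt(2.0 * math.sqrt(n))  # assumed learning rate
    shared = rng.gauss(0.0, 1.0)  # one N(0,1) draw shared by all components
    new_psi = [
        p * math.exp(tau_global * shared + tau_local * rng.gauss(0.0, 1.0))
        for p in psi
    ]
    new_x = [xi + p * cauchy(rng, t) for xi, p in zip(x, new_psi)]
    return new_x, new_psi


rng = random.Random(0)
gx, gsigma = gaussian_mutation([0.5, 0.5, 0.5], [0.8, 0.8, 0.8], rng)
cx, cpsi = cauchy_mutation([0.5, 0.5, 0.5], [0.2, 0.2, 0.2], rng)
```

The heavy tails of the Cauchy distribution occasionally produce large jumps, whereas the shrinking Gaussian step size favors progressively finer local adjustments.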
2.2.4 Objective Function of FCEA
For optimization, the energy function becomes
$$ E = \sum_i w_i \mu_i , \tag{15} $$
where w_i is the weight of energy term µ_i. Given an energetic weight set w (in this thesis, 8 and 9 weights for MOLSIM and GEMSCORE, respectively), we used FCEA to look for the most suitable energy function by minimizing a well-developed objective function. The eight energy terms of MOLSIM are defined in Equations 2 and 5, and the nine energy terms of GEMDOCK are defined in Equations 7 and 10.
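In code, evaluating the weighted energy of one structure is a dot product between the weight vector and the vector of energy-term values. The short Python sketch below shows this; the function name and the numbers are made up for illustration.

```python
def total_energy(weights, terms):
    """Weighted sum of energy terms for a single structure."""
    if len(weights) != len(terms):
        raise ValueError("weights and terms must have the same length")
    return sum(w * mu for w, mu in zip(weights, terms))


# Two illustrative energy terms with weights 1.0 and 0.5.
energy = total_energy([1.0, 0.5], [2.0, 4.0])
```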
A successful energy function must not only distinguish correctly between native and native-like structures but must also do so convincingly. In this regard, the quality of an energy function is judged by the size of the energy gap it assigns between the native structure and the average energy of the non-native structures. A commonly used measure for assessing this quality is the Z-score. We defined the Z-score as follows:
$$ Z = \frac{E_{native} - \langle E \rangle}{\Delta E}, \qquad \Delta E = \sqrt{\langle (E_d - \langle E \rangle)^2 \rangle}, $$
where E_native is the energy value of a native structure, E_d is the energy score of a decoy structure d, <E> is the mean of the energy values of all non-native structures in a decoy set, and ∆E is the standard deviation of those energy values. The Z-score measures the energy separation between the native structure and the other decoy structures in units of the standard deviation of the ensemble. The Z-score above is for a single protein only. When we seek the weights of an energy function, we want to make the Z-scores of “all” proteins simultaneously low enough, that is, negative and large in absolute value.
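The Z-score of a single protein can be computed as in the Python sketch below. It assumes the Z-score is the native energy minus the mean decoy energy, divided by the standard deviation of the decoy energies. Using the population standard deviation (`statistics.pstdev`) instead of the sample standard deviation is a further assumption of this example.

```python
import statistics


def z_score(e_native, decoy_energies):
    """Energy gap between native and mean decoy energy, in standard deviations."""
    mean = statistics.mean(decoy_energies)
    spread = statistics.pstdev(decoy_energies)  # population standard deviation
    if spread == 0.0:
        raise ValueError("decoy energies have zero spread")
    return (e_native - mean) / spread


# A native structure well below three illustrative decoy energies.
z = z_score(-10.0, [-2.0, 0.0, 2.0])
```

A well-discriminating energy function gives a strongly negative value here, since the native energy lies far below the decoy mean.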
We need an objective function that reflects the Z-scores of the many proteins in the training set. There are many ways to construct such an objective function. For example, if we minimize the sum of Z-scores over proteins, we may obtain an energy function that gives small Z-scores for many proteins but large Z-scores for a few proteins. This is not desired because the Z-scores of “all” proteins need to be small. This problem arises partly because proteins in the training set have decoys of different quality and the current energy function allows some proteins to be recognized more easily than others. Another approach is to minimize the maximum Z-score over the proteins in the training set. However, the optimized weights are then probably determined by a few proteins, which may be exceptional ones, for example structures with some errors. This tends to lead to unreasonable energy weights.
Some intermediate approaches have been proposed by Koretke et al. [56] as well as by Mirny and Shakhnovich [57].
To avoid the effects of a few proteins and minimize the Z-scores of “all” proteins, we chose normalized Z-scores. When the Z-scores of some proteins are very small, it is difficult to optimize other proteins that have large Z-scores. To reduce this ill effect, we normalize the Z-scores with
z z
where f(z) maps Z-scores to values between 0 and 1.
Even if the Z-score is low enough, this does not guarantee that the native structure can be distinguished from the decoy set, because the Z-score is based on the difference between the energy value of the native structure and the mean of the energy values of all non-native structures in a decoy set. To ensure that the energy value of a native structure is the lowest in a decoy set, we added a related measure, the Z’-score [30], given as
$$ Z' = \frac{E_{native} - E_{lowest}}{\Delta E}, $$
where E_lowest is the lowest energy value among the non-native structures in the decoy set, and ∆E is the same as that used in the Z-score (in Equation 17). In contrast to the Z-score, the Z’-score gives a quantitative measure of how well separated the native structure is from its lowest-energy neighbor within the decoy set. We applied the same normalization to the Z’-score. Finally, the objective function of our evolutionary approach is
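The difference between the two measures can be seen in a small Python sketch. It assumes the Z’-score is the native energy minus the lowest decoy energy, divided by the same standard deviation as in the Z-score, and it again uses the population standard deviation. The function names and energies are illustrative only.

```python
import statistics


def z_score(e_native, decoy_energies):
    """Gap to the mean decoy energy, in units of the decoy standard deviation."""
    spread = statistics.pstdev(decoy_energies)
    return (e_native - statistics.mean(decoy_energies)) / spread


def z_prime_score(e_native, decoy_energies):
    """Gap to the lowest-energy decoy, in the same units as the Z-score."""
    spread = statistics.pstdev(decoy_energies)
    return (e_native - min(decoy_energies)) / spread


decoys = [-2.0, 0.0, 2.0]
# The native structure lies below the decoy mean but above the best decoy:
z_ambiguous = z_score(-1.0, decoys)         # negative, looks acceptable
zp_ambiguous = z_prime_score(-1.0, decoys)  # positive, a decoy scores lower
# The native structure lies below every decoy:
zp_clear = z_prime_score(-3.0, decoys)      # negative
```

In the first case the Z-score alone would suggest a good energy function, while the positive Z’-score reveals that a decoy has lower energy than the native structure.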
$$ S = \sum_{i=1}^{N} \left[ f(Z_i) + f(Z'_i) \right], $$
where i indexes the proteins and N is the number of proteins in the training set. We minimize the score S to find the optimal weights for the energy terms of our energy functions.
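A sketch of how such a score could be evaluated in Python is given below. It assumes that S adds the normalized Z-score and the normalized Z’-score of every protein in the training set. The logistic function used for normalization is only a stand-in that maps any value into the interval between 0 and 1; it is an assumption of this example and not necessarily the normalization used in the thesis.

```python
import math


def logistic(z):
    """Stand-in normalization to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def objective(scores, normalize=logistic):
    """Sum of normalized Z and Z' over proteins; scores is a list of (Z, Z') pairs."""
    return sum(normalize(z) + normalize(zp) for z, zp in scores)


# Illustrative (Z, Z') pairs for a three-protein training set.
good = objective([(-5.0, -2.0), (-4.0, -1.5), (-6.0, -3.0)])
poor = objective([(-1.0, 1.0), (-0.5, 2.0), (-1.5, 0.5)])
```

Because the normalization is bounded, no single protein can contribute more than a fixed amount to S, which limits the influence of a few exceptional proteins on the optimized weights.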