Evolutionary Algorithms - 以高斯演化方式預測蛋白質-蛋白質嵌合位置

Chapter 1 Introduction

1.3 Evolutionary Algorithms

An evolutionary algorithm is based on ideas genetics and natural selection to solve problems, especially difficult optimization problems. The first step is coded to a chromosome, which is represented as solution and initialed randomly. The second step is that find out batter solution in searching space by mutations and crossovers.

The final step is to find a reliable solution by repeating mutation, crossover, and selection again and again. Selection is most important function which is decided to pick up the batter solution in an evolutionary algorithm.

In general the coding function of genetic algorithms has two main developed but strongly related implementations. First coding function is binary-represented which may introduce an additional multimodality, making the combined objective function more complex than the original function. To achieve better performance, second coding function is real-coded genetic algorithms. However, they generally employ random-based mutations and hence, still require lengthy local searches near local minimums. In contrast, evolution strategies and evolutionary programming [52], mainly use real-valued representation and focus on self-adaptive Gaussian mutations.

This type of mutation has succeeded in continuous optimization and has been widely regarded as a good operator for local searches. Unfortunately, experiments show that self-adaptive Gaussian mutation leaves individuals trapped near local minimums for rugged functions.

GEM is proposed to improve the above approaches and has a good performance for many problems. GEM is a multi-operator approach that combines three mutation operators: decreasing-based Gaussian mutation, self-adaptive Gaussian mutation, and self-adaptive Cauchy mutation. It incorporates family competition and adaptive rules for controlling step sizes to construct the relationship among these three operators. To balance the search power of exploration and exploitation, each of operators is

designed to compensate for the disadvantages of the other. The details of GEM were described as previous works [21, 23], and had been successfully applied for some specific problems, such as protein-ligand docking, drug screening, and protein side-chain prediction [18-20, 22, 23]. In this study, we use the GEM to find out the most suitable weight of each parameter for protein-protein interacting interface prediction.

8 2 Materials and Methods

2.1 Flowchart of our method

Prepare training data set and optimize the parameters

Figure 1. Flowchart of the GEM training and prediction strategies

Figure 1 shows the scheme of our method for predicting protein-protein

interaction binding site. First, we prepared a training data set which consisted of 52 complexes (104 chains) from a widely used benchmark [53]. For each chain in this data set, the protein surface and interacting interface are generated. Based on these data, we calculated all surface residues scores and trained the GEM parameters to distinguish interacting residues from non-interacting residues by ranking scores of surface residues. The surface residue with the score lower than a given threshold was predicted as an interface residue. To each predicted area, we apply specificity and sensitivity [45] to measure performance of our method. Specificity was defined as number of interface residues in predicted area/number of predicted area residues.

Sensitivity was defined as number of interface residues in predicted area/number of interface residues. A prediction was deemed a success if predicted area with over 50%

specificity [16]. The results were than used to calculate accuracy of success, and GEM tuned atomic and structure parameters of scoring function using global continuous search to find best solution of average specificity and success rate from 104 training proteins. Second, in order to prediction, for each protein, we generated surface residues and these residues could then be calculated their scores using the optimized parameters from above. Residues predicted to be part of the interface were according to the scores lower than a given threshold.

10 2.2 Training data set

Here we used 104 unbound protein chains according to 52 protein complexes selected from the protein-protein docking benchmark [53] which was a non-redundant benchmark for testing protein-protein docking algorithms. We discarded 7 complexes, which have large structure conformation change between unbound and bound structures, from the original benchmark. Our data set consisted of 22 (44 chains) enzyme-inhibitor complexes, 19 (38 chains) antibody-antigen complexes, and 11 complexes (22 chains) others. Table 1 list the training set using in this study.

Table 1. List of training data set

Complexes Receptor Receptor description Ligand Ligand description

Enzyme-inhibitor (22)

1ACB(E:I) 5CHA(A) α-Chymotrypsin 1CSE(I) Eglin C

1AVW(A:B) 2PTN Trypsin 1BA7(A) Soybean trypsin inhibitor

1BRC(E:I) 1BRA Trypsin 1AAP(A) APPI

1BRS(A:D) 1A2P(B) Barnase 1A19(A) Barstar

1CGI(E:I) 1CHG α-Chymotrypsinogen 1HPT Pancreatic secretory trypsin inhibitor

1CHO(E:I) 5CHA(A) α-Chymotrypsin 2OVO Ovomucoid 3rd domain

1CSE(E:I) 1SCD Subtilisin Carlsberg 1ACB(I) Eglin C

1DFJ(I:E) 2BNH Ribonuclease inhibitor 7RSA Ribonuclease A

IFSS(A:B) 2ACE(E) Snake venom acetylcholinesterase 1FSC Fasciculin II

1MAH(A:F) 1MAA(B) Mouse acetylcholinesterase 1FSC Fasciculin II

1TGS(Z:I) 2PTN Trypsinogen 1HPT Pancreatic secretory trypsin inhibitor

1UGH(E:I) 1AKZ Human Uracil-DNA glycosylase 1UGI(A) Inhibitor

2KAI(AB:I) 2PKA(XY) Kallikrein A 6PTI Trypsin inhibitor

2PTC(E:I) 2PTN β-Trypsin 6PTI Pancreatic trypsin inhibitor

2SIC(E:I) 1SUP Subtilisin BPN 3SSI Subtilisin inhibitor

2SNI(E:I) 1SUP Subtilisin Novo 2C12(I) Chymotrypsin inhibitor 2

1PPE(E:I) 2PTN Trypsin 1PPE(I) CMT-I

1STF(E:I) 1PPN Papain 1STF(I) Stefin B

1TAB(E:I) 2PTN Trypsin 1TAB(I) BBI

1UDI(E:I) 1UDH Virus Uracil-DNA glycosylase 1UDI(I) Inhibitor

2TEC(E:I) 1THM Thermitase 2TEC(I) Eglin C

4HTC(LH:I) 2HNT(LCEF) A-Thrombin 4HTC(I) Hirudin

Antibody-antigen (19)

1AHW(DE:F) 1FGN(LH) Antibody Fab 5G9 1BOY Tissue factor

1BVK(DE:F) 1BVL(LH) Antibody Hulysll Fv 3LZT Lysozyme

1DQJ(AB:C) 1DQQ(LH) Hyhel–63 Fab 3LZT Lysozyme

1MLC(AB:E) 1MLB(AB) IgG1 D44.1 Fab fragment 1LZA Lysozyme

1WEJ(LH:F) 1QBL(LH) IgG1 E8 Fab fragment 1HRC Cytochrome C

1BQL(LH:Y) 1BQL(LH) Hyhel-5 Fab 1DKJ Lysozyme

1EO8(LH:A) 1EO8(LH) Bh151 Fab 2VIU(A) Influenza virus hemagglutinin

1FBI(LH:X) 1FBI(LH) IgG1 Fab fragment 1HHL Lysozyme

1IAI(MI:LH) 1AIF(LH) IgG1 Idiotypic Fab 1IAI(LH) Igg2A anti-idiotypic Fab

1JHL(LH:A) 1JHL(LH) IgG1 Fv fragment 1GHL(A) Lysozyme

1KXQ(D:E) 1PIF(A) α-Amylase 1KXQ(E) Camelid AMD9 Vhh domain

1KXT(A:B) 1PIF(A) α-Amylase 1KXT(B) Camelid AMB7 Vhh domain

1KXV(A:C) 1PIF(A) α-Amylase 1KXV(C) Camelid AMD10 Vhh domain

1MEL(B:M) 1MEL(B) Vh single-domain antibody 1LZA Lysozyme

1NCA(LH:N) 1NCA(LH) Fab NC41 7NN9 Influenza virus neuraminidase

1NMB(LH:N) 1NMB(LH) Fab NC10 7NN9 Influenza virus neuraminidase

1QFU(LH:A) 1QFU(LH) Igg1-k Fab 2VIU(A) Influenza virus hemagglutinin

2JEL(LH:P) 2JEL(LH) Jel42 Fab fragment 1POH A06 phosphotransferase

2VIR(AB:C) 2VIR(AB) Igg1-lamda Fab 2VIU(A) Influenza virus hemagglutinin

Others (11)

1AVZ(B:C) 1AVV HIV-1 NEF 1SHF(A) FYN tyrosin kinase SH3 domain

1L0Y(A:B) 1BEC T-cell receptor βchain 1B1Z(A) Exotoxin A1

1WQ1(G:R) 1WER RAS activating domain 5P21 RAS

2MTA(LH:A) 2BBK(LH) Methylamine dehydrogenase 1AAN Amicyanin

2PCC(A:B) 1CCA Cytochrome C peroxidase 1YCC Iso-1-Cytochrome C

1A0O(A:B) 1CHN Che A 1A0O(B) Che Y

1ATN(A:D) 1ATN(A) Actin 3DNI Deoxyribonuclease I

1GLA(G:F) 1GLA(G) Glycerol kinase 1F3G GSF III

1IGC(LH:A) 1IGC(LH) IgG1 Fab fragment 1IGD ProteinG

1SPB(S:P) 1SUP Subtilisin 1SPB(P) Subtilisin prosegment

2BTF(A:P) 2BTF(A) β-Actin 1PNE Profilin

12 2.3 Atomic and structure parameters for protein interface prediction

In order to describe interface behavior clearly, we used atomic and structure parameters which combined attributes of hydrophobic properties and structure properties to achieve this purpose. The atomic parameters consisted of 10 hydrophobic parameters δ based on octanol/water transfer energies optimized for protein-protein binding [12, 13] and the structure parameters consisted of 8 secondary structure parametersσ based on DSSP [24] as described in Table 2. Training set of 104 high-resolution crystal structures was then used to train the overall parameters by GEM.

Table 2. Atomic parameters used for protein-protein interaction sites prediction

Atom type

δ

2^nd structure type

σ

C aliphatic -8.607 H 3.06

C aromatic -257.134 G 15.12

N uncharged 11.500 I -5.14

Nζ in Lys+ 295.247 B -13.74

Nη1,Nη2 in Arg+ 26.958 E 12.22

O hydroxyl 6.091 T 15.22

O carbonyl 1.619 S 18.18

O- in Glu, Asp 41.291 others 13.84

S in SH -281.919

S in Met -298.668

2.4 Definition of surface and interface

Surface and interface residues for the proteins were identified based on information in the PDB coordinate files. Briefly, solvent accessible surface areas (ASA) for each residue are calculated using DSSP [24]. A residue is defined to be a surface residue if its ASA is at least 25% of its nominal maximum area [54] as defined by Rost and Sander [55] . The distance-based definition of interface used here was that a surface residue is defined as an interface residue if its side-chain center is within 4.5Å of the side-chain center of a residue belonging to another chain in the complex. There is no indication that this criterion is better or worse than others [56].

Ofran et al. [49], for example, used a cut-off distance that if any of residue atoms <

6Å from any atom of the other protein. Sternberg and coworkers [57] set a cut-off distance of 8 Å between the C^α atoms.

14 2.5 Definition of different-sized surface patch

E ⁱ= +5.1 E ⁱ= -8.9 E ⁱ= -10.2

Figure 2. Flowchart of identifying different-sized surface patch

Figure 2 showed the procedures of identifying different-sized surface patch. A

patch was defined as a surface area decided by the radius size of a residue at a protein surface as a centre. Thus, a protein with 100 surface residues would have 100 surface patches. To ensure that we did not measure through the protein, the following procedure was followed. (A) A Cβ of every surface residue on the protein (a Cα for glycine) was used to calculate distances between all surface residues. The patch was grown from the single starting (seed) residue and subsequently used to generate series of surface patch. However, equal-area patches may poorly represent the real protein interaction sites that vary significantly in size. Therefore, (B) different-sized surface patches were generated by selecting all surface residues at different distances (d = 1, 2, …, 20Å) from a given seed [13]. (C) The final patch size of the seed was according to the lowest score of a surface patch. Scores of different-sized surface patches were calculated based on the atomic ASA of their component residues and secondary

structure (section 2.7). This process was iterated using the newly acquired residues, until the total number of residues in the patch was equal to the total number of residues in the surface.

2.6 Definition of specificity, sensitivity and success

To each predicted area, we apply specificity and sensitivity [45] to measure performance of our method. In this study, specificity was defined as number of interface residues in predicted area/number of predicted area residues. Sensitivity was defined as number of interface residues in predicted area/number of interface residues.

A prediction was deemed a success if predicted area with over 50% specificity [16].

16 2.7 Scoring function of surface residue

The desolvation properties of a given surface residue were calculated using an atomic ASA-based model [58, 59], and we combined new attribute of secondary structure properties to consider interface structure information.

The surface residue score S of residue k is calculated as ,

where m is total number residues of different-sized surface patch. δi is the atomic parameter for atom type i. σi is the structure parameter for 2nd structure type i. ASAi is solvent accessible surface area of atom i. The atomic parameters for the different atom types are shown in Table 2. The lowest score of different-sized surface patch is selected for the final score of residue k.

To assess how hydrophobic a surface residue is with respect to the whole protein, the Z score of its hydrophobic pocket is calculated as

γ

S Z

_k =

S

^k−

where

^S

is the average surface patch score of all of the residues in the protein surface, and

γ

is the standard deviation. Residues with Zk lower than a given threshold (Zcut) are taken as the hydrophobic pockets and we find Zcut = -1.5 has better performance.

2.8 Training procedure of GEM

GEM is a multi-operator approach that combines three mutation operators:

decreasing-based Gaussian mutation, self-adaptive Gaussian mutation, and self-adaptive Cauchy mutation. It incorporates family competition and adaptive rules for controlling step sizes to construct the relationship among these three operators. To balance the search power of exploration and exploitation, each of operators is designed to compensate for the disadvantages of the other. The details of GEM were described as previous works [21, 23], and had been successfully applied for some specific problems, such as protein-ligand docking, drug screening, and protein side-chain prediction [18-20, 22, 23].

Here, we provide an outline of our GEM for predicting protein-protein interaction sites, which can be represented by adjustable variables of atomic and structure parameters (Table 2) as

where δ is the atomic parameter and σ is the structure parameter of surface residue scoring function. The values of parameters are then used in the surface residue scoring function and predicting results are presented as specificity and success rate. In order to determine that the performance of adjustable parameters, we use a fitness function which combines specificity and success rate for GEM training. In this work, we use GEM to look for the most suitable atomic and structure parameters for identifying protein interfaces by minimizing a fitness function which is described as fellow

,

where N is total number of training proteins. ψp is the specificity of predicting results of the protein p based on the training values of atomic parameters. In order to raise success rate of prediction, we add information of success rate to our fitness function

18

(ωp). The value of ωp is depend on ψp. When ψp is over 0.5 or equal to 0, than ωp

would be set to -1 or 0.5, respectively. If ψp is lower than 0.5 and not equal to 0, then

ω

p would be set to 0.

Generally, the method use global continuous-search mechanisms based on Gaussian mutations. And the steps involved are as follows:

1. Initialize the atomic and structure parameters of surface residue scoring function.

The initial values for the parameters are selected from the feasible region (-300, 300). Repeat this N times to generate the initial population of N parameters for a surface residue scoring function. Evaluate the objective value of each parameters based on the fitness function.

2. Change the value of atomic and structure parameters by genetic operators to generate offspring. Evaluate the objective values of the offspring.

3. Use selection operators to select N solutions from the atomic and structure parameters of both parent and offspring solutions.

4. Repeat steps 2 and 3 until one of the terminating conditions is satisfied.

The GEM parameters used in this paper are listed in Table 3 such as population size, initial step sizes of Gaussian mutations, recombination probability, and family competition length in this work. The GEM optimization stops when either the convergence is below certain threshold value or the iterations exceed a maximal preset value which was set to 200. These parameters were selected after many attempts to predict interaction sites for test proteins with various initial values.

Table 3. Gaussian Evolutionary Method parameters

Parameter Value

Population size 200

Step size of Gaussian mutations ν = 0.2 and λ= 0.8 (in radius) Recombination probability 0.2

Family competition length L = 3 Number of maximum generations 200

3 Results

3.1 Training results

The results of the GEM method for the training set are summarized in Table 4.

GEM is able to predict the location of the interface on 65.4% (68/104) proteins in the training dataset. The average specificity and sensitivity are 57.8% and 27.8%, respectively. If we only train atomic parameters by GEM and use optimized atomic parameters to predict training set, the success of prediction is decreasing to 51.0% and the average specificity and sensitivity are 44.9% and 27.6%, respectively. In addition, if we used atomic parameters based on Fernandez-Recio et al. [13] without any optimization, the performance of prediction is the worse. Combination of physical–chemical properties and using computational methods to assist the finding of best parameters are useful for predicting protein-protein interaction sites.

Table 4. Summary of training results from 104 unbound proteins

Atomic and 2^nd parameters Success Specificity Sensitivity Success Specificity Sensitivity Success Specificity Sensitivity

Enzyme

The parameters which are optimized by GEM

The parameters which are based on Fernandez-Recio et al.[13]

20 3.2 Testing results

The overall accuracy of GEM in predicting the protein-protein interaction sites of 50 test proteins is shown in Table 5. In order to test our performance of predicting protein-protein interaction sites, we tested our parameters against a accompanying paper of Fernandez-Recio et al. [13] and found that our Atomic Parameters had performed at least as well as their Atomic Solvation Parameters. The results of this test are summarized in Table 5, in which it can be seen that our method can predict 98% proteins among whole testing set and have 42.3% average specificity, better than Fernandez-Recio’s results which can predict 60% protein among whole testing set and have 37.8% average specificity.

Table 5. Prediction specificities of 50 unbound proteins PDB GEM Fernandez

Recio PDB GEM Fernandez

Recio PDB GEM Fernandez Recio

a*. — means that there are no results of prediction

3.3 Rigid-body protein-protein docking using GEMDOCK

(A) 1AVW (A:B) (B) 1FBI (LH:X) (C) 2BTF (A:P)

Figure 3. Good test cases for GEMDOCK. Hits within 2.0 Å RMSD were found

for (a) 1AVW, (b) 1FBI, (c) 2BTF. The bound receptor surface is shown. The best ranked hit is shown in blue, the original bound ligand is shown in red.

We have modified GEMDOCK for rigid-body protein-protein docking and using original empirical scoring function which works well in protein-ligand docking. The former combines both discrete and continuous global search strategies with local search strategies to speed up convergence, whereas the latter results in rapid recognition of possible protein-protein interacting conformations. We have tested on 52 bound protein complexes which are used in our training set and the results are listed on Table 6. The results show that modified GEMDOCK predicts 3 times for each complex and the performance of enzyme-inhibitor (50%) better than antibody-antigen (11%) and others (27%). Figure 3 shows that modified GEMDOCK could give us confident binding conformations in some good test cases, and RMSD of these cases are smaller than 2Å. However, the overall performance is not satisfied

22

since scoring function using here is for protein-ligand docking, fortunately, the search strategies of GEMDOCK is work for protein-protein docking. In the future, we will improve scoring function of GEMDOCK for protein-protein docking and develop soft-body protein-protein docking strategies for solving unbound-unbound protein docking problems.

Table 6. Results of protein-protein docking using GEMDOCK

Complexes ∆ASA

(Å

) R2Å

R5Å

Best RMSD

(Å)

1WEJ(LH:F) 1180 0 0 49.428

a∆ASA: change in accessible surface area (ASA) on complex formation was calculated, by using the program NACCESS.[53]

bR2Å: Number of predictions with RMSD smaller than 2 Å among 3 rounds

cR5Å: Number of predictions with RMSD smaller than 5 Å among 3 rounds

dBest RMSD: The smallest RMSD among 3 rounds

24 4 Discussion

Figure 4 shows six examples of the prediction outcome of the training set (figure 4a, 4b and 4c) and testing set (figure 4d, 4e and 4f). Predicted interface and

non-interface residues, identified by the GEM, are shown as color coded patches as follows: Red spheres = true positives (TP), actual interface residues that are predicted as such; Blue strands = true negatives (TN), non-interface residues that are predicted as such; Yellow spheres = false negatives (FN), interface residues that are misclassified as non-interface residues; Green spheres = false positives (FP), non-interface residues that are misclassified as interface residues. From the figure 4, one clearly sees that not all the interface was predicted, but that the predicted part fits the interface well.

(a) (b) (c)

(d) (e) (f)

True Positives False Positives False Negatives True Negatives

Figure 4. Prediction results of training set and testing set. The partner molecule(s)

in the bound conformation after superimposition of the corresponding molecule in the complex is represented in ribbon. (a) Prediction on 1dqj_r of the Hyhel-63 Fab, (b) prediction on 1dfj_r of the ribonuclease inhibitor, (c) prediction on 1acb_r of the α-Chymotrypsin, (d) prediction on 2cpl of the Cyclophilin a, (e) prediction on 1ctm of the Cytochrome f, (f) prediction on 1a19A of the Barstar.

(c) (d)

(a) (b)

Figure 5. Case study of limitations. (a) 1ahw_l, green : prediction area, yellow :

interface, blue : others, red block : fibronectin type III modules; (b) The target protein 1wej_l is shown in ribbons, green : prediction area, yellow : interface, blue : others, purple : heme; (c) pink : 1noc A chain, green : 1noc B chain and grey : 1nos; (d) 1pco, green : prediction area, yellow : interface, blue : others; and 1eth A chain (ribbon with pink color).

There are some limitations in the current implementation of the method. Figure 5 shows the limitations in the performance between the training set (figure 5a and figure

5b) and testing set (figure 5c and figure 5d). Figure 5a shows the structure of 1ahw_l.

Although our prediction area is far from the interface, this structure consists of two fibronectin type III modules whose hydrophobic cores merge in the domain-domain interface and our prediction is almost invariably symmetrical. Figure 5b shows the structure of 1wej_l. The prediction of our method is located nearby heme propionate, this result may due to the residues nearby the heme are more hydrophobic than protein-protein interaction site. Figure 5c shows the structures of bound protein complex : 1noc A chain and B chain and unbound protein : 1nos. After structure

26

alignment of 1noc A chain and 1nos, unfortunately, contact residues between 1nos and 1noc B chain are less than the bound protein complex, and it is difficult for our method to identify interaction site of 1nos. Figure 5d shows the structure of 1pco and 1eth A chain. The surface of 1eth A chain (colipase) can be divided into a rather hydrophilic part, interacting with 1pco (lipase), and a more hydrophobic part, formed by the tips of the fingers [60]. This suggests that interface of 1pco is more hydrophilic than the surface, and our method do not prove to be very useful in this case.

5 Conclusion

We have developed a method for predicting protein-protein binding sites using GEM. To train the GEM and to test the prediction method we collected dataset of 104 unbound proteins—the nonredundant benchmark for testing protein-protein docking algorithms. We were able to successfully predict the location of the binding site on 65.4% of the 104 proteins in training set. In addition, we tested GEM to predict 50 unbound proteins and had 46% successfully prediction in testing set. The performance were achieved using only 18 attributes so prediction results should be improved when more properties that distinguish between interfaces and the rest of the protein surface become available.

This method can be further improved on several aspects. First, we notice that hydrophilic effect may be main force of protein-protein interaction in some cases (figure 5d), and our predictions are poor. This is due to the fact that most interfaces of training set are hydrophobic and our parameters perform this characteristic faithfully.

Therefore, it may be useful to classify interfaces of training set according to

在文檔中以高斯演化方式預測蛋白質-蛋白質嵌合位置 (頁 17-0)