Chapter 4 Quantitative Structure Activity Relationships
4.1 Analysis of Protein-ligand Complexes to Predict Binding Affinity
One of the important issues in computer-aided drug discovery is to predict the binding affinity between a compound and a target protein. Compounds with high binding affinities to a target protein are often considered as the potential inhibitors that slow or block physiological, chemical, or enzymatic actions of this protein. Binding affinities are usually determined by some experiments such as microcalorimetry 116; 117, ELISA assays 118, NMR spectrometry 119, and surface plasmon resonance 120. The high-throughput screening method 121 is used to detect the binding affinities of a large number of compounds to identify lead compounds for a target protein. However, these experiment methods are often labor-intensive, time-consuming, and expensive. Therefore, many computational methods have been proposed to discover the lead compounds by predicting the binding affinities between compounds and the target proteins.
111
The structure-based virtual screening is one of the computational methods to identify lead compounds for a specific receptor from thousands of compounds. Each compound in the database is docked into a target protein and assigned a binding score for measuring binding affinities based on scoring functions. The scoring methods for virtual screening should effectively discriminate between correct binding states and non-native docked conformations during the molecular docking phase and distinguish a small number of active compounds from hundreds of thousands of non-active compounds during the post-docking analysis. The compounds with higher scores are considered as potential compounds and selected for biological assay which is the measurement of the pharmacological activity of a compound. The scoring functions that calculate the binding free energy mainly include knowledge-based 57, physics-based 58, and empirical 59 scoring functions.
Physics-based energy functions are based on physical mechanisms and often derived from ab initio quantum-mechanical calculations according to the principles of physics. One advantage of physics-based energy functions is the lucid physical meaning of each individual term, but physics-based energy functions often requires the high-computation cost and their energy landscapes are often very rugged. In general, an empirical scoring function has simplified energy terms based on physical mechanisms. Knowledge-based scoring functions are derived from energy-like functions by considering the distributions of inter-atomic distances in a set of crystal structures of protein–ligand complexes.
In practice, the performance of a scoring function is limited by our incomplete understanding of the complex issues involved in chemical interactions. The inaccuracy of the scoring methods, i.e., inadequately predicting the true binding affinity of a ligand for a receptor, is probably the major weakness for virtual screening. Some knowledge-base scorings, such as X-SCORE 122, ChemScore113, DrugScore 57, and PLD 123, applied regression techniques to predict binding affinities by deriving from a set of protein-ligand complexes with experimental binding affinities. The deficiency of current scoring functions is mainly due to the inadequate descriptions of the interactions between ligands and proteins 25. For example, most scoring functions use simple model to handle metal-ligand and water-ligand interactions and consider hydrogen-bonding interactions as the same even if some interactions are essential for chemical reactions.
Here, we address these issues by deriving 87 descriptors from 891 protein-ligand complexes selected from PDB according to five dimensions, including protein-ligand interactions, structural and physicochemical descriptors of a compound, protein binding site evolution, metal-ligand interactions, and water effects. Based on these 87 descriptors, we applied a stepwise regression method to select top five descriptors, which are highly correlated to experimental affinities, for developing the GemAffinity. The GemAffinity was then used to predict binding affinities on an independent set and it outperforms 12 comparative scoring functions on this set. For post-screening analysis, we applied The GemAffinity to score the docked complexes generated by GEMDOCK 2, which is well-developed molecular docking tool, on four targets. The GemAffinity is able to enrich the
112
prediction accuracy of GEMDOCK on these four targets.
4.1.2 Materials and Methods Overview
Figure 4.1.1 shows the overview of deriving the descriptors and developing GemAffinity for
predicting binding affinities. With the aim to identify descriptors for the binding process in the molecular recognition, we first selected the 989 protein-ligand complexes with experimental binding affinities from PDB. We derived 87 descriptors from protein functions and evolution, ligand structures and physicochemical aspects, and molecular interactions. Furthermore, a stepwise regression was applied to find the relationship between descriptors and binding affinities. Some key selected descriptors, which significantly reflect the experimental binding affinities, are used to develop GemAffinity for the prediction of binding affinities. Finally, the GemAffinity was evaluated on an independent testing set and post-screening analysis on four target proteins.113 Protein-ligand interactions
Water effects
Protein binging-site evolutionary property
QSAR-based compound properties
Zinc
Metal-ligand bonding
Generating descriptors
Analyzing important descriptors by stepwise regression
Developing GemAffinity
Evaluation on
•Binding affinity prediction
•Virtual screening Collecting
complexes with experimental binding affinities
PDBbind
Figure 4.1.1. The overview of analyzing protein-ligand complexes to develop the GemAffinity for
predicting binding affinities.114
Protein-ligand complex dataset
We collected 1,091 protein-ligand complexes with experimental binding affinities from the PDBbind124. Among these 1091 complexes, 989 complexes are selected by excluding 80 complexes (missing atoms in binding sites) and 22 complexes (no HETATM records of water atoms). The descriptors of the excluded complexes are incorrect due to the miss data and may bias the results.
The remaining complexes were randomly divided into a training set (891 complexes) and a testing set (98 complexes). For each complex, we used the negative logarithm of Kd or Ki (i.e. -log Kd or -log
K
i) as its binding affinity.Virtual screening dataset
A widely used approach to test scoring functions for the virtual screening is to rank the docked poses generated by docking tools to identify inhibitors (substrates) for a target receptor from thousands of compounds. We selected four target proteins with ~1000 compounds to evaluate the GemAffinity performance and to compare it with other methods. These four target proteins include thymidine kinase (TK), estrogen receptor antagonist (ER), estrogen receptor agonist (ERA), and human carbonic anhydrase II (HCAII). The receptors for these screens cover different receptor types and therefore provide a reasonable test of scoring functions. Four complexes of the target proteins were selected for virtual screening from the PDB: TK complex (PDB code: 1kim 14), ER-antagonist complex (PDB code: 3ert 71), ER-agonist complex (PDB code: 1gwr 125), and HCAII complex (PDB code:1cil 126). These complexes were reasonable choices because their ligand-binding cavities are wide enough to accommodate a broad variety of ligands and therefore did not require binding site modifications.
For each target, the screening compound data set proposed by Bissantz et al.127 consists of 990 random compounds. In addition, the set includes 10 known active compounds 2 for TK, ER, and ERA; and 20 known active compounds 126; 128; 129; 130; 131 for HCAII target. The active compound set of each target protein, target proteins, and 990 random compounds are available on the Web at
http://gemdock.life.nctu.edu.tw/dock/download.php.
Descriptors for binding affinity
We derived 87 descriptors from 891 complexes described in the protein-ligand complex dataset according to five dimensions: protein-ligand interactions, protein binging-site conserved properties, QSAR-based (quantitative structure activity relationship) compound properties, water effects, and metal-ligand bonding. These dimensions are described as follows:
115
Protein-ligand interactions: We considered three protein-ligand interaction types, including van der
Waals interactions, electrostatic interactions, and hydrogen-bonding interactions. The van der Waals force consists of a piece-wise linear potential (PLP), generated by GEMDOCK 2 and Lennard-Jones potential. The hydrogen-bonding force includes a PLP 2 and a non-linear potential generated by AutoDock 4. The electrostatic force was measured by the GEMDOCK. In addition, we also regarded the numbers of protein-ligand interactions within different cutoffs for each force type. For van der Waals force, the distance of an atomic pair between a protein and its ligand was divided into 10 cutoffs which are 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, and 7.0 Å. For hydrogen-bonding and electrostatic interactions, the distance was divided into 10 cutoffs, including 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, and 3.4 Å. In total, there are 35 descriptors for protein-ligand interactions.Protein binging-site evolutionary property: Conserved residues in a binding site of a protein often
highly correlated to the biological functions. For example, the catalytic charged residues of a protein are often conserved across many species. They polarize substrates to stabilize transition states 132. If these conserved residues mutate, the protein may lose their functions or execute different biological functions. Based on these observations, we used the number of highly conserved residues, which form hydrogen bonds between a protein and its ligand, as the descriptor of protein binging-site evolutionary property in the binding site.Here, we developed a conserved score method, derived from our previous work on protein-protein interactions 133, to measure the conversation degree of a residue in the binding site.
The amino acid sequence of a protein-ligand complex was first subjected to PSI-BLAST 42 by searching on UniRef90 134. The E-value cutoff of PSI-BLAST was set to 10-5 and the number of the iterations was 3. Based on the position specific scoring matrix (PSSM) automatically generated by PSI-BLAST, the conserved score of each residue was defined as
C
i= M
ir — Krr (4.1.1)where Mir is the value in the PSSM for the residue type r at the position i, and Krr is the diagonal value of BLOSUM62135 for the residue type r. The residue-conserved descriptor was defined as
∑
protein and its bound ligand. H is the number of highly conserved residues with hydrogen bonds of a protein. Among 891 complexes in the training set, 571 complexes consist of highly conserved residues forming hydrogen bonds between proteins and their bound ligands.We used HCAII (PDB code: 1cil 126) as an example to describe the steps of deriving the highly conserved residues forming hydrogen bonds with its bound ligand (Figure 4.1.2). First, the multiple
116
sequence alignment (MSA) of the query sequence and the homologous sequences are derived by PSI-BLAST (Figure 4.1.2a). Based on the MSA and Equation 4.1.1, conserved score of each residue can be calculated. Here, the residue is considered as a highly conserved residue (e.g. GLN92, HIS94, HIS96, HIS119, and THR199) when its score is greater than or equal to 1.0 (Figure 4.1.2b). Finally, we count the number of highly conserved residues which form hydrogen bonds with bound ligand. In this example, the number of highly conserved residues with hydrogen bonds is 5.
HIS119 Score: 1
HIS96 Score: 2
HIS94 Score: 2
GLN92 Score: 2
THR199 Score: 2
Ligand Zinc
92
conserved
score 2
ratio
(a) (b)
94 96 119 199
2 2 1 2
91
-4
HCAII
Figure 4.1.2. An example of describing metal-ligand bonds and highly conserved residues forming
hydrogen bonds. (a) The multiple sequence alignment of the query sequence, human carbonic anhydrase II (HCAII, PDB code: 1cil), and its homologous sequences by PSI-BLAST. The E-value cutoff of PSI-BLAST was set to 10-5 and the number of the iterations was 3. The conserved scores of part residues are given in the bottom. (b) The residues that are highly conserved and form hydrogen bonds with the ligand are GLN92, HIS94, HIS96, HIS119, and THR199. In this example, the number of highly conserved residues with hydrogen bonds is 5. In addition, the ligand also forms a metal-ligand bond (red line) with the Zinc (the bond length is 1.96 angstrom). This metal-ligand bond is one of the primary binding interactions in HCAII.QSAR-based compound properties: The QSAR methods
53; 136 demonstrated that physical and biological properties of a ligand are useful for predicting binding affinities of a group similar compounds. We used the QSAR module of the Cerius2 to generate 26 compound descriptors, such as spatial, conformational, electronic, structural, and thermodynamic terms.117
Water effects: In general, a compound should be solvated before it binds to its receptor. The water
effects play a crucial role in mediating the interactions between proteins and their bound ligands 137. The interactions between the bound ligand and structural waters in a complex were used to measure the water effects. We calculated the number of structural water molecules which are within a specific distance cutoff around a ligand. The distance bins are classified into 12 cutoffs, including 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, and 3.6 Å.Metal-ligand bonding: The metal atoms in an active site often play the key role for stabilizing
ligands and reaction catalysis. However, many scoring functions consider the interaction between metal atoms and ligands as a simple hydrogen-bonding interaction. In this study, we separated metal-ligand bonding interactions from hydrogen-bonding interactions by considering the metal atoms which are within a specific distance from a ligand. In the 891complexes, 99 and 152 complexes have metal atoms which are within 2.2 Å and 3.6 Å distance from ligand, respectively.Figure 4.1.3 shows that most of metal-ligand bonding distances were less than 2.8 angstrom. It meant
that the metal-ligand bonding distances were almost shorter than the normal hydrogen bonding distances. The metal-ligand bonding interaction is 1 if a metal atom is less than a specific distance;conversely, the value is 0. Here, we considered 13 distance cutoffs which are 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, and 3.4 Å. In the HCAII, the ligand forms a metal-ligand bond (red line) with the Zinc and the bond length is 1.96 (angstrom)Å (Figure 4.1.2b). This metal-ligand bond is one of the primary binding interactions in HCAII 126.
1.6 1.8 2 2.2 2.4 2.6 2.8 3 3.2 3.4 3.6
0 5 10 15 20 25
Metal-ligand Bonding Distance (A)
Complex Number
Figure 4.1.3. The distribution of metal-ligand bonding distances in the 891 complexes in the training
set.118
Stepwise regression analysis
The stepwise regression 138 method was used to select descriptors which are highly correlated to protein-ligand binding affinities one by one in the order of descending importance. Here, we employed the simple stepwise regression method to develop the scoring methods. This method is often to avoid the ill effects (such as over fitting and the loss of biological/physical/chemical meaning.) of the machine learning approaches, such as the support vector machines, genetic algorithms, and neural networks. This model first selected the descriptor with the highest Pearson’s correlation coefficient into the predicted model. The model sequentially added the other descriptor which is able to improve the correlation between predicted binding affinities derived by the selected descriptors and experimental binding affinities. The process was stopped if the improved correlation of the added one was less than 0.005 to reduce ill-effect of the overfitting data. Finally, the model selects five descriptors for the GemAffinity to predict binding affinities (Table 4.1.1).
Table 4.1.1. The selected descriptors in the new scoring function
Selected
order Name r a Descriptors
1 fvdw 0.497 Sum of Lennard-Jones potential between a protein and a ligand 2 VMetal 0.544 The distance between metal irons and a ligand is less than 2.2 Å 3 NrotBon 0.579 Number of rotatable bonds of a ligand
4 Nwater 0.594 Number of structural waters which are near to a ligand within 3.6 Å 5 NconHB 0.599 Number of highly conserved residues forming hydrogen bonds between
proteins and ligands
a the correlation between the predicted binding affinities by stepwise regression models and experimental binding affinities.
Application on virtual screening
A scoring method should be able to effectively discriminate between correct binding states and non-native docked conformations during the molecular docking phase. For virtual screening, it is important for scoring functions to identify a small number of active compounds from hundreds of thousands of non-active compounds during the post-docking analysis. Here, we evaluated the GemAffinity to score docked protein-ligand complexes generated by the GEMDOCK on four targets (i.e. TK, ER, ERA, and HCAII). Our previous works show that the GEMDOCK is comparative to some approaches (e.g. GOLD 5 and FlexX 139) for molecular docking 2 and virtual screening 9; 10.
119
In this study, the GEMDOCK parameters in the flexible docking included the initial step sizes (
σ
=0.8 andψ
=0.2), family competition length (L = 2), population size (N = 200), and recombination probability (pc= 0.3). For each ligand screening, GEMDOCK optimization stopped either when the convergence was below a certain threshold value or the iterations exceeded the maximal preset value of 60. Therefore, the GEMDOCK generated 800 solutions in one generation and terminated after it exhausted 48,000 solutions for each docked ligand. Standard parameters of the GOLD 5 program and its scoring function (i.e. GoldScore) were used in this study. For each of the 10 genetic algorithm (GA) runs, a maximum number of 10,000 operations were performed on a population of 50 individuals. The maximum distance between hydrogen donors and fitting points was set to 2 Å, and nonbonded van der Waals (vdW) energy was cut off at 4.0 Å. To further speed up the calculation, the GA docking was stopped when the top three solutions were within 1.5 Å rmsd of each other. These parameters are chosen according to the standard default settings recommended by the authors for virtual screening.4.1.3 Results and Discussion Scoring function
The stepwise regression method selects top five descriptors for predicting binding affinities (Table 4.1.1). These descriptors are the sum of Lennard-Jones potential (fvdw), the number of metal-ligand bonds (VMetal), the number of rotatable bonds of a ligand (NrotBond), the number of structural waters which are near to a ligand within 3.6 Å (Nwater), and the number of highly conserved residues forming hydrogen bonds between ligands and proteins (NconHB). The coefficient of each descriptor and the GemAffinity is given as
−log(Kd,pred
)= −0.072f
vdw+ 1.168V
Metal − 0.078NrotBond −0.039Nwater+ 0.102N
conHB+3.430
The selected orders of these five descriptors are correlated to the accuracy orders (e.g. 0.497, 0.544, and 0.579 shown in Table 4.1.1), which are the Pearson’s correlation coefficient between predicted binding affinities and experimental binding affinities. For example, Pearson’s correlation coefficients are 0.497 and 0.544 using only the first descriptor (i.e. fvdw) and the first two descriptors (i.e. fvdw and
V
Metal). Please note the values of this scoring function and −log(Kd,pred) are positive to present a high
binding affinity. In general, fvdwis negative and its coefficient (i.e. −0.072) is negative; therefore, the value −0.072fvdw is positive and positively contributes to binding affinities of compounds. VMetal andN
conHB are positive and their respective coefficients are also positive values which positively contribute to the binding affinities. On the other hand, Nwaterand N
rotBond are positive but their respective coefficients are negative. These two terms negatively contribute to binding affinities.120
Selected descriptors
The Lennard Jones 12-6 potential was selected first and its Pearson’s correlation coefficient is 0.497 (Table 4.1.1). This result shows that the complementary shape between proteins and their ligands is critical for protein-ligand binding affinities. In addition, Pearson’s correlation coefficient between van der Waals forces (i.e. PLP) of the GEMDOCK and the experimental binding affinities is 0.48 on 891 complexes in the training set. Lennard Jones potential function has various parameters of atomic pairs, and the GEMDOCK treats all atomic pairs as the same. This difference may be one of reasons why the performance of Lennard-Jones potential function is lightly better than the one of the GEMDOCK.
The energy function, using the PLP potential to soften the repulsive term of Lennard–Jones potential, of GEMDOCK has a good performance in flexible protein-ligand docking. The short range repulsive interactions (e.g. Lennard–Jones potential) tend to infinity at low interatomic separation leading to rough energy surfaces with high energy barriers. A soft scoring function (e.g. PLP potential used in GEMDOCK) has been applied for softening the repulsive intermolecular potential to decrease the strong sensitivity of interaction energies to local conformation changes. Generally, a soft scoring function has the benefit of being computationally efficient, conversely, it may increase the number of false near-native solutions (structures). The tradeoff of its advantages and limitations can be optimized.
The metal-ligand bonding is the second selected term. The metal-ligand bonding is a metal-ligand interaction with the distance cutoff less than 2.2 angstrom in this study (Figure 4.1.3). A metal-ligand bond is often a strong bonding interaction and is able to stabilize the interactive conformations between a protein and its compound (Figure 4.1.4). In our GemAffinity, we highlighted and considered the metal-ligand bond as a special force because its distance cutoff (i.e.
2.2 angstrom) is much shorter than the general hydrogen-bonding distance (i.e. 2.8 angstrom). This scoring function discriminates between metal-ligand interactions and hydrogen bonds to avoid energy rises drastically when the distance of a metal-ligand bond is shorter than 2.2 angstrom. Figure
4.1.4 shows the effect of the metal-ligand bonds in the cytidine deaminase protein (PDB code 1ctu
and 1ctt 140). The binding affinity of the complex with the metal-ligand bond (1ctu) between zinc2+and the bound ligand is 11.92. On the other hand, the binding affinity of the complex (1ctt) without metal-ligand bond (1ctt) is 4.52. The only change of the binding affinities of these two complexes is due to the loss of the metal-ligand bond 140.
121
GLU104 CYS129
GLU91
ASN89 ALA103 Ligand
GLU104 CYS129
GLU91
ASN89 ALA103 Ligand
Zinc Zinc
1ctu 1ctt
Figure 4.1.4. The effect of the metal-ligand bonds. In the cytidine deaminase protein (PDB code 1ctu
and 1ctt), the binding affinities of the complexes with the metal-ligand bond (1ctu) and without metal-ligand bond (1ctt) are 11.92 and 4.52, respectively. The change of the binding affinities of these two complexes is due to the loss of the metal-ligand bond.The third selected descriptor was the number of rotatable bonds of a ligand. This descriptor is important for molecular mechanics because the freedom of rotatable bonds become lower during the binding process. The number of rotatable bonds gave a measure of the unfavorable torsional entropy loss upon protein-ligand binding affinities. The fourth selected descriptor (i.e. the number of structural water molecules within 3.6 Å distance from the bound ligand) was the water effects. A large amount of water molecules around a ligand within 3.6 angstrom imply that the large volume of the ligand is exposed to the solvent. A ligand buried deeply inside a protein could have larger
The third selected descriptor was the number of rotatable bonds of a ligand. This descriptor is important for molecular mechanics because the freedom of rotatable bonds become lower during the binding process. The number of rotatable bonds gave a measure of the unfavorable torsional entropy loss upon protein-ligand binding affinities. The fourth selected descriptor (i.e. the number of structural water molecules within 3.6 Å distance from the bound ligand) was the water effects. A large amount of water molecules around a ligand within 3.6 angstrom imply that the large volume of the ligand is exposed to the solvent. A ligand buried deeply inside a protein could have larger