CHAPTER 3. The Relevance of Protein-Ligand Interaction Profiles in Computer-Aided
3.3 Summary
Methods of post screening analysis that enhance virtual screening enrichment and retrieve target compounds more accurately are of great use and interest in current bioinformatics. In this review we summarized and compared methods of VS and post screening analysis of lead compounds which emphasize the relevance of interaction profiles in mining suitable candidates.
22
SIFt (structural interaction fingerprint) is one of the pioneer methods in post screening analysis to include interaction-specific information into the real number strings. This enables the visualization, organization, analysis and retrieval of structures containing key interactions or specific features. A combination of SIFt and ChemScore (an empirical scoring function) contributed to a modest increase in the enrichment factor (EF) which was calculated based on the ability to recover known inhibitors. The enrichment increased from 37.0 EFa (SIFt) to 42. 3 EFa (SIFt + ChemScore) [23].
VISCANA (Visualized Cluster Analysis of Protein-Ligand Interaction) uses a different approach through the FMO method. It has the advantage of describing the charge-transfer between a receptor and a ligand in comparison to a conventional force field method using fixed atomic charges. The difference between VISCANA and other conventional screening methods is that most methods choose the higher rank of a docking score on a point. In VISCANA a compound with a low docking score may belong to the same cluster that contains active compounds and the compound could be a suitable candidate. However, Amari et al. affirmed in their study VISCANA needs further development of quantum mechanical methods (the second-order Møller-Plesset perturbation theory based on the FMO method) to obtain more reliable descriptions of van der Walls interactions and hydrogen bonds which are important in determining receptor-ligand binding [22]. Other post screening studies reveal that unreliable or insufficient descriptions of important interactions account for increased numbers of false positives [48].
iGEMDOCK, an integration of VS and post screening methods is based on the original evolutionary docking algorithm GEMDOCK, currently one of the pioneer methods used for combining VS with visualizing, organizing, analysing and data mining of lead compounds. It has an advantage over SIFt and VISCANA primarily due to the attempt of eliminating two key issues: 1) if a docking tool is used for VS, which post screening analysis can complement it best and 2) if a post screening analysis method is decided, which docking tool or VS method is most suitable. The difference in the post screening approach of iGEMDOCK and other methods (VISCANA and SIFt) is the use of a module which clusters compounds based on interaction profiles and atomic compositions. Selecting representative compounds from each cluster enables the maintaining of compound diversity and reduces the number of false positives. In addition, its pharmacological scoring function can reduce the ill-effect of energy-based scoring functions
23
which often favor high molecular weight or highly-polar compounds. This improves the screening accuracy when the molecular weights of the active compounds are less than 400 Daltons (Da) [52]. Most notably, GEMDOCK, the earlier version of iGEMDOCK was used successfully to screen and identify inhibitors for influenza virus neuraminidases and flaviviruses [6, 8].
We also emphasize on the use of VS and post screening analysis in the mining of novel compounds for various other applications (e.g. industry, agriculture, cosmetics and nutritional supplements). These areas have not been getting much attention in comparison to drug design whereas certain protein-ligand complexes constitute key compounds in developing various biochemical products [29-31]. VS and post screening analysis used in computer-aided drug design reveal great potential in such applications since prospect candidates used in cosmetics and other industries may be retrieved employing interaction profiles.
Although the methods investigated in this study, SIFt, VISCANA and iGEMDOCK employ different techniques (structural interaction fingerprint, ab initio FMO method and interaction energy modules) they have one common feature; the use of protein-ligand interaction profiles which can be further exploited in developing new and improved methods to retrieve and analyze potential candidates for drug design and other applications. Through the development of better techniques, measures and description of interaction energies can aid methods of novel compounds retrieval and analysis, improve in accuracy and selection of active compounds. In addition, these observations point to an important aspect in the computer-aided drug design and discovery, the necessity for more than one stage of clustering in post screening analysis. From this point we proceeded with developing our new method Two-Stage Combinative Clustering (TSCC) [48] which combines our specifically optimized docking tool (GEMDOCK) with two stages of clustering for an optimized post screening analysis.
24
CHAPTER 4
TSCC: Two-Stage Combinative Clustering for Virtual Screening Using Protein-ligand Interactions and Physical-Chemical Features
4.1 Introduction
Continuous advancements in high-throughput X-ray crystallography and genomics [2, 28]
account for increased numbers of available crystal structures enabling a more rapid development of new therapeutic targets. However, prospect ligands and proteins need to be screened in order to downsize groups [22, 23, 53] and select suitable candidates for post-screening analysis.
Clustering methods based on structural similarity which are employed in post-screening analysis generally improve the scoring function performance. In developing methods for 3D compound retrieval, a detailed understanding of intermolecular interactions between proteins and their ligands is critical to structure-based inhibitor design. Various post-screening analysis methods and clustering [23, 54-56] employ RMSD values, protein-ligand interactions and computation and comparison platforms for measuring distances. Since the above methods as well as TSCC encounter challenges of specific selectivity and false positives, we aim to provide advantages to our post screening analysis method by using two combined clustering stages to rank all compounds and select final representatives more efficiently and accurately. The final representatives can be confirmed through bioassays to verify their target and the proper activity and application.
Although similar methods (IBAC, SIFt and VISCANA) [21-23] have used visualization and clustering of compounds to enrich VS, they have not identified novel compounds for any practical applications (drug design or industrial purposes). In addition, with the use of such methods one main issue remains unsolved: which combination of VS and post screening analysis is the most efficient. Our goal is to provide a more efficient method for post screening analysis to identify novel compounds, their possible functions and practical applications. Thus, we employ the empirical energy function from GEMDOCK [26] and the basic premise of SIFt [23] to encode additional interaction-specific information into the real number strings, hydrogen bonds, van der Waal and electrostatic forces. By representing interactions at the atomic-level as opposed to the residue level and including measures of interactions strength, protein-ligand interactions can be described better and a more precise analysis of virtual screening can be obtained.
25
TSCC is accomplished by joining two clustering stages; one of protein-ligand interactions (e.g. hydrogen bonds, electrostatic interactions, and van der Waals forces) with another of physico-chemical features (e.g. atom composition). We employed our docking tool, GEMDOCK, to generate protein-ligand interactions and used the Accelrys CeriusQSAR module for obtaining physico-chemical features for the compounds. Based on normalized feature profiles, hierarchical and K-mean [57] clustering methods were used to cluster and select compound candidates. Since clustering based upon similarity requires a quantitative measure (descriptor) of the similarity between two molecules, 2D and 3D methods were used to generate a descriptor such as the atom pair descriptor (i.e. compound topological similarity) [58].
To handle the vast results from virtual screening and use more specific information for protein-ligand binding, we utilize the empirical energy function from GEMDOCK [26]
specifically optimized for virtual screening of ligands. GEMDOCK uses piecewise linear potential (PLP), a simple scoring function (Fig. 9), comparable to similar scoring functions for estimating binding affinities [60, 61]. Our previous works showed a comparison of GEMDOCK and other docking methods for 100 protein-ligand complexes and two virtual screening targets [49-50]. In addition, GEMDOCK has been successfully applied to identify inhibitors and binding sites for some targets [6, 8]. Here, we utilize the PLP of GEMDOCK to generate the protein-ligand interaction profiles.
Figure 9. The linear energy functions of the pairwise atoms for the steric interactions and Hydrogen bonds in GEMDOCK (bold line) with a standard Lennard-Jones potential (thin line) Yang et al. [26].
26
To demonstrate the efficiency of our method we successfully applied its combinative two-stage concept in two separate post screening analysis studies. In the first study (sections 4.2, 4.3) two compound sets (testing and verifying) were designed to determine if the protein-ligand interaction descriptor is suitable for identifying compounds with similar binding modes. The two sets were also used to determine if the compound structure descriptor is suitable to identify similar structure compounds and to evaluate the database enrichment potential and the property of compounds in the same cluster by docking a diverse set of compounds spiked with known active compounds into the same target protein.
4.2 Materials and Methods
The Two-Stage Combinative Clustering (TSCC) Methodology
The overview of TSCC concept in our first study is shown in Figure 10. We first calculated the atom-based protein-ligand interactions by converting every docked pose into a one dimensional real number string in order to visualize and analyze large data obtained from virtual screening using Yang et al [26].
Figure 10. Overall process of TSCC in our first study (a) First stage clustering using P-L interactions generated via GEMDOCK. (b) Second stage clustering of first stage results using physico-chemical features. (Figure obtained from our published study [48]).
27
Due to protein-ligand interactions representation, we were able to evaluate the distance of binding modes between two docked poses and to carry out hierarchical clustering analysis.
Compounds with a similar binding mode were visualized and grouped into clusters [59]. In our structure based clustering section, each structure was represented by a one dimension atom-pair descriptor, an approach proposed by Carhart et al [58]. After analyzing the distance between active and non-active compounds, a reference threshold was decided for demarcating similar compounds (Fig. 11).
Figure 11. Designing the reference thresholds for protein-ligand interaction and atom-pair descriptor (Figure obtained from our published study [48]). The complementation between atom-pair descriptor and the protein-ligand interaction descriptor is also shown in this figure. The distance threshold of atom-pair descriptor obtained was 0.55 (tanimoto coefficient) and the threshold of distance of protein-ligand interaction descriptor was 0.39 (correlation coefficient).
28
We generated two sets of structure-based virtual screening results: 1) to verify if the protein-ligand interaction descriptor is suitable for identifying compounds with similar binding mode and 2) to evaluate the database enrichment potential and the property of compounds in the same cluster by docking a diverse set of compounds spiked with known inhibitors into the same target protein.
4.2.1 Preparation of Target Protein and Compound Databases
The Ligand binding site was defined as a collection of amino acids using a cutoff radius of 10Å from each atom on the bound ligand, since most studies in lead discovery use a cutoff radius between 8 to 12 Å. Structure files were stored as a PDB format for GEMDOCK analysis.
Compound databases
We constructed two compound sets for screening against each target protein: thymidine kinase (TK) PDB id: 1kim, estrogen receptor alpha-agonist (ERα) PDB id: 3ert, estrogen receptor alpha-antagonist (ERα) PDB id: 1gwr, human dihydrofolate reductase (hDHFR) PDB id: 1hfr, tern n9 influenza virus neuraminidase (NA) PDB id: 1mwe. The structures used were obtained from the database Comprehensive Medicinal Chemistry (CMC) and American Chemical Directory (ACD) and compounds with molecular weights between 200 and 800 D were chosen only based on the similar size of our active compounds. The active compounds (61 compounds, Appendix 1 a) were listed as the following: 1) TK: 10, 2) ER α antagonists:
11, 3) ER α agonists: 10, 4) hDHFR: 10, and 5) NA: 20. The two crystal structures of human estrogen receptors alpha have been intensively studied for their different functions (agonist 1GWR promotes coactivator binding while antagonist 3ERT blocks it) and ability to bind on the same site of the protein. The agonists play an important role in regulation of gene expression and prevention of osteoporosis while the antagonists have been used as treatment of hormone-dependent breast cancer [60, 61].
29
The testing dataset contained 990 randomly selected compounds combined with known active compounds for each target protein using a method from Bissantz et al [3]. This is a small scale public set of compounds used by studies in lead compound discovery. All compound structures were converted to mol formats and their hydrogen atoms removed using CORINA3.0 in order to be virtually screened by GEMDOCK. The active compound, target proteins and compound sets are available on our web at http://gemdock.life.nctu.edu.tw/dock/download.php.
4.2.2 Preparation of Virtual Screening Result for Cluster Analysis
GEMDOCK was substantially modified, in preparation for the docking of different complex poses and to predict the binding affinity for each compound in the dataset using two key functions: 1) The searching algorithm [49] and 2) The scoring function which is based on an empirical energy function [50].
GEMDOCK scoring function
The energy function can be expressed by the following terms and equations:
Etot = Ebind + Epharma + Eligpre (1) where Ebind is the empirical binding energy, Epharma is the energy of binding site pharmacophores (hot spots), and Eligpre is a penalty value if a ligand does not satisfy the ligand preferences. Epharma
and Eligpre (see Mining pharmacological consensuses subsection) help select active compounds by improving the number of true positives. The values of Epharma and Eligpre are set to zero if active compounds are not available. Thus, the empirical-binding energy (Ebind) is given as:
Ebind =Einter +Eintra +Epenal (2) where Einter and Eintra are the intermolecular and intramolecular energies, respectively, and Epenal is a large penalty value if the ligand is out of the range of the search box. For this study, Epenal was set to 10,000. The intermolecular energy is defined as:
∑∑ ( )
factor that converts the electrostatic energy into kilocalories per mole. The lig and pro denote the numbers of the heavy atoms in the ligand and receptor, respectively. F( )
rijBij is a simple atomic30
pairwise potential function (Fig. 9) as defined in our previous study [5] where
Bij
rij is the distance between atoms i and j with interaction type Bij formed by pair-wise heavy atoms between ligands and proteins, Bij is either a hydrogen bond or a steric state. We used the atom formal charge to calculate the electrostatic energy [3], which is set to 5 or −5, respectively. The intramolecular energy of a ligand is: dihed is the number of rotatable bonds in a ligand. We followed the work of Gehlhaar et al [9] to set the values of A, m, and θ0. For the sp3-sp3 bond, A = 3.0, m = 3, and θ0 = π; for the sp3-sp2 bond, A = 1.5, m = 6, and θ0 = 0. When known active ligands are available, GEMDOCK uses a pharmacophore-based scoring function (Equation 1). If known active compounds are not available LPelec and LPhb are set to zero and GEMDOCK uses a purely empirical-based scoring function (Equation 2). After all of the protein-ligand interactions were calculated, the atom interaction-profile weight of the target protein representing the pharmacological consensus of a particular interaction was given as:
N
where N is the number of known active compounds and fjkis the total interaction number of an atom j (in a protein) interacting with an atom of known active ligands with the interaction type k (e.g., hydrogen bonding or hydrogen-charged interactions). An atom j in the reference protein was considered a hot-spot atom when Qkj was more than 0.5.
4.2.3 Testing and Verifying Datasets
The lowest energy conformation was retained for generating the representative docked pose of each compound.
31
Generation of Descriptors (Protein-Ligand interaction descriptors)
We converted 3D docked conformations (poses) into a one dimension real number string by calculating the energy between each atom present on protein and ligand. The interaction energy of each atom j on a protein is defined as:
( )
2rij is the distance between atoms i and j with interaction type Bij formed by pair-wise heavy atoms between ligands and proteins, Bij is either a hydrogen bond or a steric state. These two potentials are calculated by the same function, although from different parameters; V1, . . . , V6. qi and qj are the formal charges and 332.0 is a factor that converts the electrostatic energy into kilocalories per mole. The lig and pro denote the number of heavy atoms on the ligand.
( )
rijBijF is a simple atomic pair-wise potential function.
Atom pair descriptors
Atom-pair descriptors are 2D topological descriptors counting the distance between two atoms as the shortest path of bonds [58]. The procedure for preparing atom pair descriptors:
1) Structure files in mol format 2) Remove hydrogen atoms
3) Convert to mol2 format via CORINA3.0
4) Calculate atom pair descriptors via AP generator (distance bins: 15) 5) Store in binary coding form.
A total of 825 (55 x 15) atom pair descriptors were generated for each molecular structure by removing all columns with zero values.
Reference Threshold for Protein-Ligand Interaction and Atom-Pair Descriptor
To design a reference threshold of protein-ligand interaction, a verifying dataset was used in establishing a reference threshold of distance by determining a maximum discrimination that
32
exists between similar and non-similar binding modes. The equation is as follows:
max intra-d t inter-d t / 2 Where t is the reference threshold, Cintra-d t< is the number of intra active compound pairs with the distance < threshold and Cinter is the number of compound pairs between active and non-active compounds.
The Cluster Analysis Method
First, we used a protein-ligand interaction descriptor for clustering compounds with similar binding modes and applied the correlation coefficients as similarity measurements. The following formula was used:
deviation of X. X is the ith value of X. n is the number of descriptors. We applied the standard i UPGMA clustering method for calculating the distance between two clusters while constructing the dendrogram. The formula is defined as:
The reference threshold was calculated from the verifying dataset using equation (2) to determine the number of clusters.
Second, we applied the AP descriptor for clustering compounds within each clustering stage and applied the tanimoto coefficients as similarity measurements. Formula is as follows:
tani xy
X Y D = X IY
U (10)
33
where
D
xytani is the tanimoto distance between X and Y. X YI is the number of ON bits common in both X and Y, and the X YU is the number of ON bits present in either X or Y. This equation is similar to equation (4); Dxycorr byDxytani . The dendrogram graph was plotted for visualizing the binding mode of multi docked poses by the protein-ligand interactions.4.3 Results
Molecular Recognition
I. Thymidine kinase (TK)
The significance of TK as a target in computer-aided drug design is its involvement in the phosphorylation of nucleosides or nucleoside analogs [62]. Various antiviral drugs attack the replication of the viral genome with nucleoside analogs. These analogs are activated by phosphorylation with TK and prevent DNA synthesis by the introduction of a chain-terminating nucleoside at the 3’ end of the growing DNA strand. Thus, we screened against this target and choose the crystal coordinates of TK (Appendix 1a) in complex with its natural substrate (deoxythymidine). This is reasonable since the active site can accommodate a broad variety of ligands. The average RMSD of all ten docked poses was 1.39 Å. (Table 2)
II and III. Estrogen receptor α (ER α antagonists and agonists)
Estrogens contribute to the maintenance of bone tissue through a process involving bone resorption and bone formation [60] which makes for another appropriate target. The target protein structures of ERα (Appendix 1a) were obtained from PDB, whereas antagonists and agonists were derived from previous studies [3, 63]. We docked four antagonists into the target protein (3ert) and four agonists into another one (1gwr), and concluded their results based on the root mean square deviation (RMSD) error in the heavy atoms ligand between the docked pose and the crystal structure. The average RMSD of docked antagonists and agonists was 1.42 Å.
The RMSD values of 1hj1.AOE and 1qkm.GEN were larger than 2.0 Å because the native proteins were crystal structures of Erβ-ligand complexes. (Table 2)
34
Table 2. The RMSD between docked poses and crystal ligands [48]
TK (1kim) ER (3ert, 1gwr) DHFR (1hfr) NA (1mwe) 1e2k.TMC 0.69 1err.RALa 1.27 1boz.PRD 1.13 lig1l7f_BCZ 0.88 1e2m.HPT 0.51 3ert.OHTa 0.71 1dlr.MXA 0.62 lig1nnc_GNA 0.75 1e2n.RCA 1.34 1hj1.AOEa 3.13 1dls.MTX 1.53 lig2qwf_G20 0.60
TK (1kim) ER (3ert, 1gwr) DHFR (1hfr) NA (1mwe) 1e2k.TMC 0.69 1err.RALa 1.27 1boz.PRD 1.13 lig1l7f_BCZ 0.88 1e2m.HPT 0.51 3ert.OHTa 0.71 1dlr.MXA 0.62 lig1nnc_GNA 0.75 1e2n.RCA 1.34 1hj1.AOEa 3.13 1dls.MTX 1.53 lig2qwf_G20 0.60