Chapter 3: Post-analysis of Virtual Screening
3.1 Cluster analysis of Structure-based Virtual Screening by Using Protein-ligand Interactions
3.1.1 Introduction
With the recent development of high-throughput X-ray crystallography, the total number of structures will grow at an even greater speed81. And the enormous advances in genomics have resulted in a large increase in the number of potential therapeutic targets that are available for investigation. This growth in potential targets has increased the demand for reliable target validation, as well as technologies that can identify rapidly several quality lead candidates. Virtual screening methods are a primary source for the discovery of lead molecules for drug development, with high-throughput docking algorithms being among the most extensively used of these methods. The application of virtual high-throughput screening82; 83, to the drug discovery process invariably produces a large number of potential lead candidates. And it is well known that current scoring functions used in virtual screening campaigns are often inadequate at predicting the true binding affinity of a ligand for a receptor84. These prospective ligands need to be filtered in order to reduce their number for more precise and labor-intensive studies.
The purpose for utilizing post-analysis is to minimize the number of false positives in the selection list and to propagate the true hits to the top of the list. One of the post-analysis methods such as clustering based upon structural similarity can nonetheless generally improve the
63
performance of the scoring function8; 85 . Clustering molecules based upon similarity requires some quantitative measure (descriptor) of the similarity between two molecules. There are many different approaches to generate descriptor, include 2D and 3D methods. Most of the 2D methods have focused on representing a molecule based upon its own structural and chemical composition, like Atom-Pair. But it regardless the information from protein that is important in the field of structure-based drug designs. Deng and co-workers8 described a novel approach to representing the properties of a ligand. As opposed to calculating the properties of a ligand from the perspective of its own structural and chemical components, the Structural Interaction Fingerprint (SIFt) method represents a ligand by the interactions it forms in the binding site of a protein. Using seven bits per binding-site residue to represent seven different types of interaction, the SIFt method encoded a ligand-protein interaction into a binary string. The types of interaction that considered are hydrogen bond and physical contact. Recently another approach proposed by Amari et al,7have developed a clustering tool for visualized cluster analysis of protein-ligand interaction (VISCANA) that analyzes the pattern of the interaction of the receptor and ligand on the basis of quantum theory. They applied the ab initio fragment molecular orbital (FMO)86 method for represent the interaction between protein and ligand, which used the ab initio electronic structure calculation of proteins and encoding each docked pose into real number string. But the FMO method needed to obtain more reliable descriptions of van der Waals interactions and hydrogen bonds that are important for receptor-ligand binding.
We developed a cluster method for post analysis to improve enrichment for VS. The method combines protein-ligand interactions (e.g. hydrogen bonds, electrostatic interactions, and van der Waals), which are generated by our well-developed docking tool (i.e. GEMDOCK), and physical-chemical features and structures for each compound candidate selected by GEMDOCK.
The physical-chemical features of a compound were described by atom pair descriptors (i.e.
compound topological similarity) proposed by Carhart et al. Based on these normalized feature profiles, hierarchical clustering methods were used to cluster these compound candidates. For each cluster, this method selected a representative compounds for biological tests. Our method was evaluated on five well-known drug targets, including thymidine kinase (TK), dihydrofolate reductase (DHFR), estrogen receptor agonist (ESA), estrogen receptor antagonists (EST) and neuraminidase (NA). We also practically applied our method for the screening of Helicobacter pylori shikimate kinase (HpSK) and the test the inhibitor activities of selected compounds in bioassay.
3.1.2 Materials and Methods
We developed a cluster method for post analysis to improve enrichment for VS (Figure 3.1.1A).
The method combines protein-ligand interactions (e.g. hydrogen bonds, electrostatic interactions, and van der Waals), which are generated by our well-developed docking tool (i.e. GEMDOCK2; 66), and physical-chemical features and structures for each compound candidate selected by GEMDOCK.
The physical-chemical features of compound structures were described by atom pair descriptors (i.e.
64
compound topological similarity) proposed by Carhart et al87; 88 (shown as Figure 3.1.1B). Based on these feature profiles shown in Figure 3.1.1A, hierarchical clustering methods were used to cluster these compound candidates. For each cluster, this method selected a representative compounds for biological tests.
The interactions of atom pairs on each protein-ligand complex were collected as a real number vector which the length and order were corresponded to atoms on the binding site of target protein (shown as Figure 3.1.1B). The structure of each compound was represented by the atom-pair descriptor which was proposed by Carhart et al.,88; 89 (shown as Figure 3.1.1B). The atom-pair descriptor approach describes the molecular structure by the bonding interval of all pairs of atom types. Hierarchical clustering analysis was employed for analyzing the profiles of interactions and compound structures on MATLAB90; 91. The post-processing of interaction and structure analyses separates the screening candidates into several meaningful groups and the compound with the lowest docked energy in each group was selected as the representatives for the bioassays (Figure 3.1.1B).
For evaluate the method of cluster analysis, we adopt this method for five important drug targets and the results of validation suggested a threshold of cutoff. This analysis process was also applied to the virtual screening of Helicobacter pylori shikimate kinase (HpSK). The biological assays reported the discovery of a new inhibitor structure from five canadidates.
Virtual Screening Top rankers
Integrate interaction and structure analyses and select lowest energy conformations as representative structures
N
Figure 3.1.1. Cluster analysis for post-processing of virtual screening. (A) The flowchart of cluster
analysis. The top ranks in screening are selected for cluster analysis. Each compound is grouped by its interaction and molecular properties. (B) The cluster analyses of interaction and molecular properties. The interaction analysis uses the hydrogen bonding, electrostatic and van der Waals interactions of atom pairs on the protein-ligand complexes to cluster similar docked poses. The molecular property analysis used atom-paired descriptors to cluster similar molecular structures.65
Preparations of the target proteins and screening sets
The compound set for virtual screening was prepared by selecting them from the CMC database in May 2004 based on two criteria: (1) molecular weight ranging between 200 and 800, and (2) no compounds with multiple components. A set comprising 5,331 compounds was eventually obtained.
The five sets of target specific compounds were from the literatures (Thymidine kinase (TK) and human estrogen receptor alpha (ERα) antagonists17, ERαagonist92, human dihydrofolate reductase (hDHFR)10 and Neuramindase (NA)93). The other additional 990 compound set was randomly selected from ACD database for validation10; 17. All 3D structures of compound were generated by CORINA3.0 and shown in Figure 3.1.2.
The structure of the protein binding pocket on TK, ERα (for antagonist and agonist), hDHFR, NA and SK were isolated and prepared for the GEMDOCK. The structure of the binding pocket in the pre-described ligand-bound conformation (PDB code 1kim14, 3ert94, 1gwr94, 1hfr95, 1mwe96 and 1zui97) including including amino acids enclosed within a 10 Å radius sphere centered on the bound ligand, were used. The coordinates of protein atoms were taken from the PDB for the screening processing. GEMDOCK docked each compound in the data sets against these binding cavities, and ranked each compound by docked energy of the docked conformation. To validate our method, compounds for five validated set were chosen for cluster analysis. According to the ranking, top ranked compounds for SK were chosen for cluster analysis and in vivo biological activity tests to validate their inhibitory activities.
66
Figure 3.1.2. Active compound structures of validation sets. The compounds of each target are
denoted in the abbreviation of protein and numbers. The compound co crystallized in the complex of PDB is further denoted in the PDB code of the protein and the ligand.Docking method and scoring function
Our previous works 2; 9 have showed that the docking accuracy of GEMDOCK was better than some well-known docking tools, such as GOLD 5 and FlexX3, on a diverse data set of 100 protein-ligand complexes proposed by Jones et al.5 The screening accuracy of GEMDOCK were also better than GOLD, FlexX, and DOCK on screening the ligand database from Bissantz et al. (2000) for the thymidine kinase 28 and the estrogen receptor 9. In this study, GEMDOCK parameters in the flexible docking included the initial step sizes (
σ
=0.8 andψ
=0.2), family competition length (L = 2), population size (N = 300), and recombination probability (pc= 0.3). For each ligand screened,
GEMDOCK optimization stopped either when the convergence was below a certain threshold value or the iterations exceeded the maximal preset value of 60. For the latter case, GEMDOCK will produce 800 solutions in one generation and terminated after it exhausted 48,000 solutions for each compound in the screening set.The screening quality of docking methods using energy-based scoring functions alone is often influenced by the structure of the ligand being screened (e.g., the numbers of charged and polar atoms). These methods are often biased toward charged polar compounds due to the pair-atom
67
potentials of the electrostatic energy and hydrogen-bonding energy. In order to reduce this ill effect, GEMDOCK could evolve the pharmacological preferences from a number of known active ligands or from domain knowledge to take advantage of the similarity of a putative ligand to those that are known to bind to a protein’s active site, thereby guiding the docking of the putative ligand9. GEMDOCK could use either a purely empirical scoring function2 or pharmacophore-based scoring function9. When GEMDOCK used a pharmacophore-based scoring function, some known active ligands (more than two) or domain knowledge are required for evolving the pharmacological consensus according to our previous results. The empirical-binding energy (Ebind) is the sum of the intermolecular (Einter) and intramolecular energies (Eintra), respectively 2. The pharmacophore-based energy function 9 is the sum of three energy items, including the empirical binding energy (Ebind), the energy of binding site pharmacophores (Epharma), and a penalty value (Eligpre) if a ligand does not satisfy the ligand preferences. Epharma and Eligpre are especially useful in selecting active compounds from hundreds of thousands of non-active compounds by excluding ligands that violate the characteristics of known active ligands (or domain knowledge).
All ligands are docked into the target protein binding pocket and the atom based protein-ligand interaction descriptors is generated from the docked poses corr. The interactions of atom pairs on each protein-ligand complex were collected as a real number vector which the length and order were corresponded to atoms on the binding site of target protein
Atom pair descriptors
The method of atom-pair descriptor is to describe the molecular topology by counting the shortest path of valence bonds between two atom types. The definition of atom type is show as Table
3.1.1. Atom pair descriptors was generated from the self-developed atom-pair generator program,
and the methodology was proposed by Carhart et al.,88; 89. Two major components for constructing a set of atom-pair descriptors include the definition of atom type and the number of distance bins between two atom types. An atom-pair is a simple type of substructure defined in term of the atom types and the shortest path graph distance between two atoms. The graph distance is defined as the smallest number of atoms along the path connecting two atoms in a molecular structure. The formula of an atom-pair is as atom type i—(distance)—atom type j. Where the distance is the valence bond distance between the atoms type i and j in the case of a 2D atom-pair description. We clustered the SYBYL 23 atom types into 10 atom types (Table 3.1.1) in order to reduce the number of atom-pair descriptors and improve the accuracy. Our settings of atom-pair approach followed the previous research of Hans Matter98. The maxim um of valence bond distance was set as 14 in this study and a total of 825 (55 x 15) atom-pair descriptors were generated for each molecular structure99. The representation of 825 atom-pair descriptors is bit string.68
Table 3.1.1. Atom types used in the atom-pair descriptors
Description Atom type Atom type in SYBYL Mol2
Aromatic carbons C.ar C.ar
Nonaromatic carbons C.na C.3 C.2 C.1 C.cat
Aromatic nitrogen N.ar N.ar
Nonaromatic nitrogen N.na N.3 N.2 N.1 N.am N.4 N.pl3
Oxygen atoms in the sp3 hybridization state O.3 O.3 Oxygen atoms in the sp2 hybridization state O.2 O.2
All sulfur atoms S S.3 S.2 S.O S.o2
Phosphorus atoms P.3 P.3
Halogen atoms X F Cl Br I
Hierarchical cluster method
Hierarchical methods have the advantage of building up an interpreTable 3.1.relationship between the clusters. Hierarchical clustering analyses were carried out with MATLAB91. We removed the column with zero in each ligand before calculating the distance of interaction and structure vectors. The similarity distance of the protein-ligand interaction vectors are measured by the correlation coefficients. Formula was as followed.
( )( )
( )
1
11
n
i i
corr i
xy
x y
X X Y Y
D n S S
=
− −
= − −
∑
(3.1.1)
where Dxycorr is the correlation distance between docked poses X and Y. The Sx is the standard deviation of X. Xi is the ith value of X. n is the number of atoms in the binding cavity. The similarity distance of atom-pair descriptor strings is measured through the tanimoto coefficients. The formula was as followed.
tani xy
D X Y
= X Y I
U
(3.1.2)where
D
xytani is the tanimoto distance between X and Y. X YI is the number of ON bits common in both X and Y, and the X YU is the number of ON bits present in either X or Y. The standard UPGMA clustering method is adapted for calculating the distance between two clusters while constructing the dendrogram. The formula was as followed.(3.1.3)
69
The reference threshold was calculated from verifying dataset by Equation 3.1.4 for determining the number of clusters. is the method for measuring similarity. For interaction analysis, is
corr
D
xy . For molecular structure, isD
xytani. The dendrogram graph was plotted for visualizing the binding mode of multi docked poses by protein-ligand interaction.3.1.3 Results and Discussion
We applied the cluster analysis method on five datasets for validation. The verifying dataset was constructed by cross-docking of all 61 known active compounds against 5 target proteins. The testing dataset for virtual screening was constructed by docking known active compounds and 990 randomly selected compounds against 5 target proteins. GEMDOCK was adapted to produce the docked prediction and score the docked poses. The molecular recognitions of five classes of active compounds for target protein were shown in Figure 3.1.3. The average RMSD of docked poses and crystal conformations were below 2.0Å and the details were shown in Table 3.1.2. The residues of pharmacological consensus were used in the docking procedures but not in the scoring and clustering procedures.
A B C
D E
Y101-OH
R163-NH1 Y172
Q125-NE2 Q125-OE1
D351-OD1 H524-ND1
R394-NH2 E353-OE2
H524-ND1 R394-NH2
E353-OE2
G420-O
R152-NH1 R371-NH1 R371-NH2
R292-NH2 R292-NH1
E276-OE2 I7-O
V115-O
R70-NH1 R70-NH2
E30-OE2
E30-OE1
Figure 3.1.3. Molecular recognition on (A) hDHFR (1hfr), (B) TK (1kim), (C) NA (1mwe), (D)
ERα-agonist form (1gwr), (E) ERα-antagonist form (3ert). The residues of pharmacological consensus are label with their numbers. The hydrogen bonding of docked pose and residues are denoted as green dash lines.70
Table 3.1.2. The RMSD values of docked poses and crystal ligands
TK (1kim) ER (3ert, 1gwr) hDHFR (1hfr) NA (1mwe)
Complex RMSD Complex RMSD Complex RMSD Complex RMSD
1e2k.TMC 0.69 1err.RAL
a 1.271boz.PRD 1.13 lig1l7f_BCZ
0.881e2m.HPT 0.51 3ert.OHT
a 0.711dlr.MXA
0.62lig1nnc_GNA 0.75 1e2n.RCA 1.34 1hj1.AOE
a3.13 1dls.MTX
1.53lig2qwf_G20 0.60 1e2p.CCV 0.67 1uom.PTI
a0.81 1drf.FOL
1.24lig1bji_G21
0.811ki2.GA2
3.041gwr.EST
b 0.711hfr.MOT
0.51lig1f8b_DAN 0.64 1ki3.PE2
3.211l2i.ETC
b 0.521kms.LIH
1.36lig1f8c_4AM 0.46 1ki6.AHU 0.37 1qkm.GEN
b2.92 1kmv.LII
0.83lig1f8d_9AM 0.59 1ki7.ID2
0.493erd.DES
b 1.321mvs.DTM 0.75 lig1f8e_49A
0.601kim.THM 0.41 1ohj.COP 1.27 lig1ina_ST6
0.792ki5.AC2
3.142dhf.DZF
1.12lig1ing_ST5
1.03lig1inw_AXP 0.93 lig1inx_EQP 0.92 lig1ivc_ST2
2.09lig1ivd_ST1
1.02lig1ive_ST3
1.03lig1mwe_SIA 0.52 lig1xoe_ABX 1.33 lig1xog_ABW 2.42 lig2qwg_G28 0.80 lig2qwh_G39 0.74 Average
RMSD
1.58
Average RMSD
1.42
Average RMSD
1.03
Average RMSD
0.95
a Four antagonists dock into target protein (3ert)
b Four agonists dock into target protein (1gwr)
Significance test of descriptors
t-test for protein-ligand interaction descriptor
We verified whether the protein-ligand interaction descriptors could identify the similar binding poses from data. The similar binding poses were defined as the poses of active compounds against its target protein, and the dissimilar binding poses was defined as the poses of other compounds docked into the same target protein. For t-test, the H0 is that there are no differences under the representation of interaction descriptors and the results were listed in Table 3.1.3. The t-scores are calculated as the standard two sample t-test statistics:
71
The result of t-test showed that the representation of interaction descriptors for docked pose could show the difference of active and inactive ones.
Table 3.1.3. t-test for the interaction representations of similar and dissimilar binding poses (α=0.01).
Target
a
H
0 is that there are no differences in the representation of descriptorsa Standard Deviation
For the needs of post-analysis, we further checked whether the interaction representations could discriminate the active compounds and other four groups of non-active compounds in the docked interactions and conformations. Table 3.1.4 showed the details of tests. Docked poses on DHFR and NA fully passed all the tests, but the other poses on TK, ESA, and EST didn’t. We checked the docked poses on these proteins and found that the reason maybe came from the fused ring structures of ligands. ESA and EST were specific on the steroid compounds for their hormonal regulation functions; TK was specific on the thymidine base compounds for the kinase functions. But the skeleton of rings on the lignads of them were shared the closed lengths (Figure 3.1.2) and this phenomenon also showed on the distance of atom-pair descriptors.
72
Table 3.1.4. t-test for the interaction representations of five groups of compounds against target
proteins(α=0.01)a
H
0 is that there are no difference in the representation of descriptorsb Standard Deviation
t-test for atom-pair descriptor
The t-tests were also used for check the representations of atom-pair descriptors on compound structures. The similar structures were defined as the active compounds against its target protein, and the dissimilar structures were defined as the other compounds against the same target protein. For
t-test, the H
0 is that there are no differences under the representation of atom-pair descriptors and the results were listed in Table 3.1.4.73
Table 3.1.4. t-test for the atom-pair representations of similar and dissimilar structures (α=0.01).
Target
protein
H
0a Similar : Average DistanceDissimilar : Average Distance
P-value
Similar : Stdb of Distance
Dissimilar : Stda of Distance
DHFR Reject 0.42 0.63 5.84E-23 0.15 0.12
ESA Reject 0.24 0.66 4.60E-65 0.11 0.14
EST Reject 0.27 0.63 2.85E-56 0.14 0.14
NA Reject 0.32 0.65 1.75E-131 0.18 0.17
TK Reject 0.22 0.63 2.11E-93 0.09 0.19
a
H
0 is that there are no difference in the representation of descriptorsb Standard Deviation
Cutoff of hierarchical cluster for verifying datasets
The cluster sizes and members of hierarchical cluster method were depended on the determination of cutoff distance. We proposed an approach for cutoff determinate by validation sets.
We decided the cutoff according to maximize the true positive rates and minimize the false positive rates in the cluster analyses. In the cluster analyses of validate sets, we defined the true positive as the active compounds for its pharmacological target and the others compound were defined as false positives. The true positives were further denoted as intra set and the false positives were denoted as inter set. To maximize the true positives at a given distance threshold t, we defined an equation as followed
max intra-d t inter-d t / 2
intra inter
C C
C C
< >
⎛⎛ ⎞ ⎞
⎜⎜ + ⎟ ⎟
⎜⎝ ⎠ ⎟
⎝ ⎠ (3.1.7)
where t is the given distance threshold.
C
intra-d t< is the number of intra active compound pairs which had the distances < threshold t.C
inter is the number of compound pairs between active and non-active compounds. The threshold t of interaction and structure clusters were tested from 0 to 1 on five protein targets. For interaction cluster, the distance threshold t at 0.39 of correlation coefficient measurement had the maximum discrimination (88.9%). For structure cluster, the maximum discrimination (91.5%) was at the t=0.55 of tanimoto coefficient measurement (Figure3.1.4)
74
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1.0 intra_DHFR1
intra_DHFR2 intra_ESA intra_EST intra_NA1 intra_NA2 intra_TK inter_DHFR1 inter_DHFR2 inter_ESA inter_EST inter_NA1 inter_NA2 inter_TK
Compound structure
P ro te in -lig an d in te ra ct io n
Figure 3.1.4. The normalized interaction and structure distances of five classes of compounds. The
distance distributions of intra datasets are labeled in red, and the inter datasets are labeled in black.The cutoff distance of atom-pair descriptor for hierarchical cluster is 0.55 (tanimoto coefficients), and protein-ligand interaction descriptor is 0.39 (correlation coefficients).
Cluster analysis for five classes of protein targets
We demonstrate the characters of the cluster analysis through using the single feature representation.
Cluster of protein-ligand interaction
hDHFR the result of interaction clustering was shown on Figure 3.1.5. Three major clusters were
identified as cluster c, d and e. First, all active compounds were group together into the cluster a, (Figure 3.1.5A), All hDHFR ligands in cluster b had hydrogen-bond (E30-OE1, E30-OE2, V115-O, I7-O), van der Waals force (I60-CAR, F31-RING), the binding interactions of each docked poses within the cluster b were similar. The cluster c contained 6 TK ligands and one NA ligand, and all docked poses of cluster d were NA ligands (Figure 3.1.5A). Ligands in cluster d had hydrogen-bond (Y121-O, I7-O), and all docked poses with in cluster e had hydrogen-bond (E30-OE1, V8-N). There75
were two types of hDHFR active compounds (Figure 3.1.2) 9, the DHFR03, 04, 05, 09, 10, belonged to old types and had two more carboxylic acid group. The old drugs had different binding affinity
were two types of hDHFR active compounds (Figure 3.1.2) 9, the DHFR03, 04, 05, 09, 10, belonged to old types and had two more carboxylic acid group. The old drugs had different binding affinity