Chapter 2. Related works
2.4 Summary
As the number of protein structures increases rapidly, structure-based drug design and virtual screening approaches are becoming important and helpful in lead discovery1-2,6. A number of docking and virtual screening methods 16,23-24,35 have been utilized to indentify lead compounds, and some success stories have been reported 1-2,4-5,7-8,10. However, identifying lead compounds by exploiting thousands of docked protein-compound complexes is still a challenging task. The major weakness of virtual screenings is likely due to incomplete understandings of ligand binding mechanisms and the subsequently imprecise scoring algorithms . In the related works, several studies were proposed for improving the accuracy and precision in the VS processes. First, the scoring function of GEMDOCK evolves the pharmacological preferences from a number of known active ligands to take advantage of the similarity of a putative ligand to those that are known to bind to a protein’s active site, thereby guiding the docking of the putative ligand. In the post-screening analysis process, the consensus scoring strategy using data fusion and exploring diversity on scoring characteristics between individual scoring functions for improving VS is proposed. When the huge amount of VS data needs to be interpreted, the combinative cluster analysis is applied for effectively mining the representatives and easily visualizing the VS data. Although we have been successfully applied these methods on the VS studies of two important virus targets, dengue virus and influenza virus, some shortcomings are needed to be addressed.
17
Chapter 3
Site-moiety map for recognizing interaction preferences between protein pockets and compound moieties
3.1 Introduction
Most of docking programs16,23-24 use energy-based scoring methods which are often biased toward both the selection of high molecular weight compounds and charged polar compounds in the top ranks. Meanwhile, these approaches generally cannot identify the key features (e.g., pharmacophore spots) that are essential to trigger or block the biological responses of the target protein. Although pharmacophore techniques27 have been applied to derive the key features, these methods require a set of known active ligands that were acquired experimentally. Therefore, the more powerful techniques for post-screening analysis to identify the key features through docked compounds and to understand the binding mechanisms provide a great potential value for drug design.
To address these issues, we presented the SiMMap method to infer the key features by a site-moiety map describing the relationship between the moiety preferences and the physico-chemical properties of the binding site. This method also provides the web server for public access. According to our knowledge, SiMMap is the first public server that identifies the site-moiety map from a query protein structure and its docked (or co-crystallized) compounds. The server provides pocket-moiety interaction preferences (anchors) including binding pockets with conserved interacting residues; moiety preferences; and interaction type. We verified the site-moiety map on three targets, thymidine kinase, and estrogen receptors of antagonists and agonists. Experimental results show that an anchor is often a hot spot and the site-moiety map is useful to identify active compounds for these targets. We believe that the site-moiety map is able to provide biological insights and is useful for drug discovery and lead optimization.
3.2 Method
Figure 3.1 presents an overview of the SiMMap server for identifying the site-moiety map with anchors, describing moiety preferences and physico-chemical properties of the binding site, from a query protein structure and docked compounds. The server first uses checkmol (http://merian.pch.univie.ac.at/~nhaider/cheminf/cmmm) to recognize the compound moieties
18
a b
…
O S
HO O
OSO HO
S HOO
O
R222 H58 R222 E225 W88 Y172 R163 Q125
E1 H1 V1 H2
c
d
Step 1: Query a target protein structure and its docked (co-crystallized) compounds
Step 2: Generate protein-compound interaction profiles and identify compound moieties
Step 3: Derive an anchor candidate by identifying a pocket with significant interacting residues and moieties with Z-score ≥ 1.645
Step 4: Determine anchors by grouping neighbor anchor candidates with same type. For each anchor, identify its binding pocket, top-significant
interacting residues, moiety preferences, and anchor type
Step 6: Output graphically site-moiety map; anchors with moiety structures and
compositions; and pocket-moiety interactions.
Step 5: Determine site-moiety map with anchors and rescore compounds
Figure 3.1. Overview of the SiMMap server for the site-moiety map using herpes simplex virus type-1 thymidine kinase (TK) and 1000 docked compounds as the query. (a) Main procedure;
(b) The merged protein-compound interaction profile; (c) The pocket-moiety interaction preferences of the anchors: H (hydrogen-bonding). Each anchor consists of a binding pocket with conserved interacting residues, the moiety composition and anchor type; The site-moiety map has one hydrogen-bonding (H) and three van der Waals (V) anchors for ER. Each anchor contains the moiety structures and composition, anchor type, and key residues in the binding pocket.
19
and utilizes GEMDOCK35 to generate a merged protein-compound interaction profile (Fig.
3.1b), including electrostatic (E), hydrogen-bonding (H) and van der Waals (V) interactions.
According to this profile, we infer anchor candidates by identifying the pockets with significant interacting residues and moieties with Z-score ≥ 1.645. The neighbor anchor candidates, which are the same interaction type and the distances between their centers are less than 3.5Ǻ, are grouped into one anchor. These anchors form the site-moiety map describing interaction preferences between compound moieties and the binding site of the query (Figs.
3.1c and 3.1d). Finally, this server provides graphic visualization for the site-moiety map;
anchors with moiety structures and compositions; pocket-moiety interactions; and the relationship between anchors and moieties of query compounds.
3.2.1 Definitions of site-moiety map, anchor and pocket
The anchor (pocket-moiety interaction preference) is the core of a site-moiety map. An anchor possesses three essential elements: (1) a binding pocket with conserved interacting residues and specific physico-chemical properties; (2) moiety preferences of the pocket; (3) pocket-moiety interaction type (E, H, or V). An anchor can be considered as "key features" for representing the conserved binding environment element or a "hot spot" which involves biological functions. In addition, we regard a binding pocket, which consists of several residues significantly interacting to compound moieties, as a part of the binding site. The binding pocket often possesses specific physico-chemical properties and geometric shape to bind preferred moieties. The site-moiety map, which can help to assemble potential leads by optimal steric, hydrogen-bonding, and electronic moieties, is useful for drug discovery and understanding biological mechanisms.
3.2.2 Constructing site-moiety map
The SiMMap server performs six main steps for a query (Fig. 3.1a). Here, we used TK as an example for describing these steps.
Generating protein-compound interaction profiles and identification of compound moieties
First, users input a protein structure and its docked compounds. The server used checkmol to identify moieties of docked compounds and GEMDOCK to generate E, H and V interaction profiles. For each profile, the matrix size is N×K where N and K are the numbers of
20
compounds and interacting residues of query protein, respectively. An interaction profile matrix P(I) with type I (E, H, or V) is represented as
where pi,j is a binary value for the compound i interacting to the residue j (Fig. 4.2B). For H and E profiles, pi,j is set to 1 (green) if an atom pair between the compound i and the residue j forms hydrogen-bonding or electrostatic interactions, respectively; conversely, the interaction is set to 0 (black). For van der Waals (vdW) interaction, an interaction is set to 1 when the energy is less than -4 (kcal/mol).
SiMMap identified consensus interactions between residues and compound moieties with similar physical-chemical properties through the profiles. For each interacting residue (a column of the matrix P(I)) (Fig. 3.1b), we used Z-score value to measure the interacting conservation between this residue and moieties. The standard deviation (σ) and mean (μ) were derived by random shuffling 1,000 times in a profile. The Z-score of the residue j is defined as
σ
We treated protein-compound interactions as a binomial distribution, and then consensus interactions with statistical significance could be identified by their normal approximation.
Statistically, a binomial distribution is approximated by a normal distribution when either p≤0.5 and np>5 or p>0.5 and n(1 − p)> 5, where n is the number of trials and p is the probability of success. Here, n is the number of selected compounds and p is the probability of forming an interaction between a protein and a compound, that is, pi,j=1. Typically, the p values ranged between 0.01 and 0.03 in this study. While the binomial distribution is a normal approximation, at least 500 compounds should be selected for constructing an interaction profile matrix.
Deriving anchor by identifying a pocket with significant interacting residues and moieties Spatially neighbor interacting residues and moieties with statistically significant Z-score ≥ 1.645 were referred as an anchor candidate. Neighbor anchor candidates, which are spatially overlapped and the same anchor type, were clustered as an anchor and the anchor center is the
21
weighted geometric center of their interacting compound moieties. Here, two anchors were merged if the distance of two anchor centers is less than 3.5 Ǻ. In each anchor, top three residues with the highest Z-score values were regarded as key residues forming a binding pocket. For each anchor, we identified its moieties of docked compounds according to the moiety library derived from checkmol, and calculated the moiety composition (Fig. 3.1c).
These anchors form the site-moiety map (Fig. 3.1d) of the query.
Outputting graphically site-moiety map and identifying active compounds
SiMMap can be applied to identify active compounds for structure-based virtual screening.
One of weaknesses of virtual screening is likely incomplete understanding of the chemistry involved in ligand binding and the subsequently imprecise scoring algorithms. When a compound highly agrees with the anchors of the site-moiety map, this compound often activates or inhibits the target. The SiMMap server scores a compound by combining predicted binding energy of GEMDOCK and the anchor score between the map and the compound. The SiMMap score, S(i), for a compound i is defined as
∑
= + −= na a
M i i E
AS i
S 1 (0.5)
) 001 . 0 ( ) ( )
( (1)
where ASa(i) is the anchor score of compound i in the anchor a, n is the number of anchors, E(i) is the docked energy of compound i, and M is the atom number of compound i. The anchor score is set to 1 when the compound i agrees the moiety preference of the anchor a. Here, the anchor score and the term M0.5 are useful to reduce the deleterious effects of selecting high molecular weight compounds26. Based on SiMMap scores, we can obtain new ranks of query compounds.
3.3 Results 3.3.1 Web service
SiMMap is an easy-to-use web server (Fig. 3.2). Users input a protein structure without ligands in PDB format and its docked or co-crystallized compounds in MDL mol, SYBYL mol2, or PDB format (Fig. 3.2a). These docked compounds should be generated by any external docking methods (e.g., DOCK, FlexX, GOLD and GEMDOCK) before users uploaded these compounds. Typically, the SiMMap server yields a site-moiety map within 5 minutes if the number of query compounds is less than 100. This server provides the graphic
22
a
b d
… …
c
Figure 3.2. The SiMMap server analysis results using estrogen receptor (ER) and 1000 docked compounds as the query. (a) The user interface for uploading target protein structure and docked compounds. (b) The site-moiety map has one hydrogen-bonding and three van der Waals anchors for ER. Each anchor contains the moiety structures and composition, anchor type, and key residues in the binding pocket. (c) The details of moiety structures and residue-moiety interactions in the H1 anchor. (d) The SiMMap scores, ranks and the relationships between anchors and moieties of query compounds.
visualization of the site-moiety map and anchors elements, including a binding pocket with interacting residues, moiety compositions and structures, numbers of involved compounds, and anchor types (Fig. 3.2b). For each anchor, this server shows docked conformations of compounds and the detailed atomic interactions between pocket residues and moieties (Fig.
3.2c). In addition, SiMMap shows the new rank and compound moiety structures fitting the anchors for each query compound (Fig. 3.2d). SiMMap uses two open source tools for graphic visualization: Jmol (http://www.jmol.org/) for displaying three-dimensional protein and
23
compound structures with anchors and OASA (http://bkchem.zirael.org/oasa_en.html) for visualizing compound structures. The server allows users to download the anchor coordinates in the PDB format; interaction profiles; new ranks and anchor scores of query compounds.
3.3.2 Thymidine kinase and estrogen receptor
The SiMMap server inferred the site-moiety map of TK. This map consisted of four anchors (i.e., E1, H1, H2, and V1 (Fig. 3.1d) and the moiety composition and conserved interacting residues of each anchor (Fig. 3.1c). For example, the E1 anchor possesses a binding pocket with residue R222, and three moiety types (i.e., sulfuric acid monoester (40%), carboxylic group (35%) and phosphoric acid monoester (25%)) derived from 57 compounds. Meanwhile, the E1 includes the phosphate moiety of ATP and its residue R222 playing a major role to interact with the substrate 38-39. The preferred moiety types of an anchor are suitable groups interacting to conserved residues of the binding pocket. The moiety preference is able to guide the suggestion of functional group substitutions for lead structures.
We used estrogen receptor (ER), a therapeutic target for osteoporosis and breast cancer40, as the example. Based on 1000 docked compounds and ER, the SiMMap server identifies four anchors (H1, V1, V2, and V3) and provides moiety preferences and compositions in these anchors (Fig. 3.2b and 3.3). The H1 anchor comprises three residues (E353, L387, and R394) and five main moiety types: hydroxyl group (36%), carboxylic acid (16%), amine (7%), ketone (7%), and sulfuric acid monoester (6%) summarized from 319 compounds. Furthermore, three residues (L346, T347, and L525) and 839 compounds are involved in the V1 anchor, preferring five moiety types (i.e., aromatic ring (49%), heterocyclic group (22%), alkenes (11%), phenol (8%), and oxohetarene (4%)). The anchor V2 is a hydrophobic pocket containing L346, F404, and L387, and the former two re sidues are highly conserved41. These hydrophobic residues interact with aromatic ring (52%), heterocyclic group (23%), phenol (12%), alkenes (5%), and oxohetarene (3%). Finally, aromatic rings (55%), heterocyclic groups (17%), alkenes (11%), and phenols (9%) summarized from 560 compounds often form vdW contacts with the long side chains of M343, M421, and L525 in the anchor V3. The ring groups of antagonists are often stabilized by the side chains of M343, L346, T347, L387, M421, and L525. In this case, most selective estrogen receptor modulators of ER (e.g., EST_01 (raloxifene), EST_06 (LY-326315,) and EST_05 (EM-343)) agree with these four anchors (Fig. 3.2d and 3.3c). Anchors
24
Anchor SiMMap 22 ligands
a
Figure 3.3. The relationships between the site-moiety map and 22 co-crystallized ligands of ER.
(a) The mapping between four inferred anchors (binding pocket with conserved interacting residues) and these 22 ligands in the active site. (b) The moieties of these 22 ligands in each anchor. Black cell presents that the moiety of the compound does not agree with the anchor H1.
(c) The moiety compositions of 1000 docked compounds (SiMMap) and these 22 ligands.
Table 3.1. The relationship between the anchors and moieties of 15 co-crystallized ligands for TK
PDB code Compound structure Anchor
E1 H1 H2 V1
3vtk R-OH
1vtk R-OH
1p7c R-OH
1of1 R-OH
25
1ki6 R-OH
1ki8 R-OH
1e2k R-OH
1ki7 R-OH
1ki4 R-OH
1kim R-OH
1e2p R-OH
1ki3 R-OH Aromatic moiety
1ki2 R-OH Aromatic moiety
1qhi R-OH R-NH2 Aromatic moiety
2ki5 R-OH R-NH2 Aromatic moiety
Table 3.2. Comparing SiMMap with other methods on thymidine kinase and estrogen receptor by false-positive rates
True positive (%) Thymidine kinase (TK) Estrogen receptor (ER)
SiMMap DOCK a FlexX a GOLD a SiMMap DOCK a FlexX a GOLD a
80 6.3b 23.4 8.8 8.3 1.1 13.3 57.8 5.3
90 6.8 25.5 13.3 9.1 1.1 17.4 70.9 8.3
100 6.8 27 19.4 9.3 7.5 18.9 NA 23.4
a Summarized from Bissantz et al.17
identified by the SiMMap server often contain key pockets and moieties. To initially validate the anchors for biological mechanisms (e.g., ligand binding and catalysis mechanisms), we selected 15 TK and 22 ER co-crystallized ligands (Table 3.1 and Fig. 3.3). The corresponding moieties of these co-crystallized ligands were highly matched the anchors derived from 1000 docked compounds (10 known active ligands and 990 randomly selected compounds described
26
in Data sets). The site-directed mutagenesis shows that the conserved interacting residues of the anchors are often essential for ligand binding and catalysis mechanisms. For ER target, 22 ER co-crystallized ligands contain three consistent moieties that are hydroxyl group and aromatic rings (Fig. 3.3b). The hydroxyl group forms hydrogen bonds with R394 and E353 in H1, and the aromatic ring yields vdW contacts with L346, L387, and F404 in V2. The other consistent aromatic ring forms vdW contacts with L346, T347, and L525 in V1. These results show that an anchor is often a hot spot and involved in biological functions.
To provide initial validation of the SiMMap server for virtual screening, we selected TK, ER, and ERA with 1000 compounds as test sets. First, we compared the accuracies of SiMMap with those of GEMDOCK on these three targets based on true positive rates. SiMMap, combining anchor scores and docking energies (Equation 1), outperforms GEMDOCK on these cases. We then compared SiMMap with other three programs (DOCK, FlexX, and GOLD) on TK and ER sets. All approaches were tested using the same proteins and compound sets (Table 3.2). When the positive rate was 90%, the false positive rates were 6.8% (SiMMap), 25.5%
(DOCK), 13.3% (FlexX), and 9.1% (GOLD) for TK and were 1.1% (SiMMap), 17.4%
(DOCK), 70.9% (FlexX), and 8.3% (GOLD) for ER.
The compound, which agrees with anchors of the site-moiety map, is often able to activate or inhibit the target protein (Tables 3.1 and 3.3). In addition, the anchor score (i.e. AS(i) defined in Equation 1) of SiMMap can be used to reduce the ill-effect of the energy-based scoring methods which are often biased toward both the selection of high molecular weight compounds and charged polar compounds25-26. For example, according to the SiMMap scores (Equation 1), the top ranks of ER, MFCD0002206 (masoprocol) and MFCD00012748 were identified as the analogs of the active compounds (Table 3.3). The anchor score of SiMMap was helpful to reduce the highly polar compounds (e.g., MFCD00004690 and MFCD00013089 in ER) whose anchor scores are low. The anchor score of SiMMap can easily combine with other energy-based scoring functions.
3.4 Summary
The utility and feasibility of SiMMap method is demonstrated for statistically inferring the site-moiety map describing the relationship between the moiety preferences and physico-chemical properties of the binding site. The validation results show that the site-moiety map is
27
useful to reflect biological functions and identify active compounds from thousands of compounds. In addition, the site-moiety map can guide to assemble potential leads by optimal steric, hydrogen-bonding, and electronic moieties. We believe that the SiMMap serve is able to provide the biological insights of protein-ligand binding models, enrich the screening accuracy, and guide the processes of lead optimization.
Table 3.3. The mapping between the anchors and active and typical compounds for ER
Compound Structure GEMDOCK rank
SiMMap rank
SiMMap
score H1 V1 V2 V3
EST_01 2 1 4.239
EST_02 32 19 4.216
EST_03 28 16 4.217
EST_04 8 5 4.226
EST_05 6 4 4.228
EST_06 3 2 4.238
EST_07 21 13 4.218 Other
EST_08 10 7 4.225 Other Other
EST_09 30 20 4.216
EST_10 246 84 4.193
MFCD00002206 4 3 4.232 Other
MFCD00012748 17 11 4.221 Other
MFCD00004690 5 154 3.23 Other Other
MFCD00013089 25 617 2.218
Compound Structure GEMDOCK rank
SiMMap rank
SiMMap
score H1 V1 V2 V3
EST_01 2 1 4.239
EST_02 32 19 4.216
EST_03 28 16 4.217
EST_04 8 5 4.226
EST_05 6 4 4.228
EST_06 3 2 4.238
EST_07 21 13 4.218 Other
EST_08 10 7 4.225 Other Other
EST_09 30 20 4.216
EST_10 246 84 4.193
MFCD00002206 4 3 4.232 Other
MFCD00012748 17 11 4.221 Other
MFCD00004690 5 154 3.23 Other Other
MFCD00013089 25 617 2.218
28
Chapter 4
The application of site-moiety map for characterizing protein-ligand binding sites and discovering adaptive inhibitors for orthologous protein targets
4.1 Introduction
The expanding number of protein structures and advances in bioinformatics tools have offered an exciting opportunity for structure-based virtual screening methods in drug discovery42. Although there are some successful agents in the antibiotic development, few agents act at novel molecular sites to target multiple antibiotic–resistant pathogenic bacteria
43-44. However, the screening tools are often designed for one-target paradigm and the scoring methods are highly target-dependent and energy-based, and cannot persuasively identify true leads, which results in a lower success rate 2-3,45. To discover adaptive inhibitors of multiple targets is an emergent task for drug design46-48.
SiMMap modeling provides an in silico post-screening analysis to establish a target binding site-moiety map that comprises three crucial elements (conserved interacting residues, the moiety preference, and pocket-moiety interaction type [electrostatic (E), hydrogen-bonding (H), or van der Waals (V)])49. For structural-based virtual screening 1-2,9,45, this method offers a more efficient solution for drug discovery, particularly in the absence of a pharmacophore model describing the structure-activity relationship extracted from experiments.
We introduce the concept of orthSiMMap to represent the conserved binding environment elements or "hot spots" among orthologous targets. Identification of these consensus features in the orthSiMMap can then be used for identifying novel binding partners of orthologous targets.
We then developed a method to extract the conserved features of the ligand-binding environments of orthologous targets, thus establishing the orthSiMMap using structure-based virtual screening. We focused on shikimate kinase (EC 2.7.1.71), the fifth enzyme in the shikimate pathway that is present in bacteria, fungi, and plants, but not animals. This expression pattern makes shikimate kinase an attractive target for the development of new
29
antimicrobial agents, herbicides, and antiparasitic agents50. In using these models, six potent inhibitors with low IC50 values (<8.0 μM) were identified. Site-directed mutagenesis studies revealed that critical conserved interacting residues contribute to specific pocket-moiety interaction anchors (S15, D33, F48, R57, R116, and R132). These results illustrate a robust
antimicrobial agents, herbicides, and antiparasitic agents50. In using these models, six potent inhibitors with low IC50 values (<8.0 μM) were identified. Site-directed mutagenesis studies revealed that critical conserved interacting residues contribute to specific pocket-moiety interaction anchors (S15, D33, F48, R57, R116, and R132). These results illustrate a robust