Chapter 2: Pharmacophore Identification
2.1 A Pharmacophore-Based Evolutionary Approach for Screening Selective Estrogen Receptor
2.1.1 Introduction
Virtual screening (VS) of molecular compound libraries has emerged as a powerful and inexpensive method for the discovery of novel lead compounds for drug development 25; 26. Given the structure of a target protein active site and a potential small ligand database, VS predicts the binding mode and the binding affinity for each ligand and ranks a series of candidate ligands. There are four main reasons for the rapid acceptance and success of VS: 1) The availability of the growing number of protein crystal structures; 2) The advent of structural proteomics technologies; 3) The enrichment and speed of VS 25; 52; and 4) The contribution of VS to the reduction in the cost of drug discovery. VS generally encompasses four phases based both on high-throughput molecular docking methods and the crystal structures of the target protein. These include target protein preparation, compound database preparation, molecular docking, and post-docking analysis 25. The molecular docking method screens the compound library to find lead compounds for the target protein, whereas post-docking analysis enriches the hit rate and optimizes the confirmed lead molecules through structure-activity relationship 53.
The VS computational method involves two basic critical elements: efficient molecular docking and a reliable scoring method. A molecular docking method for VS should be able to screen a large number of potential ligands with reasonable accuracy and speed. The many molecular docking approaches that have been developed can be roughly categorized as rigid docking 54, flexible ligand docking 3; 5, and protein flexible docking. Most current VS methods employ flexible docking tools, such as incremental and fragment-based approaches (DOCK 55 and FlexX 3) and evolutionary algorithms (GOLD 5, AutoDock 56, and GEMDOCK 2).
Scoring methods for VS should effectively discriminate between correct binding states and non-native docked conformations during the molecular docking phase and distinguish a small number of active compounds from hundreds of thousands of non-active compounds during the post-docking analysis. The scoring functions that calculate the binding free energy mainly include knowledge-based 57, physics-based 58, and empirical-based 59 scoring functions. The performance of these scoring functions is often inconsistent across different systems from a database search 1; 60. It has been proposed that combining multiple scoring functions (consensus scoring) improves the enrichment of true positives 1; 60.
While the field of VS may be maturing 25; 26; 52, and many good VS methods have been proposed, the promise of the virtual compound library 61 to rapidly increase the number of candidate ligands demands further improvement in terms of the computational efficiency of flexible docking
37
algorithms 3; 5; 56. In addition, some VS methods are capable of identifying so-called
“pharmacological preference” that is often the important interactions or binding-site hot spots typically evolved from known active ligands and the target protein 62; 63. These preferences might improve screening accuracy and guide the design and selection of lead compounds for subsequent investigation and refinement during lead discovery and lead optimization processes. Finally, the screening quality of docking methods using energy-based scoring functions alone is often influenced by the molecular weight and the structure of the ligand being screened (e.g., the numbers of charged and polar atoms). These methods are often biased toward both the selection of high molecular weight compounds (due to the contribution of the compound size 64; 65) and charged polar compounds (due to the pair-atom potentials of the electrostatic energy and hydrogen-bonding energy).
To address the above issues, we developed a new VS method, termed GEMDOCK (Generic Evolutionary Method for molecular DOCKing), modified from our previous studies 2; 66. GEMDOCK is an evolutionary-based approach, which was applied in some fast VS algorithms 5; 56. Our approach uses multiple operators (e.g., discrete and continuous genetic operators) that cooperate using family competition (similar to a local search procedure) to balance exploration and exploitation.
Like some VS methods 63; 67; 68, GEMDOCK evolves the pharmacological preferences from a number of known active ligands to take advantage of the similarity of a putative ligand to those that are known to bind to a protein’s active site, thereby guiding the docking of the putative ligand.
However, unlike existing pharmacophore-based docking methods, we developed and incorporated a new scoring function that evolves a pharmacological consensus (e.g., hot spots) and ligand preferences using the target protein and known active ligands. This scoring function not only serves as the basis for molecular docking but also ranks the screened ligands prior to post-docking analysis by reducing the deleterious effect of certain structural features within some of the ligands.
While GEMDOCK is generally applicable, in particular it has been validated by its application to the docking of a number of selective estrogen receptor modulators (SERMs) that are of great interest in cancer chemotherapy as well as estrogen replacement therapy in postmenopausal women69; 70; 71. To evaluate the strengths and limitations of GEMDOCK and to compare it with several widely used methods (DOCK, GOLD, and FlexX), we evaluated the screening utility of GEMDOCK by testing human estrogen receptor (ER) with the ligand data set, as proposed by Bissantz et al. 1 We also accessed whether our new scoring function was applicable to both the molecular docking and ligand scoring during virtual screening. The screening performance of GEMDOCK on this ligand data set is superior to that of the best available methods, and the docking accuracy is also comparable. Thus, GEMDOCK constitutes a rapid method that reduces the number of false positives during the screening of large databases when both pharmacological interactions and ligand preferences are mined from known active compounds. When known active ligands were not available, the screening accuracy of GEMDOCK is somewhat influenced and is comparable to that of comparative methods on this ligand data set.
38
2.1.2 Materials and Methods
GEMDOCK was modified and enhanced from our previous tool 2 for VS (Figure 2.1.1).
GEMDOCK can be sequentially applied to prepare target proteins and ligand databases, predict docked conformations and binding affinity using flexible ligand docking, and rank a series of candidates for post-docking analysis. In this section, we give details of the ligand database and target protein preparations, outline the scoring function used in this study, describe details of mining binding-site pharmacological interactions (e.g., hot spots) and ligand preferences, and briefly describe the docking method.
Prepare drug database
Prepare target protein
Flexible docking
Post-docking analysis Known active
ligands
Mine ligand consensus
Mine bind-site pharmacological
consensus Mining pharmacological
consensus
: Main flow : Mining/aided flow Superimpose X-ray or
overlay predicted ligand conformations
Figure 2.1.1. The main steps of GEMDOCK for virtual database screening, including the target
protein and compound database preparation, flexible docking, and post-docking analysis.GEMDOCK mines a pharmacological consensus from the target protein and known active ligands when available.
Preparations of ligand databases and target proteins
SERMs exert their physiological effects by binding to the two currently known estrogen receptors (ERα or ERβ), which are members of the nuclear receptor superfamily of ligand-dependent transcription factors; moreover, SERMs display tissue-selective estrogen agonistic or antagonistic profiles 69; 70; 71. SERMs often beneficially affect the cardiovascular and central nervous systems and exert significant estrogen-like effects on some estrogen targets such as bone, lipid, breast, and uterine cells. Despite of the benefits of SERMs, long-term treatment with SERMs is often limited by
39
intolerable side effects, such as benign and malignant uterine lesions. Therefore, the design of new SERMs has become a challenging task.
We used the ligand data set and initial ligand conformation from the comparative studies of Bissantz et al. 1 (e.g., DOCK, FlexX, and GOLD) to evaluate the screening accuracy of GEMDOCK using the ER antagonists. The ligand data set included the 10 known active compounds (EST01–10) listed in Figure 2.1.2 and 990 randomly chosen compounds from the Available Chemical Directory (ACD). The data set is available on the Web at http://gemdock.life.nctu.edu.tw/dock/download.php.
For screening ER agonists, a set of 10 known ER agonists (Figure 2.1.3, ESA01–10) used in this study was identical to that reported earlier 72. In total, the database used for screening ligands against the ER-antagonist complex (PDB code 3ert 71) and ER-agonist complex (PDB code 1gwr 73) contained 1,000 molecules; that is, 990 random compounds were the same for the two screens. In addition, three ER-antagonist complexes (PDB codes: 1err, 3ert, and 1hj1) and four ER-agonist complexes (PDB codes: 1gwr, 1l2i, 1qkm, and 3erd) with experimentally determined X-ray structures from the PDB were selected to evaluate not only the docking accuracy but also the pharmacological consensuses evolved from known active ligands (i.e., Figures 2.1.2 and 2.1.3) and reference proteins (Figure 2.1.4). Each ligand from the PDB was represented systematically by four characters followed by three characters. For example, in the ligand “3ert.OHT”, “3ert” denotes the PDB code and “OHT” is the ligand code in the PDB. These ligand structures are shown in Figure
2.1.2 (e.g., EST01, EST02, and EST03) and Figure 2.1.3 (e.g., ESA01, ESA02, ESA03, and ESA04).
The ER-antagonist complex (PDB code: 3ert) and ER-agonist complex (PDB code: 1gwr) were selected as reference proteins for virtual screening. These complexes were reasonable choices because their ligand-binding cavities are wide enough to accommodate a broad variety of ligands and therefore did not require binding site modifications. As shown in Figure 2.1.4, the structures of these two reference proteins complexed with tamoxifen (3ert) or estradiol (1gwr) show that both ligands bind at the same site within the core of the ligand-binding domain and that each ligand induces a different conformation of helix 12 (H12). Comparison of the structures of these two complexes reveals that the H12 (blue) sits above the ligand-binding cavity in the ER-agonist complex (1gwr), thereby forming a lid. In contrast, the side chains of antagonists (e.g., tamoxifen and raloxifene) in the ER-antagonist complexes prevent the agonist-like-induced conformational change of H12 (green), projecting out of the ligand-binding pocket. When preparing the size and location of the ligand-binding site, we considered the protein atoms located less than 10 Å from each ligand atom.
The metal atoms were retained and all structured water molecules were removed from the active site.
GEMDOCK then assigned a formal charge and atom type for each protein atom based on our previous study 2.
40
Figure 2.1.2. Ten known ER antagonists are studied with respect to evolving the pharmacological
consensus and docking against the ER-antagonist complex. Three ligands, EST01–03, are obtained from the PDB and each ligand is denoted by four characters followed by three characters, as in the PDB (e.g., 3ert.OHT, “3ert” denotes the PDB code and “OHT” is the ligand name in the PDB).O
Figure 2.1.3. Ten known ER agonists are docked against the ER-agonist complex (PDB code 1gwr),
and the pharmacological consensus is evolved. Four ligands, ESA01–04, are obtained from the PDB and each ligand is represented by four characters followed by three characters in the PDB.41
H12 (3ert)
H12 (1gwr)
L L
Figure 2.1.4. Comparing the binding sites of the ER reference proteins by superimposing the
complexes of the ER agonists (yellow, PDB code: 1gwr) and ER antagonists (blue, PDB entry: 3ert).The bound ligands (estradiol and tamoxifen are shown in red. In the ER-agonist complex, helix 12 (H12) (blue) sits above the ligand-binding cavity, forming a lid. H12 in the ER-antagonist complex protrudes from the pocket.
-6 -1 4 9 14 19
Distance (r)
Energy
V6
V5
V3 V4 V2
V1
20 -0.4 6.0 4.5 3.6 3.3 Steric (van-der)
20 -2.5 3.6 3.1 2.6 2.3 H-bond (polar)
V6 V5 V4 V3 V2 V1 Interactive type
Electrostatic
Steric H-bond
Figure 2.1.5. The linear energy function of pair-wise atoms for steric interactions (light line),
hydrogen bonds (bold line), and electrostatic potential in GEMDOCK.42
Scoring function
We developed a new scoring function that simultaneously serves as the scoring function for both molecular docking and the ranking of screened compounds for post-docking analysis. This function consists of a simple empirical binding score and a pharmacophore-based score to reduce the number of false positives. The energy function can be dissected into the following terms:
ligpre pharma
bind
tot
E E E
E
= + + (2.1.1)where Ebind is the empirical binding energy, Epharma is the energy of binding site pharmacophores (hot spots), and Eligpre is a penalty value if a ligand does not satisfy the ligand preferences. Epharma and
E
ligpre are especially useful in selecting active compounds from hundreds of thousands of non-active compounds by excluding ligands that violate the characteristics of known active ligands, thereby improving the number of true positives. The values of Epharma and Eligpre are determined according to the pharmacological consensus derived from known active compounds and the target protein. In contrast, the values of Epharma and Eligpre are set to zero if active compounds are not available.The empirical-binding energy (Ebind) is given as
penal intra
inter
bind
E E E
E
= + + (2.1.2)where Einter and Eintra are the intermolecular and intramolecular energies, respectively, and Epenal is a large penalty value if the ligand is out of the range of the search box. For our present work, Epenal was set to 10,000. The intermolecular energy is defined as
∑∑ ( )
factor that converts the electrostatic energy into kilocalories per mole. The lig and pro denote the numbers of the heavy atoms in the ligand and receptor, respectively.F ( ) r
ijBij is a simple atomic pair-wise potential function (Figure 2.1.5), as defined in our previous study 2 where rijBijis the distance between atoms i and j with interaction type Bij formed by pair-wise heavy atoms between ligands and proteins, Bij is either a hydrogen bond or a steric state. In this atomic pair-wise model, these two potentials are calculated by the same function form but different parameters, V1, . . . , V6given in Figure 2.1.5. The energy value of a hydrogen bonding should be larger than that for steric potential. In this model, atoms are divided into four different atom types 2: donor, acceptor, both, and nonpolar. A hydrogen bond can be formed by the following pair-atom types: donor-acceptor (or acceptor-donor), donor-both (or both-donor), acceptor-both (or both-acceptor), and both-both. Other pair-atom combinations are used to form the steric state. We used the atom formal charge to
43
calculate the electrostatic energy 2, which is set to 5 or −5, respectively, if the electrostatic energy is more than 5 or less than −5. These parameters, V1 to V6, and the maximum electrostatic energy were refined according to the docking accuracies of our previous work 2 on a highly diverse dataset of 100 protein-ligand complexes proposed by Jones et al.5
The intramolecular energy of a ligand is
( ) [ ( ) ]
and dihed is the number of rotatable bonds in a ligand. We followed the work of Gehlhaar et al. 59 to
set the values of A, m, and θ
0. For the sp3-sp
3 bond, A = 3.0, m = 3, and θ
0 = π; for the sp3-sp
2 bond,
A = 1.5, m = 6, and θ
0 = 0.
Mining pharmacological consensuses
GEMDOCK evolves the binding-site pharmacological consensus and ligand preferences from both known active ligands and the target protein to improve screening accuracy. We used the premise that previously acquired interactions (hot spots) between ligands and the target protein can be used to guide the selection of lead compounds for subsequent investigation and refinement. When known active ligands were available, GEMDOCK used a pharmacophore-based scoring function (Equation
2.1.1). On the other hand, LP
elec and LPhb were set to zero and GEMDOCK used a purely empirical-based scoring function (Equation 2.1.2) if known active compounds were not available.For each known active ligand, GEMDOCK first yielded 5 docked ligand conformations by docking the ligand into the target protein, and only the docked ligand conformation with the lowest energy was retained for pharmacological consensus analysis. The protein-ligand interactions were extracted by overlapping these lowest-energy docked conformations, and the interactions were classified into two different types, including hydrogen bonding and hydrogen-charged interactions.
After all of the protein-ligand interactions were calculated, the atom interaction-profile weight of the target protein representing the pharmacological consensus of a particular interaction was given as
N
where N is the number of known active compounds and fjkis the total interaction number of an atom j (in a protein) interacting with an atom of known active ligands with the interaction type k (e.g., hydrogen bonding or hydrogen-charged interactions). In this work, an atom j (in a protein) was considered to interact with an atom i (in a ligand) if the distance between the atoms j and i ranges from (V1+ V2)/2 to (V3+ V4)/2, where V1, . . . , V4 are given in Figure 2.1.5. An atom j in the reference protein was considered a hot-spot atom when Qkj was more than 0.5.
44
The pharmacophore-based interaction energy (Epharma) between the ligand and the protein is calculated by summing the binding energies of all hot-spot atoms:
( ) ( )
where CW(Bij) is a pharmacological-weight function of a hot-spot atom j with interaction type Bij,( ) r
ijBijF
is defined as in Equation 2.1.3, lig is the number of heavy atoms in a screened ligand, and hs is the number of hot-spot atoms in the protein. The CW(Bij) is given asQj is the atomic pharmacological-profile weight (Equation 2.1.5) and k is the interaction type of the hot-spot atom j.
We evolved the ligand preferences (Eligpre) from known ligands to reduce the deleterious effects of screening ligand structures that are rich in charged or polar atoms. Docking methods using energy-based scoring functions are often biased toward such compounds, which abound with charged and polar atoms (i.e., hydrogen donor or acceptor atoms) because the pair-atom potential of the electrostatic energy and hydrogen bonding energy is always larger than the steric energy. For example, the atomic pair-wise potential energies of the electrostatic, hydrogen bond, and steric potential were set to −5, −2.5, and −0.4 in this work. The ligand preference (Eligpre) is a penalty value for those screened ligands that violate the electrostatic or hydrophilic constraints. The Eligpre is given as
hb elec
ligpre
LP LP
E
= + (2.1.8)where LPelec and LPhb are the penalties for the electrostatic (i.e., the number of charged atoms of a screened ligand) and hydrophilic (i.e., the fraction of polar atoms in a screened ligand) constraints, respectively. LPelec is defined as
elec
, NAelec is the number of charged atoms of a screened ligand and UBelec is the upper bound number of charged atoms derived from known active compounds. θelec is the maximum number of charged atoms among known active compounds, and σelec is the standard derivation of the charged atoms of known active compounds. LPhb is defined as
45
, rhb is the fraction of polar atoms (i.e., the atom type is both, donor, or acceptor) in a screened ligand and Urhb is the upper bound of the fraction of polar atoms calculated from known active ligands.
NA
hb and NAt are the number of polar atoms and the total number of the heavy atoms of a screened ligand, respectively. θhb and σhb are the maximum ratio and the standard derivation of the ratios of polar atoms evolved from known ligands, respectively.In order to reduce the deleterious effects of biasing toward the selection of high molecular weight compounds, we formulate a normalization strategy defined as
⎪⎪
where Ebind
is the empirical binding energy (Equation 2.1.2), NA
t is the total number of the heavy atoms in a screened ligand, and μmw is the mean of the number of heavy atoms in known active compounds. When the normalization strategy is applied, the energy function (Equation 2.1.1) is given asHere, we present the outline of our molecular docking method that is a generic evolutionary method enhanced from our original technique 2. The core idea of our evolutionary approach was to design multiple operators that cooperate using the family competition model, which is similar to a local search procedure. The rotamer-based mutation operator, a discrete operator, is used to reduce the search space of ligand structure conformations. The Gaussian and Cauchy mutations, continuous genetic operators, search the orientation and conformation of the ligand relating to the center of the target protein.
After the ligand database and the target protein were prepared and the pharmacological preferences were evolved, we first specified the crystal coordinates of the protein atoms from the PDB and assigned a formal charge and atom type for each protein atom. GEMDOCK then automatically decides the search cube of a binding site based on the maximum and minimum values of coordinates among these selected protein atoms. For each ligand in the database, the GEMDOCK takes the atomic coordinates from the ligand database and assigns a formal charge and atom type for each atom. It then sequentially predicts the binding conformation and estimates the binding affinity for
46
each ligand. Finally, GEMDOCK ranks these docked ligand conformations for use in the post-docking analysis.
Our docking method works as follows: It randomly generates a starting population with N docked structures by initializing the orientation and conformation of the ligand relating to the center of the target protein. Each solution is represented as a set of three n-dimensional vectors (xi
, σ
i, ψ
i), where n is the number of adjustable variables of a docking system and i = 1, . . ., N where N is the populationOur docking method works as follows: It randomly generates a starting population with N docked structures by initializing the orientation and conformation of the ligand relating to the center of the target protein. Each solution is represented as a set of three n-dimensional vectors (xi