Thesis overview - Prediction of protein-ligand binding modes and characterization of their binding regions

Chapter 1. Introduction

1.2 Thesis overview

Several studies addressing the above issues have been reported (Fig. 1.1). Three of our related studies are briefly described in Chapter 2. The study of the pharmacophore-based scoring function proposed a target-specific scoring function that uses the protein-ligand interactions and physicochemical properties of known actives to improve the accuracy and precision of ranking VS data (Fig. 1.1a). The studies of consensus scoring and cluster analysis addressed the improvement of enrichment at the post-screening analysis stage (Fig. 1.1b). Furthermore, we applied these methods to inhibitor discovery for the dengue virus E protein and the influenza virus neuraminidase. Although some novel inhibitors were discovered in this work, we still found drawbacks in these previous studies. First, the pharmacophore-based scoring function is limited by the consensus of known active compounds. Second, the consensus scoring criteria and cluster analysis help improve the enrichment of VS, but these methods do not use the protein-ligand interaction data and ligand structures produced in the VS process to investigate the key environment of the protein-ligand binding site.

Figure 1.1. Overview of structure-based drug design and related works. The major steps of structure-based drug design include (a) virtual screening and (b) post-screening analysis followed by bioassay.

To address these issues, we developed the SiMMap approach, described in Chapter 3, which infers the key features through a site-moiety map describing the relationship between moiety preferences and the physicochemical properties of the binding site (Fig. 1.1b). Further application and validation of SiMMap are presented in Chapter 4. To our knowledge, SiMMap is the first public server that identifies the site-moiety map from a query protein structure and its docked (or co-crystallized) compounds. The server characterizes a binding site by pocket-moiety interaction preferences (anchors), each comprising a binding pocket with conserved interacting residues, moiety preferences, and an interaction type.

In Chapter 4, we further extend SiMMap to orthologous SiMMap. We derive orthologous site-moiety maps (orthologous SiMMap) by identifying the consensus binding environments of orthologous proteins. An orthologous SiMMap represents the conserved binding environments, or "hot spots", among orthologous targets; it is used to investigate the protein-ligand interface family and to discover potential leads across multiple species.

Finally, Chapter 5 presents conclusions and future perspectives.

The research framework of this thesis is shown in Figure 1.2. The concept behind the pharmacophore-based scoring function is that the consensus of known active compounds identifies the key features of the binding site. However, such an approach requires known active compounds and favors compounds similar to the known set. To address these limitations, we extract the consensus of screened compounds to characterize the binding site and further validate the approach on the inhibitor discovery of orthologous shikimate kinases.

[Figure 1.2 panels]
Pharmacophore-based scoring function (from consensus of known active compounds): 1. Pharmacological interactions (e.g., hot spots); 2. Ligand preferences (e.g., charged and polar).
Site-moiety map (from consensus of screened compounds to characterize binding site): 1. Pockets with conserved interacting residues; 2. Moiety composition; 3. Pocket-moiety interaction type.
Orthologous site-moiety map (conserved environments of orthologous targets): 1. Consensus physicochemical properties; 2. Consensus moiety preferences.
Future work: Pathdrug.

Figure 1.2. The research framework for predicting protein-ligand binding modes and characterizing protein-ligand binding sites in structure-based drug design.

Chapter 2 Related works

Virtual screening (VS) of molecular compound libraries has emerged as a powerful and inexpensive method for the discovery of novel lead compounds for drug development^2-3 (Fig. 2.1). VS involves two critical elements: efficient molecular docking and a reliable scoring method. Scoring methods for VS should effectively discriminate between correct binding states and non-native docked conformations during the molecular docking phase, and distinguish a small number of active compounds from hundreds of thousands of inactive compounds during post-docking analysis. The scoring functions that estimate the binding free energy are mainly knowledge-based¹², physics-based¹³, and empirical¹⁴ scoring functions.

In addition, some of these VS methods can identify the so-called “pharmacological preferences”, which are often the important interactions or binding-site hot spots derived from known active ligands and the target protein^21-22 (Fig. 2.1b). These preferences can improve screening accuracy and guide the design and selection of lead compounds for subsequent investigation and refinement during lead discovery and lead optimization. However, deriving the pharmacological preferences for a protein target and its corresponding ligands requires prior bioassay or structural data.

Currently, the screening quality of docking methods that use energy-based scoring functions alone is often influenced by the molecular weight and the structure of the ligand being screened (e.g., the numbers of charged and polar atoms) (Fig. 2.2). These methods are often biased toward selecting both high-molecular-weight compounds (due to the contribution of compound size^28-29) and charged polar compounds (due to the pair-atom potentials of the electrostatic and hydrogen-bonding energies).

[Figure 2.1 panels: compound databases; prepare target protein and compound database; virtual screening / molecular docking (GEMDOCK, GOLD, DOCK, etc.); virtual screening results from multiple scoring methods; post-screening analysis (select representatives and improve hits or predicted ligand conformations); bioassay and identification of active ligands. Plot axes: g (Pl/Ph) and average false positive rate (%).]

Figure 2.1. Main procedure of structure-based virtual screening. (a) The major steps of structure-based virtual screening, including virtual screening, post-screening analysis, and bioassay. (b) Pharmacophore-based scoring function for the virtual screening step. The post-screening analysis step, which is usually used to improve the screening results, includes (c) consensus scoring and (d) cluster analysis.

Meanwhile, the performance of these scoring functions is often inconsistent across different systems in database searches^17-18. The inaccuracy of the scoring methods, i.e., their inadequate prediction of the true binding affinity of a ligand for a receptor, is probably the major weakness of VS. Furthermore, applying VS^2,30 to the drug discovery process invariably produces a large number of potential lead candidates. These prospective ligands need to be filtered to reduce their number before more precise and labor-intensive studies.

Hence, post-screening analysis is urgently needed to minimize the number of false positives in the selection list and to promote the true hits to the top of the list (Fig. 2.1a, 2.1c, and 2.1d).

[Figure 2.2 panels: (a) A: ESA01 (-91.32), B: ESA01-CH₃ (-76.86), C: ESA01-COO⁻ (-99.64); (b) ESA01 and EST03 with groups A and B, shown with residue atoms R394-NH2, E353-OE2, and H524-ND1.]

Figure 2.2. The influence of ligand structure and molecular weight on docking energy. (a) The fraction of polar atoms in ESA01-CH₃ is the smallest among these three ligands, whereas that of ESA01-COO⁻ is the largest. The docked positions are similar, but the docking energies differ: -91.32 for ESA01, -76.86 for ESA01-CH₃, and -99.64 for ESA01-COO⁻. (b) ESA01 (blue) and EST03 (yellow) share a common group A, and EST03 has an additional substructure, group B. The docked conformations (in reference protein 3ert) are similar, and the docking energies are -82.82 for ESA01 and -127.27 for EST03.

It has been reported that fusing different scoring methods in VS improves performance and that, on average, the combined method performs better than the average of the individual scoring functions^15,18-20,31. These results are significant and potentially robust in that the performance of these consensus scoring (CS) methods seems to be independent of the target receptor and the docking algorithm. However, the reported results do seem to depend on the method of combination (by rank, by score, by intersection, by MIN, by MAX, or by voting) and on the number and nature of the individual scoring functions involved in the combination. While researchers have come to recognize the benefit of method combination and consensus scoring, the question of how and when these individual scoring functions should be combined remains challenging, not only for researchers but also, perhaps more importantly, for practitioners in virtual screening.
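As an illustration of two of the combination methods named above (by rank and by score), the following minimal Python sketch averages ranks or normalized scores across scoring functions. It is not the procedure used in the cited studies: the function names, the min-max normalization, the tie-breaking by compound name, and the example energies are all assumptions made for illustration, and lower raw scores are assumed to be better, as with docking energies.

```python
from typing import Dict, List


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale raw scores to [0, 1] so that 1 is the best compound.

    Assumption: lower raw values are better (docking energies).
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {c: 1.0 for c in scores}
    return {c: (hi - s) / (hi - lo) for c, s in scores.items()}


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank compounds from 1 (lowest energy) upward; ties broken by name."""
    ordered = sorted(scores, key=lambda c: (scores[c], c))
    return {c: i + 1 for i, c in enumerate(ordered)}


def rank_combination(methods: List[Dict[str, float]]) -> List[str]:
    """Order compounds by average rank across methods (smaller is better)."""
    per_method = [ranks(m) for m in methods]
    avg = {c: sum(r[c] for r in per_method) / len(per_method) for c in methods[0]}
    return sorted(avg, key=lambda c: (avg[c], c))


def score_combination(methods: List[Dict[str, float]]) -> List[str]:
    """Order compounds by average normalized score (larger is better)."""
    per_method = [normalize(m) for m in methods]
    avg = {c: sum(s[c] for s in per_method) / len(per_method) for c in methods[0]}
    return sorted(avg, key=lambda c: (-avg[c], c))


# Hypothetical docking energies from two scoring functions.
method_a = {"c1": -90.0, "c2": -80.0, "c3": -70.0, "c4": -60.0}
method_b = {"c1": -50.0, "c2": -95.0, "c3": -60.0, "c4": -55.0}

print(rank_combination([method_a, method_b]))
print(score_combination([method_a, method_b]))
```

Rank combination discards the magnitude of the score differences, whereas score combination keeps it, which is one reason the two can order the same compounds differently.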

Another frequently used technique for post-screening analysis is cluster analysis. Clustering methods based on compound structural similarity or interaction profiles can group VS data, reduce the complexity of observation, and improve the performance of the scoring function^32-34. Through cluster analysis, the enormous amount of data produced by the VS process can be easily visualized and efficiently handled. However, most researchers consider the descriptors of protein-ligand interactions or of compound structures only individually. Combining protein-ligand interactions with compound topology could provide more detailed and purer classifications for subsequent biological assays and refinement. Some related studies are therefore briefly introduced in the following sections (Fig. 2.1b and 2.1d).

[Figure 2.3 panels: (a) Antagonists; (b) Agonists. Labeled residue atoms: D351-OD1, E353-OE2, R394-NH2, H524-ND1. Circled groups: A, B, C, A’, B’.]

Figure 2.3. The binding-site pharmacological consensuses are identified by overlapping the docked conformations of (a) 10 known ER antagonists and (b) 10 known ER agonists against the reference proteins 3ert and 1gwr, respectively. (a) Four pharmacological interactions were identified and circled as A (phenolic hydroxyl group), B (phenolic hydroxyl group), and C (piperidine nitrogen). (b) Three pharmacological interactions were identified and circled as A (phenolic hydroxyl group) and B (phenolic hydroxyl group). The dashed lines indicate the hydrogen bonds formed between the ligand and the target protein. These pharmacological interactions are consistent with those derived from X-ray structures.

2.1 Pharmacophore-based scoring functions

The screening quality of docking methods using energy-based scoring functions alone is often influenced by the molecular weight and the structure of the ligand being screened (e.g., the numbers of charged and polar atoms). These methods are often biased toward both the selection of high molecular weight compounds (due to the contribution of the compound size^28-29) and charged polar compounds (due to the pair-atom potentials of the electrostatic energy and hydrogen-bonding energy).

A pharmacophore-based evolutionary approach for virtual screening was developed to address these issues. This tool, termed the Generic Evolutionary Method for molecular DOCKing (GEMDOCK), combines an evolutionary approach^23,35-37 with a new pharmacophore-based scoring function. The former integrates discrete and continuous global search strategies with local search strategies to speed up convergence. The latter integrates an empirical energy function with pharmacological preferences (binding-site pharmacological interactions and ligand preferences, shown in Fig. 2.3) and serves as the scoring function for both molecular docking and post-docking analysis to improve screening accuracy (Fig. 2.4). We apply pharmacological-interaction preferences to select the ligands that form pharmacological interactions with the target protein, and use the ligand preferences to eliminate the ligands that violate the electrostatic or hydrophilic constraints. We assessed the accuracy of our approach using the human estrogen receptor (ER) and a ligand database from the comparative studies of Bissantz et al.¹⁷ Using GEMDOCK, the average goodness-of-hit (GH) score was 0.83 and the average false positive rate was 0.13% for ER antagonists, and the average GH score was 0.48 and the average false positive rate was 0.75% for ER agonists. The performance of GEMDOCK was superior to that of competing methods such as GOLD and DOCK. We found that our pharmacophore-based scoring function is indeed able to reduce the number of false positives; moreover, the resulting pharmacological interactions at the binding site, as well as the ligand preferences, are important for assigning confidence to the results of virtual screening experiments. These results suggest that GEMDOCK constitutes a robust tool for virtual database screening.
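The GH score and false positive rate quoted above can be made concrete with a short calculation. The sketch below assumes the commonly used goodness-of-hit definition, GH = [Ah(3A + Th) / (4·Th·A)] × [1 − (Th − Ah)/(T − A)], and a false positive rate of (Th − Ah)/(T − A); the text does not restate which definitions were applied, so both formulas, the function names, and the example hit list are assumptions for illustration. Only the database size (10 actives among 1000 compounds) follows the data set described in this chapter.

```python
def _check(total: int, actives: int, hits: int, active_hits: int) -> None:
    """Validate the counts shared by both metrics."""
    if not (0 < actives < total):
        raise ValueError("need 0 < actives < total")
    if not (0 < hits <= total):
        raise ValueError("need 0 < hits <= total")
    if not (0 <= active_hits <= min(hits, actives)):
        raise ValueError("active_hits out of range")
    if hits - active_hits > total - actives:
        raise ValueError("more false positives than inactive compounds")


def gh_score(total: int, actives: int, hits: int, active_hits: int) -> float:
    """Goodness-of-hit score (assumed definition, see lead-in).

    total       T:  compounds in the database
    actives     A:  known active compounds in the database
    hits        Th: compounds in the hit list
    active_hits Ah: known actives found in the hit list
    """
    _check(total, actives, hits, active_hits)
    yield_term = active_hits * (3 * actives + hits) / (4 * hits * actives)
    penalty = 1 - (hits - active_hits) / (total - actives)
    return yield_term * penalty


def false_positive_rate(total: int, actives: int, hits: int, active_hits: int) -> float:
    """Fraction of inactive compounds that appear in the hit list."""
    _check(total, actives, hits, active_hits)
    return (hits - active_hits) / (total - actives)


# Hypothetical hit list: 20 compounds selected, 8 of the 10 actives recovered.
print(gh_score(1000, 10, 20, 8))
print(false_positive_rate(1000, 10, 20, 8))
```

A hit list that contains exactly the known actives and nothing else gives a GH score of 1 and a false positive rate of 0 under these definitions.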

[Figure 2.4 flowchart. Main flow: prepare drug database and prepare target protein; molecular docking; post-docking analysis. Mining/aided flow: superimpose X-ray or predicted ligand conformations of known active compounds; mine ligand preferences; mine binding-site pharmacological consensus.]

Figure 2.4. The main steps of GEMDOCK for virtual database screening, including the target protein and compound database preparation, flexible docking, and post-docking analysis. GEMDOCK mines a pharmacological consensus from the target protein and known active ligands when available.

2.2 Consensus scoring criteria

The performance of these scoring functions is often inconsistent across different systems in database searches^18,31. The inaccuracy of the scoring methods, i.e., their inadequate prediction of the true binding affinity of a ligand for a receptor, is probably the major weakness of VS. It has been demonstrated that combining multiple scoring functions (consensus scoring) improves the enrichment of true positives. Previous efforts at consensus scoring have largely focused on empirical results, and have yet to provide a theoretical analysis that gives insight into the real features of combination and data fusion for VS.

We explore consensus scoring (CS) criteria and provide a consensus scoring procedure for improving enrichment in VS, using data fusion and exploiting the diversity of scoring characteristics between individual scoring functions (Fig. 2.5). In particular, we demonstrate that combining multiple scoring functions improves the enrichment of true positives only if (a) each of the individual scoring functions has relatively high performance, and (b) the scoring characteristics of the individual scoring functions are quite different (Fig. 2.6). These two prediction variables also indicate whether rank combination or score combination performs better. Moreover, our second criterion (b), which uses the rank/score characteristics as the measure of scoring diversity, is independent of the performance of the individual scoring functions. It is therefore very useful in practical VS settings in which the performance of an individual scoring function (as in criterion (a)) is unknown or cannot be evaluated at that point. We have developed a novel CS system, available online at http://gemdock.life.nctu.edu.tw/dock/download.php, which was tested with five scoring systems and two evolutionary docking algorithms on four targets: thymidine kinase (TK), human dihydrofolate reductase (DHFR), and the estrogen receptor (ER) for antagonists and agonists (Fig. 2.7). Our procedure is computationally efficient, adaptable to different situations, and scalable to a large number of compounds and a greater number of combinations. The experimental results show a fairly significant improvement in goodness-of-hit (GH) scores, false positive (FP) rates, and enrichment factors over the average individual performance. This approach has practical utility in cases where the basic tools are known or believed to be generally applicable but specific training sets are absent.
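The two criteria above can be sketched numerically. In the following Python sketch, the rank/score function of a scoring method is taken to be its normalized scores listed in rank order, the diversity between two methods is taken as the root-mean-square difference between their rank/score functions, and the relative performance is the ratio Pl/Ph of the lower to the higher individual performance. These concrete definitions, the function names, and the example values are assumptions chosen for illustration; the exact normalization behind g(R/Svar) and g(Pl/Ph) in Fig. 2.6 is not reproduced here.

```python
import math
from typing import List, Sequence


def rank_score_function(energies: Sequence[float]) -> List[float]:
    """Normalized scores listed from rank 1 (best) to the last rank.

    Assumption: lower energies are better, so they map to scores near 1.
    """
    lo, hi = min(energies), max(energies)
    if hi == lo:
        return [1.0] * len(energies)
    return sorted(((hi - e) / (hi - lo) for e in energies), reverse=True)


def rank_score_diversity(energies_a: Sequence[float], energies_b: Sequence[float]) -> float:
    """Root-mean-square difference between two rank/score functions (criterion b)."""
    fa, fb = rank_score_function(energies_a), rank_score_function(energies_b)
    if len(fa) != len(fb):
        raise ValueError("both methods must score the same number of compounds")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa))


def performance_ratio(perf_a: float, perf_b: float) -> float:
    """Pl/Ph: lower performance divided by higher performance (criterion a)."""
    low, high = sorted((perf_a, perf_b))
    if high <= 0:
        raise ValueError("performance values must be positive")
    return low / high


# Hypothetical energies for the same four compounds from two scoring functions.
a = [-90.0, -80.0, -70.0, -60.0]
b = [-95.0, -60.0, -55.0, -50.0]
print(rank_score_diversity(a, b))
print(performance_ratio(0.8, 0.4))
```

Because the diversity measure uses only the scores themselves, it can be computed without knowing which compounds are active, which is what makes criterion (b) usable when individual performance cannot be evaluated.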


Figure 2.5. Rank/score curves of five methods for four virtual screening targets: (a) TK, (b) DHFR, (c) ER-antagonist receptor, and (d) ER-agonist receptor.

[Figure 2.6 panels. Axes: g (R/Svar), g (Pl/Ph), CSindex, and GH score improvement; legend: RCS, SCS.]

Figure 2.6. The relationships between the GH-score improvement with (a) normalized value of variance of rank/score graph and (b) normalized value of Pl/Ph of 40 pairing combinations of five methods for four virtual screening targets. (c) The GH-score improvements with normalized variances of rank/score graphs (R/Svar) and normalized relative performance measurement (Pl/Ph) of 40 RCS and SCS pairing combinations of five methods for four virtual screening targets. (d) The positive and negative GH-score improvements are denoted with circle and cross, respectively.

Figure 2.7. The known active ligands of the four VS targets: estrogen receptor (ER) (a) antagonists and (b) agonists, (c) thymidine kinase (TK), and (d) human dihydrofolate reductase (DHFR). The ligand data set from the comparative studies of Bissantz et al.¹⁷ was used to evaluate the screening accuracy of different CS methods on TK, DHFR, ER, and ERA. For each target protein, the ligand database included 10 known active compounds and 990 random compounds.

2.3 Combinative clustering analysis

The increasing numbers of 3D compounds and protein complexes stored in databases contribute greatly to current advances in biotechnology and are employed in a wide range of pharmaceutical and industrial applications. However, screening and retrieving appropriate candidates, as well as handling false positives, remain a challenge for all post-screening analysis methods used to retrieve therapeutic and industrial targets.

Using the combinative clustering method (Fig. 2.8), virtually screened compounds were first clustered based on their protein-ligand interactions, and structure clustering employing physicochemical features was then performed to retrieve the final compounds. Based on the protein-ligand interaction profile (first stage), docked compounds can be clustered into groups with distinct binding interactions. Structure clustering (second stage) then groups the compounds obtained from the first stage into clusters of similar structures, and the lowest-energy compound from each cluster is selected as a final candidate. By representing interactions at the atomic level and including measures of interaction strength (Fig. 2.9), better descriptions of protein-ligand interactions and a more specific analysis of virtual screening were achieved. The two-stage clustering approach enhanced our post-screening analysis by accurately clustering, mining, and visualizing compound candidates, thus improving virtual screening enrichment.

Figure 2.8. Overall process of the two-stage combinative cluster analysis. (a) First-stage clustering using protein-ligand interactions generated via GEMDOCK. (b) Second-stage clustering of the first-stage results using physicochemical features. The lowest-energy conformations are selected as representative structures.

Figure 2.9. Designing a reference threshold for the protein-ligand interaction and atom-pair descriptors. The complementarity between the atom-pair descriptor and the protein-ligand interaction descriptor is also shown in this figure. The distance threshold of the atom-pair descriptor was 0.55 (Tanimoto coefficient). The distance threshold of the protein-ligand interaction descriptor was 0.39 (correlation coefficient).
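The two-stage procedure can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the greedy leader clustering, the `Docked` record, the string-encoded atom pairs, and the example compounds are assumptions, and the thresholds 0.39 and 0.55 are read here as distances (1 minus the correlation coefficient and 1 minus the Tanimoto coefficient, respectively), which is one interpretation of the values reported for Fig. 2.9.

```python
import math
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Sequence


@dataclass(frozen=True)
class Docked:
    name: str
    energy: float                    # docking energy, lower is better
    interactions: Sequence[float]    # hypothetical protein-ligand interaction profile
    atom_pairs: FrozenSet[str]       # hypothetical atom-pair descriptor


def correlation_distance(a: Docked, b: Docked) -> float:
    """1 - Pearson correlation between two interaction profiles."""
    x, y = a.interactions, b.interactions
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    if sx == 0 or sy == 0:
        return 1.0  # a constant profile carries no correlation information
    return 1 - cov / (sx * sy)


def tanimoto_distance(a: Docked, b: Docked) -> float:
    """1 - Tanimoto coefficient between two atom-pair sets."""
    union = a.atom_pairs | b.atom_pairs
    if not union:
        return 0.0
    return 1 - len(a.atom_pairs & b.atom_pairs) / len(union)


def leader_cluster(items: List[Docked],
                   distance: Callable[[Docked, Docked], float],
                   threshold: float) -> List[List[Docked]]:
    """Greedy clustering: join the first cluster whose leader is close enough."""
    clusters: List[List[Docked]] = []
    for item in items:
        for cluster in clusters:
            if distance(cluster[0], item) <= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters


def two_stage_candidates(compounds: List[Docked],
                         interaction_threshold: float = 0.39,
                         structure_threshold: float = 0.55) -> List[Docked]:
    """Stage 1: interaction clustering; stage 2: structure clustering;
    then keep the lowest-energy compound of every final cluster."""
    ordered = sorted(compounds, key=lambda c: c.energy)
    candidates = []
    for group in leader_cluster(ordered, correlation_distance, interaction_threshold):
        for sub in leader_cluster(group, tanimoto_distance, structure_threshold):
            candidates.append(min(sub, key=lambda c: c.energy))
    return candidates


# Hypothetical docked compounds.
compounds = [
    Docked("a", -100.0, (1.0, 0.9, 0.0, 0.1), frozenset({"C-N-2", "C-O-3", "N-O-4"})),
    Docked("b", -95.0, (0.9, 1.0, 0.1, 0.0), frozenset({"C-N-2", "C-O-3", "N-O-5"})),
    Docked("c", -90.0, (1.0, 0.8, 0.0, 0.0), frozenset({"S-O-2", "S-N-3"})),
    Docked("d", -85.0, (0.0, 0.1, 1.0, 0.9), frozenset({"C-N-2", "C-O-3", "N-O-4"})),
]
print([c.name for c in two_stage_candidates(compounds)])
```

In this example, compound d has the same atom pairs as compound a but a different interaction profile, so it survives as its own candidate; compound c interacts like a but is structurally distinct, so it is also kept. This is the complementarity between the two descriptors that the combined analysis is meant to exploit.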

2.4 Summary

As the number of protein structures increases rapidly, structure-based drug design and virtual screening approaches are becoming important and helpful in lead discovery^1-2,6. A number of docking and virtual screening methods^16,23-24,35 have been used to identify lead compounds, and some success stories have been reported^1-2,4-5,7-8,10. However, identifying lead compounds from thousands of docked protein-compound complexes is still a challenging task. The major weakness of virtual screening is likely an incomplete understanding of ligand binding mechanisms and the resulting imprecision of scoring algorithms. In the related works, several studies were proposed to improve the accuracy and precision of the VS process. First, the scoring function of GEMDOCK derives pharmacological preferences from a number of known active ligands to take advantage of the similarity of a putative ligand to those known to bind to a protein’s active site, thereby guiding the docking of the putative ligand. Second, for the post-screening analysis, a consensus scoring strategy that uses data fusion and exploits the diversity of scoring characteristics between individual scoring functions was proposed to improve VS. Third, when a huge amount of VS data needs to be interpreted, combinative cluster analysis is applied to mine representatives effectively and to visualize the VS data easily. Although we have successfully applied these methods to VS studies of two important virus targets, dengue virus and influenza virus, some shortcomings remain to be addressed.

Chapter 3 Site-moiety map for recognizing interaction preferences between protein pockets and compound moieties

3.1 Introduction

Most docking programs^16,23-24 use energy-based scoring methods, which are often biased toward selecting high-molecular-weight compounds and charged polar compounds in the top ranks. Moreover, these approaches generally cannot identify the key features (e.g., pharmacophore spots) that are essential to trigger or block the biological responses of the target protein. Although pharmacophore techniques²⁷ have been applied to derive the key features, these methods require a set of known active ligands acquired experimentally. Therefore, more powerful post-screening analysis techniques that identify the key features from docked compounds and help elucidate binding mechanisms would be of great value for drug design.

To address these issues, we presented the SiMMap method to infer the key features by a
