CHAPTER 3. The Relevance of Protein-Ligand Interaction Profiles in Computer-Aided

3.1 INTRODUCTION

Identifying protein-ligand interaction networks on a proteome scale is crucial for addressing a wide range of biological questions, such as correlating molecular functions with physiological processes and designing safe and efficient target compounds for use in therapeutics, nutrition, cosmetics, skin care products, agriculture and industry. To understand the role and significance of protein-ligand interactions (Fig. 4) across applications in bioinformatics and biotechnology, the properties and functions of a ligand [42, 43] must be well understood. As seen previously, a ligand (e.g. vitamin D, Fig. 1) is a molecule, ion or atom that can bind to a specific location, the binding site, of a protein [39, 44].

Currently, antibodies are the most commonly used ligands in biotechnology and life-science investigations, although protein scaffolds (protein regulators), nucleic acids and peptides (short chains of amino acids) are also employed. Since protein-ligand complexes of various compounds are used in cosmetics, hair dyes, skin care products, fertilizers, detergents [29-31] and nutritional supplements [10], protein-ligand interaction profiles and physico-chemical features could be used to identify such lead compounds.

Figure 4. Protein-ligand binding interactions in beta-lactoglobulin (a transporter protein) complexed with vitamin D, viewed using Swiss-PdbViewer. a) Electrostatic potential and molecular surface. b) Hydrogen bond interactions among atoms (green dotted lines).

The ligand binding site of the primary target is extracted or predicted from a 3D experimental structure or homology model of the protein [35, 45] and characterized by a geometric potential. Protein-ligand interactions occur when a ligand binds to its cognate protein, and this binding is usually integral to the protein's function. In the binding of a ligand to a protein, the following interactions are significant: electrostatic forces (interactions between electrically charged particles, described by Coulomb's law), van der Waals forces (the sum of the attractive or repulsive forces between molecules or between parts of the same molecule) and hydrogen bonding (the attractive interaction of a hydrogen atom with an electronegative atom, which can occur inter- or intramolecularly) [39, 40]. Based on these interactions, evaluations are made using ligand-based approaches, commonly employed in pharmacophore modeling, which use the physical and chemical traits of known ligands to identify novel inhibitors. The alternative, receptor-based approach uses structural and other features of the target receptor to identify the best inhibitor.
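As a rough illustration of the first two interaction types, the sketch below evaluates a pairwise Coulomb energy and a van der Waals energy for two atoms. The 12-6 Lennard-Jones form is an assumption chosen here because it is a common model for van der Waals forces; the text does not say which functional form the cited docking tools use. The function names, the unit convention (kcal/mol, Å, elementary charges) and the example parameters are likewise illustrative.

```python
import math

# Coulomb constant in kcal*Angstrom/(mol*e^2), a common force-field convention.
# Assumption: this unit system is chosen only for illustration.
COULOMB_CONSTANT = 332.0637


def coulomb_energy(q1, q2, r, dielectric=1.0):
    """Electrostatic energy (kcal/mol) between charges q1 and q2 (in e) at distance r (in Angstrom)."""
    if r <= 0:
        raise ValueError("distance must be positive")
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def lennard_jones_energy(r, epsilon, sigma):
    """12-6 Lennard-Jones energy: epsilon is the well depth, sigma the zero-energy distance."""
    if r <= 0:
        raise ValueError("distance must be positive")
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


if __name__ == "__main__":
    # Hypothetical atom pair: opposite partial charges 3.0 Angstrom apart.
    print(coulomb_energy(0.4, -0.5, 3.0))
    print(lennard_jones_energy(3.8, epsilon=0.15, sigma=3.5))
```

Opposite charges give a negative (attractive) Coulomb term, and the Lennard-Jones term reaches its minimum of -epsilon at r = 2^(1/6) sigma.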

Docking [18, 26, 32, 33, 46] is then used to identify the fit between a receptor and a potential ligand by screening a database of ligands against one or more target receptors. It consists of two distinct parts: docking (the search scheme that identifies suitable conformations, or poses) and scoring (a measure of the affinity of the various poses). Scoring methods must discriminate between non-native docked conformations and correct binding states of compounds during the molecular docking phase, and must distinguish active compounds (usually a small number) from non-active compounds (an extremely large number) during post-docking analysis. Although over 60 docking programs and tools are available [24], we present some of the most popular publicly available programs (Table 1). DOCK [18], incremental construction (FlexX) [32] and evolutionary algorithms (GEMDOCK, GOLD, AutoDock) [26, 33, 46] are used to screen and downsize compound groups in order to select suitable candidates for post-screening analysis.

However, inconsistencies in the performance of scoring functions result in inadequate prediction of the true binding affinity of a ligand to a receptor; thus, combining various scoring methods in virtual screening (VS) may yield better performance than the average individual scoring function.

Similar inconsistencies have been noticed in information retrieval (IR). Charifson et al. [15] reported a study in which an interaction-based consensus approach to combining scoring functions improved the discrimination between active and inactive enzyme inhibitors. Studies by Bissantz et al. [3], Stahl and Rarey [11] and Verdonk et al. [16] showed that consensus scores further improved VS enrichment. However, the remaining issue for VS users, rather than researchers, is when and how these scoring functions should be combined in either drug design or industrial compound design.
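To make the idea of combining scoring functions concrete, the sketch below implements one simple consensus scheme, often called rank-by-vote: a compound receives one vote from each scoring function that places it within that function's top fraction. This scheme is an assumption chosen for illustration and is not necessarily the one used in the cited studies; the function name, the data layout and the convention that lower scores are better are all hypothetical.

```python
def rank_by_vote(scores_by_function, top_fraction=0.1):
    """Count, per compound, how many scoring functions rank it in their top fraction.

    scores_by_function maps a scoring-function name to a dict of
    compound -> score, where a lower (more negative) score is assumed better.
    """
    votes = {}
    for scores in scores_by_function.values():
        # Rank compounds from best (lowest score) to worst for this function.
        ranked = sorted(scores, key=scores.get)
        cutoff = max(1, int(len(ranked) * top_fraction))
        for compound in scores:
            votes.setdefault(compound, 0)
        for compound in ranked[:cutoff]:
            votes[compound] += 1
    return votes


if __name__ == "__main__":
    # Hypothetical scores from three scoring functions for four compounds.
    example = {
        "f1": {"c1": -9.0, "c2": -7.0, "c3": -5.0, "c4": -1.0},
        "f2": {"c1": -8.0, "c2": -2.0, "c3": -6.0, "c4": -3.0},
        "f3": {"c1": -8.0, "c2": -9.5, "c3": -7.0, "c4": -1.0},
    }
    print(rank_by_vote(example, top_fraction=0.5))
```

Compounds with the most votes are those on which the individual scoring functions agree, which is the property consensus approaches try to exploit.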

Docking program   URL                                                      Reference
DOCK              http://dock.compbio.ucsf.edu/                            [18]
FlexX             http://biosolveit.de/flexx/index.html?ct=1               [32]
AutoDock          http://autodock.scripps.edu/                             [46]
GEMDOCK           http://gemdock.life.nctu.edu.tw/dock/igemdock.php        [26]
GOLD              http://www.ccdc.cam.ac.uk/products/life_sciences/gold/   [33]

Table 1. Popular docking tools and evolutionary algorithms currently used in VS

Furthermore, certain VS methods can identify important interactions or binding-site hot spots from known active ligands and target proteins [17]. However, because of biases towards compounds with higher molecular weight and towards charged, polar compounds [18], docking alone is not sufficient to analyse, determine and retrieve the most suitable lead compounds. Post screening analyses are therefore emerging as useful methods for further eliminating false positive hits obtained from VS.

Methods for post-screening analysis that employ clustering to identify key features of docked compounds and to understand binding mechanisms are of great use in bioinformatics. Therefore, computer-aided drug and industrial target design requires VS as a primary step to generate interaction and structure profiles, followed by post screening analysis for adequate filtering, visualization and mining of the final candidates.

3.2 The Significance of Protein-Ligand Interaction Profiles in Methods of Compound Retrieval and Post Screening Analysis

Interactions between molecules (Fig. 4) are important for understanding many biological phenomena. From gene expression to enzyme reactions, biological activities are dictated by molecular interactions. Following the success of DNA microarrays, researchers are studying the protein counterpart in greater detail [47]. Protein microarrays can be used to study a variety of biological phenomena, such as protein-ligand, protein–protein, antibody–antigen and protein–DNA interactions, analysis of subunits in protein complexes, screening of target proteins expressed from phage libraries, analysis of mutant proteins, quantitative assays, discovery of diagnostic markers, analysis of protein expression profiles, development of diagnostic microarrays and development of microarray-based lead screening systems. The interactions of significance in the analysis and retrieval of lead compounds for drug design are intermolecular interactions such as van der Waals forces, electrostatic forces and hydrogen bonds [39, 40]. Their strengths, also called interaction energies, can be obtained from virtual screening calculations on docked compounds [13]. The calculated interaction energies are organized into data sets of interaction profiles (IPFs), which can be used as one of the criteria in a cluster analysis to further filter and select more specific, or final, target compounds. Thus, cluster analysis groups compounds with similar interaction energies into separate clusters, from each of which a representative is chosen, usually based on RMSD values, in what is termed a post screening analysis.
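The grouping of compounds by interaction profile can be sketched with a very simple greedy ("leader") clustering. This is a minimal illustration under stated assumptions, not the clustering procedure of any method discussed in this chapter: each profile is assumed to be a list of per-residue interaction energies, profiles are compared by Euclidean distance, and the first compound assigned to a cluster serves as its representative (the text notes that representatives are usually chosen by RMSD, which is not modelled here). All names and the threshold value are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long interaction profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leader_cluster(profiles, threshold):
    """Greedy clustering: join the first cluster whose leader is within threshold.

    profiles maps compound name -> list of per-residue interaction energies.
    Returns a list of (leader, members) tuples in order of creation.
    """
    clusters = []
    for name, profile in profiles.items():
        for leader, members in clusters:
            if euclidean(profile, profiles[leader]) <= threshold:
                members.append(name)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            clusters.append((name, [name]))
    return clusters


if __name__ == "__main__":
    # Hypothetical interaction profiles over three binding-site residues.
    ipfs = {
        "a": [-5.0, -1.0, 0.0],
        "b": [-4.8, -1.1, 0.0],
        "c": [0.0, -6.0, -2.0],
    }
    for leader, members in leader_cluster(ipfs, threshold=1.0):
        print(leader, members)
```

The result depends on the input order and on the threshold, which is one reason hierarchical clustering is often preferred in practice.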

3.2.1 Post Screening Analysis

Methods of post screening analysis [21-23] are designed to facilitate the visualization (interpretation of binding interactions), organization (clustering and organizing structures in a meaningful way), analysis (comparing and profiling the binding interactions of different structures) and data mining (searching for structures containing key interactions or specific features) of virtually screened compounds. As mentioned earlier, the binding interactions [39] (e.g. van der Waals forces, electrostatic forces and hydrogen bond interactions) of protein-ligand complexes are a critical part of mining and selecting the target representatives in post analysis methods.

Descriptions of binding interactions and interaction strength measures for protein-ligand complexes are very important for better mining of appropriate candidates from the selection lists generated by VS [48]. Through an in-depth study of protein-ligand interactions in various post screening analyses, we attempt to develop an integrated method of VS and post screening analysis in order to speed up the screening and analysis of compounds, generate better interaction-specific information and obtain suitable representatives. The overall details of this study are shown in Figure 5.

Figure 5. Methods from previous works investigated and our studies done in the designing of our TSCC method.

Below we investigate and compare a few pioneering methods of post screening analysis, all of which were originally designed to enrich virtual screening. Later in our work we perform comparative studies and inductive analyses that provide a foundation for expanding the use of virtual screening and post screening analysis into the mining and analysis of targets used in applications other than pharmaceutics.

3.2.2 Structural Interaction Fingerprint (SIFt)

SIFt [23] uses a simple, generic and robust approach for representing and analyzing 3D protein-ligand interactions. Its key feature is the generation of an interaction fingerprint that converts 3D structural binding information into a one-dimensional (1D) binary string (Fig. 6).

The fingerprint representation of the interaction patterns is compact and allows for rapid clustering and analysis of large numbers of complexes. The SIFt is calculated on a set of input 3D protein–small molecule complexes. The protein structure may have been determined experimentally by NMR or crystallography, or generated through homology modeling. The SIFt is generated by first defining the union of all residues that are in contact with the small molecule in any of the complexes. The resulting panel of ligand binding site residues, which acts as a mask covering all of the interactions occurring between the protein and the ligands, is then used as the common reference frame to construct the interaction fingerprints.

Figure 6. The 3D binding site of protein with an inhibitor (ligand) revealed as a sequence of positions in the binding site in contact with the ligand and their location in the structure of the protein (loop and β). Each binding site position is represented by a bitstring. The joining of all bitstrings end-to-end for each binding site residue is repeated for all ligands and is used in the selection process.
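The construction described above (union of contacting residues as a common reference frame, then one bit string per residue joined end-to-end) can be sketched as follows. The seven interaction types used as bit positions are an assumption for illustration; the text here does not list the exact bits, so they should be checked against the original SIFt publication [23]. The input layout and all names are hypothetical.

```python
# Assumed bit layout per binding-site residue (illustrative only).
INTERACTION_TYPES = (
    "contact", "backbone", "sidechain", "polar",
    "hydrophobic", "hbond_acceptor", "hbond_donor",
)


def build_sift(complexes):
    """Build interaction fingerprints over a shared residue reference frame.

    complexes maps ligand name -> {residue label -> set of interaction types}.
    Returns (residues, fingerprints): the ordered residue panel and a dict of
    ligand name -> flat list of 0/1 bits.
    """
    # Union of all residues contacted by any ligand: the common "mask".
    residues = sorted({res for contacts in complexes.values() for res in contacts})
    fingerprints = {}
    for ligand, contacts in complexes.items():
        bits = []
        for res in residues:
            observed = contacts.get(res, set())
            bits.extend(1 if kind in observed else 0 for kind in INTERACTION_TYPES)
        fingerprints[ligand] = bits
    return residues, fingerprints


if __name__ == "__main__":
    # Hypothetical contacts for two docked ligands.
    docked = {
        "lig1": {"LYS33": {"contact", "sidechain", "polar"}},
        "lig2": {"GLU81": {"contact", "backbone", "hbond_donor"},
                 "LYS33": {"contact"}},
    }
    panel, fps = build_sift(docked)
    print(panel)
    for name, bits in fps.items():
        print(name, "".join(map(str, bits)))
```

Because every ligand is encoded against the same residue panel, all fingerprints have equal length and can be compared bit by bit.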

To analyse SIFts, the Tanimoto coefficient (Tc) [38] is used as the quantitative measure of bit string similarity. The Tc between two bit strings A and B is defined as:

$$Tc(A,B) = \frac{|A \cap B|}{|A \cup B|}$$

where $|A \cap B|$ is the number of ON bits common to both A and B, and $|A \cup B|$ is the number of ON bits present in either A or B. Tanimoto coefficients between random bit strings with a length of 400 bits adopt a near-Gaussian distribution centered at approximately 0.33, with a sigma of about 0.03. Representing interactions as fingerprints with the SIFt method enables clustering, filtering and profiling of large libraries of docking results, as well as of crystal structures of the protein kinase family in complexes with various inhibitors.
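The Tanimoto coefficient is straightforward to compute on two equally long bit lists, as in the minimal sketch below. The function name is illustrative, and returning 0.0 when neither string has any ON bit is a convention assumed here, since the ratio is undefined in that case.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two equally long sequences of 0/1 bits."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    common = sum(1 for x, y in zip(a, b) if x and y)   # ON in both A and B
    either = sum(1 for x, y in zip(a, b) if x or y)    # ON in A or B
    # Assumed convention: two all-zero strings have similarity 0.0.
    return common / either if either else 0.0


if __name__ == "__main__":
    print(tanimoto([1, 1, 0, 0], [1, 0, 1, 0]))
```

As a plausibility check on the reported center of about 0.33: if each bit is assumed to be ON independently with probability 0.5, a position is ON in both strings with probability 0.25 and ON in at least one with probability 0.75, giving an expected ratio of roughly 1/3.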

3.2.3 VISCANA (Visualized Cluster Analysis of Protein-Ligand Interaction)

VISCANA [22] (Fig. 7) is a method based on the ab initio fragment molecular orbital (FMO) method [24], used for the analysis of virtual ligand screening. The ab initio FMO method at the Hartree-Fock level is described in detail after the method figure.

Figure 7. a) The overall approach of VISCANA (from VS to the selection of representatives). b) The fragmentation of a polypeptide at different bonds. c) Division of biomolecules into a collection of small fragments in the molecular orbital calculations (FMO method).

First, biomolecules or molecular clusters are divided into small fragments, and ab initio MO calculations are performed on the fragments (monomers) under the electrostatic potential from the surrounding fragments, as seen in Fig. 7b and c. The monomer calculations are repeated until all monomer densities become self-consistent. Calculations are then performed on the fragment pairs (dimers). Finally, using the total energies of the monomers, E_I, and of the dimers, E_IJ, the total energy of the system, E, is calculated by the following equation:

$$E = \sum_{I>J}^{N} E_{IJ} - (N-2)\sum_{I}^{N} E_{I}$$

where N is the number of fragments.
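The standard two-body FMO expression for the total energy, E = Σ_{I>J} E_IJ − (N − 2) Σ_I E_I with N fragments, can be evaluated with a few lines of code once the monomer and dimer energies are available. The sketch below assumes those energies have already been computed by a quantum chemistry package; the function name and the data layout are hypothetical.

```python
def fmo2_total_energy(monomer_energies, dimer_energies):
    """Two-body FMO total energy from monomer and dimer total energies.

    monomer_energies: list of E_I, one per fragment.
    dimer_energies: dict mapping fragment index pairs (I, J), I < J, to E_IJ.
    """
    n = len(monomer_energies)
    if len(dimer_energies) != n * (n - 1) // 2:
        raise ValueError("expected one dimer energy per fragment pair")
    # E = sum over pairs of E_IJ minus (N - 2) times the sum of E_I.
    return sum(dimer_energies.values()) - (n - 2) * sum(monomer_energies)


if __name__ == "__main__":
    # Hypothetical energies (arbitrary units) for a three-fragment system.
    monomers = [-1.0, -2.0, -3.0]
    dimers = {(0, 1): -3.5, (0, 2): -4.2, (1, 2): -5.1}
    print(fmo2_total_energy(monomers, dimers))
```

The same value is obtained by adding the pair interaction energies E_IJ − E_I − E_J to the sum of monomer energies, which is the form that makes the interfragment contributions explicit.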

The FMO method has the advantage of describing the charge-transfer between a receptor and a ligand in comparison to a conventional force field method using fixed atomic charges.

Based on this principle Amari et al. developed a cluster analysis using the dissimilarity defined as the squared Euclidean distance between interfragment interaction energies (IFIEs) of two ligands. VISCANA combines a clustering method with a graphical representation of the IFIEs by representing each data point with colors that quantitatively and qualitatively reflect the IFIEs.
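The dissimilarity used for this cluster analysis, the squared Euclidean distance between the IFIE vectors of two ligands, can be computed as in the sketch below. Each ligand is assumed to be described by a list of IFIEs, one per receptor fragment and in the same fragment order for every ligand; the names and example values are hypothetical.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equally long IFIE vectors."""
    if len(u) != len(v):
        raise ValueError("IFIE vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(u, v))


def dissimilarity_matrix(ifies):
    """Pairwise dissimilarities; ifies maps ligand name -> list of IFIEs."""
    names = list(ifies)
    return {(a, b): squared_euclidean(ifies[a], ifies[b]) for a in names for b in names}


if __name__ == "__main__":
    # Hypothetical IFIEs of three ligands with three receptor fragments.
    ifies = {
        "L1": [-10.0, -2.0, 0.5],
        "L2": [-9.0, -2.0, 1.5],
        "L3": [0.0, -8.0, 0.0],
    }
    matrix = dissimilarity_matrix(ifies)
    print(matrix[("L1", "L2")], matrix[("L1", "L3")])
```

A matrix of this kind is the usual input to a hierarchical clustering routine, and its rows can also be colour-coded to give the kind of visual representation the method is named for.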

This method classifies structurally different ligands into functionally similar clusters according to the interaction pattern of a ligand and amino acid residues of a receptor protein. VISCANA also estimates docking conformation by analyzing patterns of the receptor-ligand interactions of some conformations through the docking calculations. VISCANA could be applied not only to the FMO method but also any molecular interaction system which can provide interaction energies or other properties of interest such as charge distribution.

3.2.4 iGEMDOCK: A Graphical Environment for Recognizing Pharmacological Interactions and Virtual Screening

iGEMDOCK (Fig. 8) is an extension of the original docking tool GEMDOCK developed by Yang et al. [26], which adds a post screening analysis method to the original docking algorithm (http://gemdock.life.nctu.edu.tw/dock/igemdock.php). It uses GEMDOCK's two key functions for VS: 1) the searching algorithm [49] and 2) the scoring function [50], which is based on an empirical energy function:

$$E_{tot} = E_{bind} + E_{pharma} + E_{ligpre}$$

where E_bind is the empirical binding energy, E_pharma is the energy of binding site pharmacophores (hot spots), and E_ligpre is a penalty value applied if a ligand does not satisfy the ligand preferences. E_pharma and E_ligpre are especially helpful in selecting active compounds from hundreds of thousands of non-active compounds by excluding ligands that violate the characteristics of known active ligands, thereby improving the selection of true positives.

Figure 8. The virtual screening and post screening analysis processes in iGEMDOCK

The integration of different-stage programs of VS environments into GEMDOCK constituted the emergence of iGEMDOCK for docking, virtual screening and post screening analysis of database compounds using a friendly interface. In post-screening analysis iGEMDOCK enriches the hit rate and derives pharmacological interactions from screened compounds to provide biological insights. The pharmacological interactions represent conserved interacting residues which form binding pockets with specific physico-chemical properties expressing the essential functions of the target protein.

This new algorithm provides both virtual screening and post screening analysis, as well as a more detailed and complete understanding of ligand binding mechanisms, which makes the study and discovery of lead compounds much easier and less time consuming than with other similar post screening analyses. iGEMDOCK builds on the efficiency of GEMDOCK, which was able to mine various inhibitors such as aurintricarboxylic acid tetracycline derivatives which inhibit flaviviruses [6] and influenza virus neuraminidase inhibitors [8].

3.3 Summary

Methods of post screening analysis that enhance virtual screening enrichment and retrieve target compounds more accurately are of great use and interest in current bioinformatics. In this review we summarized and compared methods of VS and post screening analysis of lead compounds which emphasize the relevance of interaction profiles in mining suitable candidates.

SIFt (structural interaction fingerprint) is one of the pioneering methods in post screening analysis to encode interaction-specific information into binary strings. This enables the visualization, organization, analysis and retrieval of structures containing key interactions or specific features. A combination of SIFt and ChemScore (an empirical scoring function) contributed to a modest increase in the enrichment factor (EF), which was calculated based on the ability to recover known inhibitors. The EF increased from 37.0 (SIFt) to 42.3 (SIFt + ChemScore) [23].
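For readers unfamiliar with the enrichment factor, the sketch below computes it under a commonly used definition: the fraction of actives among the top-ranked compounds divided by the fraction of actives in the whole library. This definition is an assumption; the exact formula and cutoff used in [23] are not given in this chapter. The function name and the example library are hypothetical.

```python
def enrichment_factor(ranked_compounds, actives, top_fraction):
    """Enrichment factor of a ranked list at a given top fraction.

    ranked_compounds: compound names ordered from best to worst score.
    actives: collection of names of known active compounds.
    """
    actives = set(actives)
    n_total = len(ranked_compounds)
    n_top = max(1, int(round(n_total * top_fraction)))
    hits_top = sum(1 for c in ranked_compounds[:n_top] if c in actives)
    total_actives = sum(1 for c in ranked_compounds if c in actives)
    if total_actives == 0:
        raise ValueError("the library contains no known actives")
    # (active rate in the selected subset) / (active rate in the whole library)
    return (hits_top / n_top) / (total_actives / n_total)


if __name__ == "__main__":
    # Hypothetical library of 100 ranked compounds with four known actives.
    library = ["c%d" % i for i in range(100)]
    known_actives = {"c0", "c1", "c50", "c99"}
    print(enrichment_factor(library, known_actives, top_fraction=0.1))
```

An EF of 1 corresponds to random selection, so larger values indicate that known actives are concentrated near the top of the ranked list.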

VISCANA (Visualized Cluster Analysis of Protein-Ligand Interaction) uses a different approach through the FMO method. Compared with a conventional force field method using fixed atomic charges, it has the advantage of describing the charge transfer between a receptor and a ligand. VISCANA differs from conventional screening methods in that most methods select compounds purely by their docking-score rank, whereas in VISCANA a compound with a low docking score may belong to the same cluster as active compounds and could therefore still be a suitable candidate. However, Amari et al. stated in their study that VISCANA needs further development of quantum mechanical methods (second-order Møller-Plesset perturbation theory based on the FMO method) to obtain more reliable descriptions of van der Waals interactions and hydrogen bonds, which are important in determining receptor-ligand binding [22]. Other post screening studies reveal that unreliable or insufficient descriptions of important interactions account for increased numbers of false positives [48].

iGEMDOCK, an integration of VS and post screening methods, is based on the original evolutionary docking algorithm GEMDOCK and is currently one of the pioneering methods for combining VS with the visualization, organization, analysis and data mining of lead compounds. Its advantage over SIFt and VISCANA lies primarily in its attempt to eliminate two key issues: 1) if a docking tool is used for VS, which post screening analysis complements it best, and 2) if a post screening analysis method has been chosen, which docking tool or VS method is most suitable. The post screening approach of iGEMDOCK differs from that of the other methods (VISCANA and SIFt) in its use of a module that clusters compounds based on interaction profiles and atomic compositions. Selecting representative compounds from each cluster maintains compound diversity and reduces the number of false positives. In addition, its pharmacological scoring function can reduce the ill effect of energy-based scoring functions, which often favor high-molecular-weight or highly polar compounds. This improves the screening accuracy when the molecular weights of the active compounds are less than 400 Daltons (Da) [52]. Most notably, GEMDOCK, the earlier version of iGEMDOCK, was used successfully to screen and identify inhibitors of influenza virus neuraminidases and flaviviruses [6, 8].

We also emphasize the use of VS and post screening analysis in the mining of novel compounds for various other applications (e.g. industry, agriculture, cosmetics and nutritional supplements). These areas have received little attention in comparison to drug design, even though certain protein-ligand complexes constitute key compounds in developing various biochemical products [29-31]. VS and post screening analysis as used in computer-aided drug design show great potential in such applications, since prospective candidates for cosmetics and other industries may be retrieved using interaction profiles.

Although the methods investigated in this study, SIFt, VISCANA and iGEMDOCK, employ different techniques (structural interaction fingerprints, the ab initio FMO method and interaction energy modules), they have one common feature: the use of protein-ligand interaction profiles, which can be further exploited in developing new and improved methods to retrieve and analyze potential candidates for drug design and other applications. Better techniques, measures and descriptions of interaction energies can aid methods of novel compound retrieval and analysis and improve the accuracy of selecting active compounds. In addition, these observations point to an important aspect of computer-aided drug design and discovery: the necessity for more than one stage of clustering in post screening analysis. From this point we proceeded to develop our new method, Two-Stage Combinative Clustering (TSCC) [48], which combines our specifically optimized docking tool (GEMDOCK) with two stages of clustering for an optimized post screening analysis.

CHAPTER 4

TSCC: Two-Stage Combinative Clustering for Virtual Screening Using Protein-ligand Interactions and Physical-Chemical Features

4.1 Introduction

Continuous advancements in high-throughput X-ray crystallography and genomics [2, 28] account for increased numbers of available crystal structures, enabling more rapid development of new therapeutic targets. However, prospective ligands and proteins need to be screened in order to downsize groups [22, 23, 53] and select suitable candidates for post-screening analysis.

Clustering methods based on structural similarity which are employed in post-screening analysis generally improve the scoring function performance. In developing methods for 3D compound retrieval, a detailed understanding of intermolecular interactions between proteins and their ligands is critical to structure-based inhibitor design. Various post-screening analysis methods and clustering [23, 54-56] employ RMSD values, protein-ligand interactions and computation and comparison platforms for measuring distances. Since the above methods as well as TSCC encounter challenges of specific selectivity and false positives, we aim to provide advantages to our post screening analysis method by using two combined clustering stages to rank all compounds and select final representatives more efficiently and accurately. The final representatives can be confirmed through bioassays to verify their target and the proper activity and application.

Although similar methods (IBAC, SIFt and VISCANA) [21-23] have used visualization and clustering of compounds to enrich VS, they have not identified novel compounds for any practical applications (drug design or industrial purposes). In addition, with the use of such
