Organization of Thesis - A Study of Data Mining and Post-Screening Analysis Methods for Compounds with Multifaceted Biochemical Applications

Chapter 1. Introduction

1.3 Organization of Thesis

This thesis is organized as follows. In Chapter 2 we describe related studies and similar methods for mining and analyzing prospective compound candidates from virtual databases, along with their advantages and shortcomings. In Chapter 3 we perform an in-depth study of protein-ligand interaction profiles and present novel concepts, obtained from our investigations, for possible future applications of virtual screening and post screening analysis in areas such as cosmetics, nutrition, industry and agriculture. In Chapter 4 we describe our core work, the development of Two-Stage Combinative Clustering (TSCC) and its improvement over one-stage post screening analysis methods. Chapter 5 concludes our studies and outlines prospects for future work. The model for this research is presented in Figure 2.

Figure 2. The overall research process for investigating interaction profiles and their role in identifying suitable methods for lead compound retrieval and their applications

Chapter 2

2.1 Related Studies

Virtual screening (VS) followed by post screening analysis is a common technique for mining and analyzing compound candidates retrieved from databases for use in pharmaceutics or other applications. VS uses docking tools (e.g. DOCK, GEMDOCK or GOLD) [19, 20, 26] to screen compound databases and rank compounds according to their binding energies. Compound databases store solved crystal structures (Figure 1) of chemically significant compounds which can be used in various studies (e.g. drug design, nutrition and other industries) [6, 9, 10, 29 – 31]. VS and docking are followed by post screening analyses using clustering methods (SIFt and VISCANA) [22, 23], which aim to reduce the number of false positives obtained from VS and to move true positives to the top of the selection list.

2.1.1 The emergence of Post Screening Analysis

In the early days of computer-aided drug design, docking programs were the only means of screening compounds for potential use in drug design. Although many critical factors were poorly understood at the time, in particular the mechanisms of ligand binding, VS was still a major accomplishment: it advanced drug design and discovery by offering faster and more practical preliminary approaches than bioassays based on biochemical methods. Traditional settings, in addition to requiring an extensive period of time to study various properties and ultimately make a drug available, incurred overwhelming expenses through the use of conventional biochemical compounds, facilities and specimens. With the advent of computer-aided drug design, more of the preliminary work is done in virtual labs, and bioassays are applied only once the desired results are obtained, in order to confirm those preliminary results.

Most docking programs [19, 20, 26] use energy-based scoring methods, which are often biased towards selecting high molecular weight compounds and charged polar compounds (Fig. 3). As a result, they have difficulty identifying key features (e.g. hot spots) essential to target protein responses, and the performance of these scoring functions is mostly inconsistent when conducting a database search [3, 11]. The inability of various scoring methods to adequately predict the true binding affinity of a ligand for a receptor is a major weakness of VS. Moreover, employing VS [2, 3] in computer-aided drug design usually results in a large number of selected compounds, of which only a few are potential or suitable candidates.

Thus, it is imperative that a post screening analysis is conducted in order to reduce the number of false positives in the selection lists generated from VS and to propagate true hits to the top of the selection lists.

Figure 3. The biased ranking of compounds in virtual screening (molecular docking). The unknown compounds MFCD00012401 (green) and MFCD00013358 (teal green) are ranked much higher than vitamin D3 (ranked 816) because of their energy and molecular weight. However, among all compounds listed in this table, only vitamin D3 is known to bind to the target protein (β-LG) [66, 67].

2.1.2 Interaction-Based Accuracy Classification (IBAC)

IBAC is an approach developed by Kroemer et al. [21] for assessing the correctness of docking conformations. It first calculates the RMS deviation (RMSD) of the predicted pose from the crystal structure and then compares the predicted pose with the experimentally observed pose. Using IBAC, Kroemer et al. optimized the binding site definitions and docking protocols for the six VS programs used in their studies (FlexX [32], GOLD [20], ICM [33], LigandFit [34], NWU [35, 36] and QXP [37]). They executed docking runs and reported details of the ligand tautomeric forms and bond orders, and how RMSDs from crystal structures correlated with interaction-based accuracy classifications. Kroemer et al. concluded that RMSD values alone cannot identify correct poses, and that binding modes should be examined further for specific interactions when assessing pose prediction accuracy. Through their work, interaction profiles emerged as a foundation for interaction and binding studies of protein-protein and protein-ligand complexes.
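The two quantities contrasted above can be illustrated with a minimal Python sketch. It assumes the atoms of both poses are already matched one-to-one and expressed in the same coordinate frame (no superposition is performed). The function names and the contact labels are hypothetical, and the interaction comparison is a simplified stand-in, not the classification scheme of Kroemer et al.

```python
import math


def pose_rmsd(predicted, reference):
    """RMSD between two poses given as equal-length lists of (x, y, z) atoms."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(predicted, reference)
    )
    return math.sqrt(squared / len(predicted))


def interactions_recovered(predicted_contacts, crystal_contacts):
    """Fraction of crystal-structure contacts that the predicted pose reproduces."""
    crystal = set(crystal_contacts)
    if not crystal:
        raise ValueError("the crystal structure must have at least one contact")
    return len(crystal & set(predicted_contacts)) / len(crystal)


# Hypothetical two-atom pose: one atom is displaced by 2 units along z.
predicted_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
crystal_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(pose_rmsd(predicted_pose, crystal_pose))

# Hypothetical contact labels: a low RMSD does not guarantee that key contacts are kept.
print(interactions_recovered({"hbond:LYS69"}, {"hbond:LYS69", "vdw:PHE105"}))
```

Reporting both numbers side by side reflects the point made above: a pose can have a small RMSD and still miss a specific interaction that matters for binding.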

2.1.2 Structural Interaction Fingerprint (SIFt)

SIFt [23] uses a simple, generic and robust approach for representing and analyzing 3D protein-ligand interactions. Its key feature is the generation of an interaction fingerprint that converts 3D structural binding information into a one-dimensional (1D) binary string (Figure 9).

The fingerprint representation of the interaction patterns is compact and allows for rapid clustering and analysis of large numbers of complexes. The SIFt is calculated on a set of input 3D protein–small molecule complexes. To analyse SIFts, the Tanimoto coefficient (Tc) [38] is used as the quantitative measure of bit string similarity.
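As an illustration of the similarity measure, the sketch below computes the Tanimoto coefficient of two binary fingerprints: the number of bits set in both strings divided by the number of bits set in either. The fingerprints are made-up 7-bit strings, the function name is hypothetical, and returning 0.0 for two all-zero fingerprints is an assumed convention.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length bit strings such as '1101000'."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a == "1" and b == "1")
    either = sum(1 for a, b in zip(fp_a, fp_b) if a == "1" or b == "1")
    # Assumed convention: two fingerprints with no bits set have similarity 0.
    return both / either if either else 0.0


# Two bits are shared and four bits are set in at least one fingerprint.
print(tanimoto("1101000", "1001100"))  # 0.5
```

Because the coefficient ignores positions where both bits are 0, complexes are compared only on the interactions that at least one of them actually makes.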

This representation of interactions as fingerprints using the SIFt method enables clustering, filtering and profiling of large docking results libraries and crystal structures of the protein kinase family in complexes with various inhibitors. Although SIFt opened a broad road for post screening analysis, much of the road is still unpaved and difficult to travel in terms of methods used currently in post screening analysis.

2.1.3 Visualized Cluster Analysis of Protein-Ligand Interaction

VISCANA [15], which stands for Visualized Cluster Analysis of Protein-Ligand Interaction, is a method based on the ab initio Fragment Molecular Orbital (FMO) method [24] and was proposed by Amari et al. for virtual ligand screening. They developed a cluster analysis that defines dissimilarity as the squared Euclidean distance between the interfragment interaction energies (IFIEs) of two ligands. In VISCANA, a clustering method is combined with a graphical representation of the IFIEs in which each data point is shown in a color that quantitatively and qualitatively reflects the IFIE. The method is claimed to classify structurally different ligands into functionally similar clusters according to the interaction pattern between a ligand and the amino acid residues of a receptor protein. VISCANA also estimates the docking conformation by analyzing the receptor-ligand interaction patterns of several conformations obtained from the docking calculations.

However, as stated by Amari et al. in their study, VISCANA lacks sufficient descriptions of van der Waals forces and hydrogen bond interactions, which play an important role in receptor-ligand binding [39, 40]. This may account for the selection of false positives and the failure to select true hits or active compounds. The method aims to increase VS enrichment; however, it does not provide significant improvements over SIFt or extend to further uses in drug design and discovery or other possible applications.

2.1.4 A New Hierarchical Clustering Approach for Large Compound Libraries: NIPALSTREE

NIPALSTREE is an approach by Bocker et al. [25] for clustering large data sets of virtual compounds in a high-dimensional space. It projects the data onto the first principal component (PC), computed with NIPALS (nonlinear iterative partial least squares), and splits the data set at a point i or j (points where two neighbors exceed a predefined distance threshold T). The procedure is applied recursively to the resulting subsets until the maximal distance between cluster members falls below a user-defined threshold. NIPALSTREE uses PCA for hierarchical clustering as follows. A d-dimensional descriptor matrix is projected onto the first PC. Based on the score vector S, the descriptor matrix is sorted in ascending order and split at the median position, i.e., two equally large descriptor sets, from now on termed the “left” and “right” submatrices, are created. This is repeated for the new subsets until the maximum distance between the entries in a submatrix falls below a predefined similarity threshold (Θ). To judge the quality of a clustering result, an index is introduced to assess whether molecules interacting with the same target (receptor or receptor family) lie in the same subtree. An enrichment factor (EF) is calculated for each cluster, which estimates how well compounds that bind to the same target (or target class) are clustered in a dendrogram node i, expressed in the following equation:

$$\mathrm{EF}_{i,c} = \frac{N_{i,c} / N_i}{N_c / N}$$

with N_{i,c} being the number of entries in node i belonging to class c, N_i being the total number of entries in node i, N_c being the total number of entries of class c in the data set, and N being the overall number of entries. EF > 1 indicates that more compounds belonging to the activity class c are clustered in a tree node than expected from an equal distribution. The EF value depends on the size of the dendrogram section under consideration: on the upper dendrogram levels, where clusters are large, EF values are usually smaller, whereas EF values on the lower dendrogram levels can become large without statistical relevance. A possible way to overcome the cluster size dependency of the EF is to additionally divide it by the logarithm of the dendrogram level, assuming that at each cluster the data set is separated into equally large partitions. In this way, the EF can be adapted to the dendrogram level.
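A small worked example may make the enrichment factor concrete. The sketch assumes the usual ratio form implied by the definitions above, EF = (N_{i,c} / N_i) / (N_c / N); the function name and the counts are hypothetical, and the level-adjusted variant is not implemented because the text does not fix the base of the logarithm.

```python
def enrichment_factor(n_ic, n_i, n_c, n):
    """EF of class c in node i: class fraction in the node over class fraction overall.

    Assumed form: (n_ic / n_i) / (n_c / n).
    """
    if n_i <= 0 or n_c <= 0 or n <= 0:
        raise ValueError("n_i, n_c and n must be positive")
    return (n_ic / n_i) / (n_c / n)


# Hypothetical node: 5 of its 10 entries belong to class c,
# while class c makes up 250 of the 1000 entries in the whole data set.
print(enrichment_factor(n_ic=5, n_i=10, n_c=250, n=1000))  # 2.0
```

Here half of the node belongs to class c although only a quarter of the data set does, so the node holds twice as many class members as an even distribution would give.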

Although NIPALSTREE is able to deal with more than 800 000 data points in high-dimensional descriptor space in less than an hour of computation time, it does not specify how false positives are addressed; this is a major concern for all methods performing compound retrieval and analysis. Apart from rapid clustering of compounds, NIPALSTREE offers neither visualization nor accurate data mining of compounds, and it is impractical as a method of retrieval and analysis for specific compounds in either drug design or other industrial uses.
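The median-split procedure described in this section can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the implementation of Bocker et al.: the first principal component is obtained with a plain NIPALS iteration on mean-centred data, the stopping rule uses the largest pairwise Euclidean distance (computed by brute force, which would not scale to large libraries), and all function names are hypothetical.

```python
import math


def first_pc_scores(rows, iterations=100, tol=1e-10):
    """Scores of the mean-centred rows on the first principal component (NIPALS)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Start from the column with the largest sum of squares.
    start = max(range(d), key=lambda j: sum(r[j] ** 2 for r in centred))
    scores = [r[start] for r in centred]
    for _ in range(iterations):
        tt = sum(t * t for t in scores)
        if tt == 0:  # all rows are identical
            break
        loadings = [sum(centred[i][j] * scores[i] for i in range(n)) / tt for j in range(d)]
        norm = math.sqrt(sum(p * p for p in loadings))
        if norm == 0:
            break
        loadings = [p / norm for p in loadings]
        new_scores = [sum(r[j] * loadings[j] for j in range(d)) for r in centred]
        change = sum((a - b) ** 2 for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


def max_pairwise_distance(rows):
    """Largest Euclidean distance between any two rows (brute force)."""
    return max(
        (math.dist(a, b) for i, a in enumerate(rows) for b in rows[i + 1:]),
        default=0.0,
    )


def median_split_tree(rows, threshold):
    """Leaf: list of rows. Internal node: (left, right) tuple of subtrees."""
    if len(rows) < 2 or max_pairwise_distance(rows) < threshold:
        return list(rows)
    scores = first_pc_scores(rows)
    order = sorted(range(len(rows)), key=lambda i: scores[i])
    mid = len(order) // 2
    left = [rows[i] for i in order[:mid]]
    right = [rows[i] for i in order[mid:]]
    return (median_split_tree(left, threshold), median_split_tree(right, threshold))


# Hypothetical 2-D descriptors forming two well-separated groups.
descriptors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(median_split_tree(descriptors, threshold=1.0))
```

Each split only needs a one-dimensional sort of the projected scores, which is what makes the approach fast on large descriptor matrices.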

2.2. The Use of Protein-Ligand Interaction Profiles in the discovery of Molecular Mechanisms and Lead Compounds

Since protein-ligand and protein-protein complexes are components of a great number of pharmaceutical [5, 41], nutritional [10] and industrial compounds [29-31], it is reasonable to employ computer-aided lead compound design and discovery methods for applications other than pharmaceutics. Because of its significant role and impact on the quality of human life, drug design was the main focus in the early days of virtual screening and bioinformatics. However, since methods and studies in drug design show that VS and post screening analysis are relatively inexpensive and efficient, we want to explore other fields (nutrition, agriculture and industry) that have not been given as much attention. Protein-ligand complexes of various compounds interact through similar properties [40] and require similar methods of screening, retrieval and analysis of their crystal structures (Figure 1), regardless of what their final application may be.

Therefore, the first part of this research focuses on comparative studies of the features and properties of protein-ligand interaction profiles, in order to better understand their relevance in the mining of novel compounds. Additionally, we investigate the possibility of employing interaction profiles to mine compounds for applications other than drug design, such as cosmetics, skin care, nutrition, safe fertilizers and pesticides, scent compounds for perfumes and deodorants, and safe detergents. Furthermore, we employ interaction profiles to investigate the mechanisms of molecules that are significant for human health and nutrition (e.g. the uptake of vitamin D in the human body by β-lactoglobulin).

Although researchers currently show little interest in mining novel compounds for uses other than pharmaceutics, other industries (e.g. cosmetics, agriculture, nutrition) are looking to benefit from computer-aided methods as these continue to improve and come into wider use. Therefore, the approaches and techniques used in computer-aided drug design can be of particular interest for different biotechnological approaches. VS combined with post screening analysis appears to be efficient for investigating transporter proteins such as β-lactoglobulin (β-LG), their mechanisms and their various functions in the human body. Many compounds with various functions and mechanisms in the body are protein-ligand complexes, which can be investigated on the basis of protein-ligand interactions and physico-chemical features.

CHAPTER 3 The Relevance of Protein-Ligand Interaction Profiles in Computer-Aided Lead Compound Discovery, Functions and Applications

3.1 Introduction

Identification of protein-ligand interaction networks on a proteome scale is crucial in addressing a wide range of biological issues, such as correlating molecular functions to physiological processes and designing safe and efficient target compounds which can be used in therapeutics, nutrition, cosmetics, skin care products, agriculture and industry. In order to understand the role and significance of protein-ligand interactions (Fig. 4) in various applications throughout the field of bioinformatics and biotechnology, the properties and functions of a ligand [42, 43] must be well addressed. As seen previously, a ligand (e.g. vitamin D, Fig. 1) is a molecule, ion or atom which can bind to a specific location, the binding site, of a protein [39, 44].

Currently, antibodies are the most commonly used ligands in biotechnology and life-science investigations, although protein scaffolds (protein regulators), nucleic acids and peptides (short chains of amino acids) are also employed. Since protein-ligand complexes of various compounds are used in cosmetics, hair dyes, skin care products, fertilizers, detergents [29-31] and nutrition supplements [10], protein-ligand interaction profiles and physico-chemical features could be used in the identification of such lead compounds.

Figure 4. View of protein-ligand binding interactions in Betalactoglobulin (a transporter protein) complexed with vitamin D using Swiss PDB viewer. a) Electrostatic potential and molecular surface. b) Hydrogen bond interactions among atoms (green dotted lines).

The ligand binding site of the primary target is extracted or predicted from a 3D experimental structure or homology model of the protein [35, 45] and characterized by a geometric potential. Protein-ligand interactions occur when a ligand binds to a protein, which is usually integral to the function of its cognate protein. In the binding of a ligand to a protein, the following interactions are of significance: electrostatic forces (interactions between electrically charged particles, described by Coulomb’s law), van der Waals forces (the sum of the attractive or repulsive forces between molecules or parts of the same molecule) and hydrogen bonding (the attractive interaction of a hydrogen atom with an electronegative atom, which can occur inter- or intramolecularly) [39, 40]. Based on these interactions, evaluations are made using ligand-based approaches, commonly employed in pharmacophore modeling, which use the physical and chemical traits of known ligands to identify novel inhibitors. Another approach, the receptor-based approach, uses structural and other features of the target receptor to identify the best inhibitor.

Docking [18, 26, 32, 33, 46] is then used to identify the fit between a receptor and a potential ligand by screening a database of ligands against one or more target receptors in two distinct parts: docking (the search scheme that identifies suitable conformations or poses) and scoring (a measure of the affinity of the various poses). Scoring methods must discriminate between non-native docked conformations and correct binding states of compounds during the molecular docking phase, and must distinguish active compounds (usually a small number) from non-active compounds (an extremely large number) during the post-docking analysis. Although there are over 60 docking programs and tools available [24], we present some of the most popular publicly available programs (Table 1). DOCK [18], incremental construction (FlexX) [32] and evolutionary algorithms (GEMDOCK, GOLD, AutoDock) [26, 33, 46] are used to screen and downsize compound groups in order to select suitable candidates for post-screening analysis.

However, inconsistencies in the performance of scoring functions result in inadequate prediction of the true binding affinity of a ligand for a receptor; thus, combining various scoring methods in VS may improve performance over that of the average individual scoring function. Similar inconsistencies have been noticed in information retrieval (IR), and Charifson et al. [15] presented a study in which they used an intersection-based consensus approach to combine scoring functions, which improved the discrimination between active and inactive enzyme inhibitors. Studies by Bissantz et al. [3], Stahl and Rarey [11] and Verdonk et al. [16] reported consensus scores which further improved VS enrichment. However, the remaining issue for VS users, as opposed to researchers, is when and how these scoring functions should be combined in either drug design or industrial compound design.
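One simple way to combine scoring functions is to keep only those compounds that rank near the top under every function, i.e. the intersection of the top-ranked lists. The sketch below illustrates this idea under stated assumptions: scores are energies for which lower is better, the compound names and values are invented, and the function names and the `fraction` parameter are hypothetical rather than taken from any of the cited studies.

```python
def top_fraction(scores, fraction):
    """Names of the best-scoring compounds; lower (more negative) energy is better."""
    keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get)
    return set(ranked[:keep])


def consensus_hits(score_tables, fraction=0.1):
    """Compounds that fall in the top fraction of every scoring function."""
    if not score_tables:
        raise ValueError("at least one scoring function is required")
    tops = [top_fraction(table, fraction) for table in score_tables]
    return set.intersection(*tops)


# Hypothetical energies from two scoring functions for four compounds.
scores_a = {"c1": -9.0, "c2": -8.5, "c3": -4.0, "c4": -3.0}
scores_b = {"c1": -7.0, "c2": -2.0, "c3": -6.5, "c4": -1.0}
print(consensus_hits([scores_a, scores_b], fraction=0.5))  # {'c1'}
```

A compound favoured by only one function, such as c2 or c3 here, is dropped, which is how a consensus can reduce false positives that stem from the bias of a single scoring function.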

Docking program    URL                                                      Reference
DOCK               http://dock.compbio.ucsf.edu/                            [18]
FlexX              http://biosolveit.de/flexx/index.html?ct=1               [32]
AutoDock           http://autodock.scripps.edu/                             [46]
GEMDOCK            http://gemdock.life.nctu.edu.tw/dock/igemdock.php        [26]
GOLD               http://www.ccdc.cam.ac.uk/products/life_sciences/gold/   [33]

Table 1. Popular docking tools and evolutionary algorithms currently used in VS

Furthermore, certain VS methods can identify important interactions or binding-site hot spots obtained from known active ligands and target proteins [17]. However, because of biases towards higher molecular weight and charged polar compounds [18], docking alone is not sufficient to analyse, determine and retrieve the most adequate lead compounds. Post screening analyses are therefore emerging as useful methods for further eliminating the false positive hits obtained from VS.

Methods for post-screening analysis employing clustering to identify key features obtained via docked compounds and the understanding of binding mechanisms are of great use in bioinformatics. Therefore, computer-aided drug and industrial target design require VS as a primary step to generate interaction and structure profiles followed by post screening analysis for adequate filtering, visualization and mining of the final candidates.

3.2 The Significance of Protein-Ligand Interaction Profiles in Methods of Compound Retrieval and Post Screening Analysis

Interactions between molecules (Fig. 4) are important for understanding many biological phenomena. From gene expression to enzyme reactions, biological activities are dictated by molecular interactions. Following the success of DNA microarrays, researchers are studying the protein counterpart in greater detail [47]. Protein microarrays can be used for studying a variety of biological phenomena, such as protein-ligand, protein–protein, antibody–antigen and protein–DNA interactions, analysis of subunits in protein complexes, screening of target proteins expressed from a phage library, analysis of mutant proteins, quantitative assays, discovery of diagnostic markers, analysis of protein expression profiles, development of diagnostic microarrays and development of microarray-based lead screening systems. The interactions of significance in the analysis and retrieval of lead compounds for drug design are intermolecular interactions such as van der Waals forces, electrostatic forces and hydrogen bond interactions [39, 40]. Also called interaction energies, they can be obtained from the docking calculations performed during virtual screening [13]. The calculated interaction energies are organized into data sets of interaction profiles (IPFs) and can be used as one of the criteria in a cluster analysis to further filter and select more specific, or the final, target compounds. Thus, a cluster analysis groups compounds with similar interaction energies into separate clusters, from each of which a representative is chosen, usually based on RMSD values, in what is termed a post screening analysis.
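The grouping-and-representative step can be illustrated with a deliberately simple sketch. It swaps in a greedy "leader" grouping in place of a full cluster analysis, represents each compound by a hypothetical three-term interaction profile (van der Waals, electrostatic and hydrogen bond energies), and assumes that the representative of a cluster is the member with the lowest RMSD value. The names, energies, `radius` parameter and selection rule are all assumptions made for illustration.

```python
import math


def leader_clusters(profiles, radius):
    """Greedy grouping: a compound joins the first cluster whose leader is within radius."""
    clusters = []  # each cluster is a list of names; the first name is the leader
    for name, vector in profiles.items():
        for members in clusters:
            if math.dist(vector, profiles[members[0]]) <= radius:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def representatives(clusters, rmsd):
    """Pick the member with the lowest RMSD value from each cluster (assumed rule)."""
    return [min(members, key=rmsd.__getitem__) for members in clusters]


# Hypothetical interaction profiles: (van der Waals, electrostatic, hydrogen bond) energies.
profiles = {
    "A": (-30.0, -5.0, -2.0),
    "B": (-29.0, -5.5, -2.0),
    "C": (-10.0, -20.0, -8.0),
}
rmsd = {"A": 1.8, "B": 0.9, "C": 1.2}

clusters = leader_clusters(profiles, radius=3.0)
print(clusters)                         # [['A', 'B'], ['C']]
print(representatives(clusters, rmsd))  # ['B', 'C']
```

Compounds A and B share a van der Waals dominated profile and fall into one group, whereas C, which binds mainly through electrostatic and hydrogen bond terms, forms its own group and supplies its own representative.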

3.2.1 Post Screening Analysis

Methods of post screening analysis [21-23] are designed to facilitate the visualization (interpretation of binding interactions), organization (clustering and organizing structures in a meaningful way), analysis (comparing and profiling the binding interactions of different structures) and data mining (searching for structures containing key interactions or specific features) of virtually screened compounds. As mentioned earlier, the binding interactions [39] of protein-ligand complexes (e.g. van der Waals forces, electrostatic forces and hydrogen bond interactions) are a critical part of mining and selecting the target representatives in post analysis methods.

Descriptions of binding interactions and interaction strength measures for protein-ligand complexes are very important for better mining of appropriate candidates from the selection lists generated by VS [48]. Through an in-depth study of protein-ligand interactions in various post screening analyses, we attempt to develop an integrated method of VS and post screening analysis in order to speed up the screening and analysis of compounds, to generate better interaction-specific information and to obtain suitable representatives. The overall details of this study are shown in Figure 5.

Figure 5. Methods from previous works investigated and our studies done in the designing of our
