Thesis Overview - LigSeeSVM：結合Support Vector Machines與資料融合在活性配體為基之藥物篩選及GPCR與GABAA之實際應用

Chapter 1. Introduction

1.2 Thesis Overview

We have used LibSVM 2.71 (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), developed by Lin et al. [20], on virtual database screening. In chapter 2, we used 10 thymidine kinase substrates, 11 estrogen receptor antagonists, 10 estrogen agonist, 100 GPCR and GABAA

ligands and combined with 990 randomly chosen compounds from ACD or 7300 randomly chosen compounds from CMC to form fourteen screening sets. After preparing screening sets, we extracted two kinds of features to describe physicochemical and structural properties of compounds. Next, we used LibSVM 2.71 with different screening sets to evaluate the performance on virtual screening.

In chapter 3, we evaluated the screening performance of LibSVM with screening sets (TK, ER antagonists, ER agonists, GPCR, and GABAAligands) combined with 990 ACD randomly chosen compounds by the true hit, hit rate, goodness-of-hit (GH) and ROC curves. We used three combinations of atom pair descriptors only, physicochemical descriptors only, and rank combination to compare screening utilities with other ligand-based virtual screening approach.

The results showed that the rank combination was the most reliable.

In chapter 4, we applied the LibSVM with screening sets (TK, ER antagonists, ER agonists, GPCR, and GABA_A ligands) combined with 7300 CMC randomly chosen compounds. The results showed that the majority of compounds with high Z-score have structures similar to the known ligands. Some compounds with high Z-score but have different structures compared with the known ligands, and these compounds have more possibility to become novel, potential lead compounds.

Chapter 5 presented some conclusions and future perspectives. Our results suggest that SVM is practically applicable for virtual screening and offer competitive performance to other ligand-based virtual screening approach. Combination has great improved the result of drug screening approach and it could effectively reduce the number of false positives.

Chapter 2 Methods and Materials

The SVM method was proposed by Vapnik [21, 22] on the basis of the Structural Risk Minimization Principle [21, 23]. It was initially designed to solve pattern recognition problems [21, 24], but it was later applied to function estimation problems [21, 25]. The estimated function is a linear expansion in terms of functions defined on a certain subset of the data (support vectors), and the final number of coefficients in such an expansion does not depend on the dimensionality of the space of input variables. These two properties make SVM an especially useful technique for dealing with very large data sets in a high-dimensional space [21].

In this study, we used LibSVM 2.71 (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) [20]. Data examples labeled as positive or negative are projected into a high-dimensional feature space and the hyper-plane in the feature space is optimized to maximize the margin between the positive and negative [19]. The general flowchart of the virtual screening using LibSVM is shown in Figure 1.

2.1 Preparation of ACD Screening Databases

A. Thymidine kinase substrates and estrogen receptor antagonists and agonists

Two screening sets designed for virtual screening against TK and ERαantagonists were proposed by Bissantz et al. [26] in 2000 and retested by Jain in 2003 [27]. A TK library contained 10 known ligands of TK and 990 randomly chosen molecules from the ACD; an ERα-antagonists library contained 11 known antagonistsofERα and 990 randomly chosen

molecules from the ACD. The two sets were downloaded from

http://jainlab.ucsf.edu/Downloads.html proposed by Jain and the files in the SYBYL mol2 format were converted to the MDL mol format with Corina 3.0 [28] for running Accelrys Cerius² QSAR module [13]. Besides, the screening set for ERαagonists included 10 known agonists [29] and the same 990 molecules as the ERα-antagonists library.

B. GPCR and GABAAligands

Serotonin, muscarinic, histamine, and GABA_A ligands (shown in Figure 2) [3] combined with 40 randomly chosen molecules from the 990 ACD molecules that were used for SVM training set. A and B are serotonin ligands [30-32]. C (tolterodine), D, and E are muscarinic antagonists [33-35]. Molecules F－H (bromodiphenhydramine, pyrilamine, and azatadine) are H1 receptor antagonists. Molecules I － K are GABA_A receptor agonists (diazepam, alprazolam, and zopiclone).

To test the models that resulted from these input structures, GPCRDB (www.gpcr.org) [3, 36] was used to identify 85 GPCR ligands of widely varying chemical structure and 15 GABA_A agonists, of which nine were variations on the classic benzodiazepine scaffold and six were of varying chemotypes . Table 1 lists the names and annotated target specificity for all 100 molecules [3]. As a negative control for all cases, the 990 ACD randomly chosen molecules described above were used.

2.2 Preparation of CMC Screening Databases

The larger screening set was derived from the drug database, Comprehensive Medicinal Chemistry (CMC). Using ISIS, the CMC were first filtered with molecular weights between 200 and 500 to yield about 7300 molecules and then removed small fragments and added hydrogen atoms with Corina 3.0. The structure files of molecules were stored in the MDL mol

format for running Accelrys Cerius² QSAR module and in the SYBYL mol2 format for running AP generator.

A TK library contained 10 known TK substrates and 7300 molecules from the CMC; an ERα-antagonists library contained 11 known antagonistsofERα and7300 molecules from the CMC; the screening set of ERαagonists contained 10 known agonists and 7300 molecules from the CMC. Besides, the screening set of 85 GPCR and 15 GABA_Aligands also included the same 7300 CMC molecules.

2.3 Feature Extraction

A. Atom pair descriptors

Atom pair descriptors (AP) were obtained with the AP generator program developed in our laboratory using an approach proposed by Carhart et al. [12, 37]. Figure 3 shows the definition of atom pair descriptors. The key components for defining a set of atom pair descriptors include the definition of atom types and the topological distance bins. An atom pair is a simple type of substructure defined in terms of the atom types and the shortest path separation (or graph distance) between two atoms. The graph distance is defined as the smallest number of atoms along the path connecting two atoms in a molecular structure. The general form of an atom pair is as follows:

Atom type i --(distance)-- atom type j

where (distance) is the graph distance between atoms i and j in the case of a 2D atom pair description. (The distance can also be defined as the physical distance between atom i and j in the case of a 3D atom pair description.)

In this study, SYBYL atom types (mol2 format) were utilized as the starting point. In principle, all SYBYL atom types can be used in the generation of atom pair descriptors. In order to reduce the number of atom pair descriptors and improve the accuracy, we have clustered 23 atom types into 10 atom types (Table 2). The total number of pairwise combinations of all 10 atom types is 55. Furthermore, 15 distance bins were defined in the interval of graph distance rang from zero (i.e., zero atoms separating an atom pair) to 14. Thus, a total of 825 (55 × 15) atom pair descriptors were generated for each molecular structure [11].

B. Accelrys Cerius²QSAR module default and thermodynamic descriptors

The physicochemical descriptors of a compound are generated by using Accelrys Cerius² QSAR module. QSAR module provides a wide variety of descriptors, in this study, we used two functional families of descriptors, six descriptors in the thermodynamic family (Table 3) and 13 descriptors in the default family (Table 4) [13]. Each descriptor is briefly described as follows:

Sum of atomic polarizabilities (Apol): The sum of atomic polarizabilities (Apol) descriptor computes the sum of the atomic polarizabilities. The polarizabilities are caculated from the A coefficients used for molecular mechanics calculations:





i i

a A

For more information, see Marsali and Gasteiger (1980); Hopfinger (1973).

Dipole moment (Dipole): The dipole moment descriptor is a 3D electronic descriptor that indicates the strength and orientation behavior of a molecule in an electrostatic field. Both the

magnitude and the components (X, Y, Z) of the dipole moment are calculated. It is estimated by utilizing partial atomic charges and atomic coordinates. Partial atomic charges are computed using the charge setup option in the QSAR control panel offering CHARMm charging rules, Gasteiger, CNDO2, and Del Re methods. The descriptor uses Debyes units.

Dipole properties have been correlated to longrange ligand-receptor recognition and subsequent binding.

Radius of gyration (RadOfGyration): The radius of gyration is calculated using the following equation:

 





 



  





^x ^y_N ^z

Rog ⁱ ⁱ ⁱ

2 2 2

where N is the number of atoms and x, y, z are the atomic coordinates relative to the center of mass.

Molecular surface area (Area): The molecular surface area descriptor is a 3D spatial descriptor that describes the van der Waals area of a molecule. The molecular surface area determines the extent to which a molecule exposes itself to the external environment. This descriptor is related to binding, transport, and solubility.

Molecular weight (MW): Molecular weight.

Molecular Volume (Vm): A 3D spatial descriptor that defines the molecular volume inside the contact surface. The molecular volume is calculated as a function of conformation.

Molecular volume is related to binding and transport.

Density (Density): A 3D spatial descriptor that is defined as the ratio of molecular weight

to molecular volume. It has the units of g ml^-1. The density reflects the types of atoms and how tightly they are packed in a molecule. Density can be related to transport and melt behavior.

Principal moment of inertia (PMI): Calculates the principal moments of inertia about the principal axes of a molecule according to the following rules:

 The moments of inertia are computed for a series of straight lines through the center of mass. The moments of inertia are given by:





i i id m

I ²

 Distances are established along each line proportional to the reciprocal of the square root of I on either side of the center of mass. The locus of these distances forms an ellipsoidal surface. The principal moments are associated with the principal axes of the ellipsoid.

 If all three moments are equal, the molecule is considered to be a symmetrical top. If no moments are equal, the molecule is considered to be an unsymmetrical top.

For more information about this descriptor, see Hill (1960).

Number of rotatable bonds (Rotlbonds): Counts the number of bonds in the current molecule having rotations that are considered to be meaningful for molecular mechanics. All terminal H atoms are ignored (for example, methyl groups are not considered rotatable).

Hbond acceptor: Number of hydrogen-bond acceptors.

Hbond donor: Number of hydrogen-bond donors.

Energy: The energy of the selected compound.

Chiral centers: Number of chiral center (R or S) in a molecule.

AlogP and molar refractivity (MolRef): LogP (the octanol/water partition coefficient) and molar refractivity are molecular descriptors that can be used to relate chemical structure to observed chemical behavior. LogP is related to the hydrophobic character of the molecule.

The molecular refractivity index of a substituent is a combined measure of its size and polarizability.

The QSAR descriptor ALogP and molar refractivity are calculated using the method described by Ghose & Crippen (1989). In this atom-based approach, each atom of the molecule is assigned to a particular class, with additive contributions to the total value of logP and molar refractivity.

For more information, see Leffler and Grunwald (1963).

AlogP98: The AlogP98 descriptor is an implementation of the atom-type-based AlogP method using the latest published set of parameters (Ghose et al. 1998).

Desolvation free energy for water (FH2O) and octanol (Foct): Foct and FH2O are physiochemical properties associated with LFE models of a molecule. These properties have proven useful as molecular descriptors in structure–activity analyses. All LFE computations are based solely on the connectivity of the atoms in a molecule. LFE computations are not conformationally dependent.

F_octis the 1-octanol desolvation free energy and F_H2Ois the aqueous desolvation free energy derived from a hydration shell model developed by Hopfinger, where Foctand FH2Oare in kcal mol^-1.

For more information, see Hopfinger (1973; 1980) Pearlman (1980).

Heat of formation (Hf): The enthalpy for forming a molecule from its constituent atoms, a measure of the relative thermal stability of a molecule. This descriptor is calculated using the MNDO semi-empirical molecular orbital method of Dewar. MNDO is the most rigorous quantum-chemical technique available in QSAR+ and has a wide range of applicability in conformational analysis, intermolecular modeling, and chemical reaction modeling. The atom limit of MNDO is 300 atoms or 300 atomic orbitals (whichever is less) per molecule. The atoms treated by MNDO are: H, B, C, N, O, F, Al, Si, P, S, and Cl.

For more information, see Dewar amd Thiele (1977a; 1977b).

2.4 Build LigSeeSVM prediction Model

A. Divide data set into training set and testing set

To choose the known ligands used as the positive examples of the SVM training set, we calculated the molecular similarity. Euclidean distance was used as the measure of similarity in the multidimensional descriptor space. The latter distance d_ijbetween any two compounds i and j in N-dimensional descriptor space was calculated as

 







 ^N

jn in

ij X X

where Xin and Xjn are the values of nth descriptor for compounds i and j, respectively, and the summation is over all descriptors [38]. Based on UPGMA algorithm [39], we created a rooted tree of the known ligands. Compounds with the maximum distance (least similarity) were chosen as the positive examples of the SVM training set combined with randomly chosen molecules from 990 ACD molecules or 7300 CMC molecules as the negative examples. In the SVM training set, the ratio of the number of negative and positive data is about 20. For example, in the ACD screening set, the SVM training set consisted of two or three known

ligands as positive examples and 40 randomly chosen compounds from 990 ACD screening set as negative examples. The remained known ligands and 950 ACD molecules made up the SVM testing set. In the CMC screening set, we chose half of the known ligands as positive examples and 200 randomly chosen compounds from 7300 CMC screening set as negative examples to make up the SVM training set, and the SVM testing set was composed of the remained half known ligands and 7100 CMC molecules.

B. Add –b and –wi parameters

When training a SVM model, add –b parameter allows LibSVM to train a model for probability estimates. Next, we transfer the probability score to Z-score and sort according to Z-score. Besides, as we described earlier, in the SVM training set, the ratio of negative and positive data is about 20. In other words, our data is unbalanced, so we add –wi parameter to give the penalty for class “-1”is 2, larger than class “1”.

C. Rank combination

In this study, we combined the results of SVM-AP and SVM-PC using a “rank data fusion”

method in which a new rank is generated by sorting the average rank of SVM-AP and SVM-PC. Application of this hybrid method to each test set to improve the performance of both SVM-AP and SVM-PC predicted models.

Chapter 3 Evaluation LigSeeSVM on ACD Database

We have made seven screening sets against different target proteins, thymidine kinase (TK), estrogen receptor with antagonists (ER-antagonists), estrogen receptor with agonists (ER-agonists), serotonin receptor, muscarinic receptor, histamine receptor, and GABAA

receptor. Each screening set includes several known active ligands and 990 ACD randomly selected compounds. Five common metrics were used to evaluate the screening quality, including the true hit (the percentage of active ligands retrieved from the database), hit rate (the percentage of active ligands in the hit list), goodness-of-hit (GH), ROC curves, and false positive rate.

3.1 Thymidine Kinase

A. Preparation of screening data set

To evaluate the virtual screening utility of SVM, we used the thymidine kinase benchmark from Bissantz et al. [26]. It consisted of 10 known ligands (Figure 4) and 990 compounds ACD screening set mentioned above. We chose two known ligands (1kim.THM and 2ki5.AC2) based on the UPGMA rooted tree (Figure 5A) as positive examples of SVM training set and 40 randomly chosen compounds from 990 ACD screening set as negative examples. The SVM testing set including 8 remained known ligands and 950 ACD molecules.

B. Virtual screening of TK substrates

Table 5 shows the result of the remaining 8 TK known ligands in SVM testing set using different descriptors. The results of different descriptors are also shown in Figure 5.

Some common factors were used to evaluate the screening quality, including coverage (the percentage of active ligands retrieved from the database), yield (the percentage of active ligands in the hit list), false-positive (FP) rate, enrichment, and goodness-of-hit (GH). The coverage (true positive rate) is defined as Ah/A (%),Ah/Th (%) is the yield (hit rate), and the FP rate is defined as (Th－ Ah)/(T－ A) (%). The enrichment is defined as (Ah/Th)/(A/T). Ah is the number of active ligands among the Th highest ranking compounds, which is called the hit list, A is the total number of active ligands in the database, and T is the total number of compounds in the database. The GH score is defined as [40]

 

_

In the case of TK A and T are 10 and 1000, respectively.

The main objective of this study was to evaluate the behavior of different predicted models in virtual screening. Figure 6 shows the results of SVM-AP, SVM-PC, and LigSeeSVM. We tested SVM with different descriptors to evaluate the performance and the search behavior.

The screening quality generally improved in the case of LigSeeSVM. As shown in Figure 6A, the hit rates for different predicted models are 50% (SVM-AP), 40% (SVM-PC), and 72.7%

(LigSeeSVM) when the TP rate is 100%. In the case of LigSeeSVM and the TP rate is 100%, the GH score is 0.79 (Figure 6B) and the FP rate is 0.3% (Table 6).

Table 6 shows the comparative false positive rates on the same target protein and screening database at true positive rates 80% and 100% (Surflex [3]). For the true positive rate of 100%, the FP rate for LigSeeSVM is 0.3%. Besides, the FP rates for Surflex is 0.7%, SVM-AP is 0.8%, and SVM-PC is 1.3%. The performance of LigSeeSVM is better than Surflex.

3.2 Estrogen Receptor

A. Preparation of screening data set

We have applied SVM to virtual screening against ERα with a testing sets composed of 11 known antagonists of ERα (Figure 7), 10 known agonists of ERα(Figure 8) [29] and 990 randomly selected compounds from ACD (Available Chemicals Directory). As we described earlier, based on the UPGMA rooted tree (Figure 5B, Figure 5C), we chose two known ERα antagonists (EST02 and EST11) as positive examples of the SVM training set and 40 randomly chosen molecules from 990 ACD screening set, and the SVM testing set was composed of the remained 9 known antagonists and 950 ACD compounds. In the ERα agonists case, we chose three known ERα agonists (ESA01, ESA04, and ESA09) as positive examples of the SVM training set and 40 randomly chosen molecules from 990 ACD screening set, and the SVM testing set including remained 7 known agonists and 950 ACD molecules.

B. Virtual screening of ERα antagonists and agonists

Table 7 shows the result of the ERα antagonists and Table 9 shows the result of ERα agonists in SVM testing set using different predicted models. The results of different predicted models are also shown in Figure 9(ERα antagonists). and Figure 10(ERα agonists). As shown in Figure 9A, the hit rates of ERα antagonists for different predicted models are 23% (SVM-AP), 17.3% (SVM-PC), and 69.2% (LigSeeSVM) when the TP rate is 100%. In the case of LigSeeSVM and the TP rate is 100%, the GH score is 0.77 (Figure 9B) and the FP rate is 0.6% (Table 8). Table 8 shows the comparative false positive rates on the same target protein and screening database at true positive rates 80% and 100%. For the true

positive rate of 100%, the FP rate for LigSeeSVM is 0.6%. Besides, the FP rates for Surflex is 13.4%, SVM-AP is 3.2%, and SVM-PC is 4.5%. The performance of LigSeeSVM is much better than Surflex. For the ERα agonists case, as shown in Figure 10, the hit rates for different predicted models are 15.9% (SVM-AP), 28% (SVM-PC), and 100% (LigSeeSVM) when the TP rate is 100%. Table 10, 11 shows top ranked 10 compound structures in the case of ERα agonists.

3.3 GPCR and GABAAReceptor

A. Preparation of screening data set

To compare with Surflex-Sim [3], we used the same molecules (Figure 2) which described in the paper as positive examples of SVM training set and combined with 40 randomly chosen compounds from 990 ACD screening set as negative examples of training set. A and B are serotonin ligands. C (tolterodine), D, and E are muscarinic antagonists. Molecules F－H (bromodiphenhydramine, pyrilamine, and azatadine) are H1 receptor antagonists. Molecules I

－ K are GABAA receptor agonists (diazepam, alprazolam, and zopiclone). The same molecules (Table 1) which described in the paper [3] also used to test the SVM models. The serotonin SVM testing set including 29 known serotonin ligands and 950 ACD molecules.

The muscarinic SVM testing set was composed of 43 known muscarinic ligands (Table 1) and 950 ACD molecules. Again, the histamine SVM testing set consisted of 49 known histamine ligands (Table 1) and 950 ACD molecules. The GABA_ASVM testing set including 15 known

在文檔中 LigSeeSVM：結合Support Vector Machines與資料融合在活性配體為基之藥物篩選及GPCR與GABAA之實際應用 (頁 13-0)