Stroke prevention by traditional Chinese medicine? A genetic algorithm, support vector machine and molecular dynamics approach†
Kuan-Chung Chen
aand Calvin Yu-Chian Chen*
abcdReceived 27th December 2010, Accepted 20th January 2011 DOI: 10.1039/c0sm01548b
Phosphodiesterase 4D (PDE4D) has been identified be a promising target which associate with stroke, which is one of the top 3 leading of death and main leading cause of adult disability in US. In this study, we applied virtual screening on the world’s largest traditional Chinese medicine (TCM) database (http://tcm.cmu.edu.tw;
1C. Y. C. Chen, PLoS One, 2011, 6, e15939) for natural compounds that inhibit PDE4D functions. Molecular docking and dynamics simulation were employed to investigate the protein–ligand interactions of compounds with top two dock scores. During the simulation, the divalent metal cations in PDE4D formed stable hydrogen bonds and electrostatic interactions between ligand and binding site residues. Furthermore, the two top TCM candidates, 2-O-caffeoyl tartaric acid and mumefural, formed additional steady hydrogen bond with binding site residue and active site residue respectively. The additional hydrogen bonds further stabilize protein-ligand interaction at the PDE4D binding site. To predict the bioactivity of the top TCM candidates, we built two prediction models using multiple linear regression (MLR) and support vector machine (SVM). The predicted pIC50 values suggest that 2-O-caffeoyl tartaric acid and mumefural are potential PDE4D inhibitors.
Introduction
Stroke and cerebrovascular diseases are the top 3 leading causes of death in U.S. in last several years.
2Moreover, stroke is the main leading cause of adult disability.
3Stroke is a multifactorial disease that can have both genetic and environmental contribu- tions. In the past few years, researchers have attempted to explore the polygenic factors behind stroke by using genome- wide association study; however, the immense amount of infor- mation stored in human genome has presented arduous logistical and technical challenges.
Recently, the Icelandic Decode group has identified a novel stroke-related gene, phosphodiesterase 4D (PDE4D), by using genome-wide association screen.
4Since the study from Gre- tarsdottir, many replication studies have been conducted using different population samples, including Staton,
5Sun,
6,7Nakayama,
8van Rijn,
9Saleheen,
10Nilsson-Ardnor,
11Bevan,
12and Brophy,
13in an attempt to examine the relationship between PDE4D and stroke.
PDE4D is a cyclic nucleotide phosphodiesterase which selec- tively hydrolyzese cyclic AMP (cAMP) to control the concen- tration of second messenger cAMP. The concentration of cAMP influences various cellular metabolisms. Decreasing in cAMP concentration could lead to increasing proliferation and migra- tion of smooth muscle cell that subsequently could result in development of atherosclerosis.
4Therefore, a PDE4D-targeting drug which is designed to obstruct PDE4D activation can reduce the risk of stroke, like the design shown in Fig. 1. This study
Fig. 1 Schematic diagram representing the proposed drug design strategy. By inhibiting PDE4D activity, the cAMP concentration will rise and hence prevent over proliferation of smooth muscle cell (SMC).
Thickening blood vessel wall that blocks blood flow and causes stroke can hence be prevented by regulating SMC growth.
a
Laboratory of Computational and Systems Biology, School of Chinese Medicine, China Medical University, Taichung, 40402, Taiwan. E-mail:
ycc@mail.cmu.edu.tw; ycc929@MIT.EDU; Tel: +1-617-353-7123
b
Department of Bioinformatics, Asia University, Taichung, 41354, Taiwan
c
Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
d
Department of Computational and Systems Biology, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
† Electronic supplementary information (ESI) available. See DOI:
10.1039/c0sm01548b
ª The Royal Society of Chemistry 2011
Dynamic Article Links C <
Soft Matter
Cite this: Soft Matter, 2011, 7, 4001
www.rsc.org/softmatter PAPER
focused on identifying potential lead compounds that could act as PDE4D inhibitor.
Large structural diversity found in Chinese herbal ingredients and natural products poses a very valuable resource for finding novel lead compounds. Various natural compounds have been shown to be useful in anti-cancer,
14,15anti-radical, anti-inflam- matory and anti-platelet therapies.
16We, therefore, employed a traditional Chinese medicine (TCM) database (http://
tcm.cmu.edu.tw)
1and natural compounds for virtual screening by molecular docking. Molecular dynamics simulations were performed after docking to study receptor-ligand interactions and to determine stabilized conformations for the obtained protein-ligand complexes. Currently, bioactivity prediction using machine learning and building QSAR model is becoming a routine process in drug design; some novel examples include using support vector machine (SVM) for identifying bioactive natural products
17and building quantitative structure–activity relationship (QSAR) models for predicting inhibitory activity of small molecules.
18–27To predict the bioactivities of TCM candi- dates obtained from screening, we also constructed QSAR models using both multiple linear regression approach and SVM approach. The prediction resulted from multiple linear regres- sion (MLR) and SVM can further support the efficacy of the TCM candidates.
Materials and methods
Data set
The molecular simulations were performed by using Accelrys Discovery Studio 2.5 (DS 2.5). A total of 20,000 TCM compounds were downloaded from TCM database (http://
tcm.cmu.edu.tw/).
1All the downloaded TCM compounds had been minimized in MM2 force field using Chem 3D Ultra 11.0.
The crystal structure of PDE4D (PDB ID: 3G4G)
28was obtained from RCSB Protein Data Bank. Known PDE4D inhibitor, RS- 253344, which co-crystallized in the PDE4D cAMP binding pocket, was used as a control.
Molecular docking
The docking simulation was performed in the force field of CHARMm
29using LigandFit module
30in the Receptor-Ligand Interactions package of DS 2.5. The compounds from TCM Database were docked into the binding site of PDE4D which was defined by the volume of RS-25344 present in the initial crystal structure. Additionally, the inhibitor RS-25344 and cAMP, the normal substrate for PDE4D, were used as controls. The receptor was fixed in the docking protocol, but the ligands that complement the binding site were initially flexible but were later fixed for rigid-body energy minimization after docking. All chemical compounds and their possible poses were evaluated by scoring functions. In addition, top ranking compounds were visually inspected for binding poses, binding location and interacting residues. Our selection criteria specified the potential candidates to have similar binding characteristics and locations with the control molecules.
Dock Score that predicts ligand binding affinities by calcu- lating ligand internal energy and receptor-ligand interaction energy was used as the primary scoring function.
30Other scoring
functions, PLP, PLP2 and PMF, were also calculated for addi- tional references but were not used in selecting ligand and their poses. The PLP and PLP2 stand for piecewise linear potentials which predict protein-ligand binding affinities by calculating hydrogen bond interaction.
31These two versions of pairwise linear potentials differ in that PLP2 introduces atomic radii and angular components. The PMF (potential of mean force) predicts ligand binding affinities by summing all interatomic pairwise interactions of the receptor-ligand complex.
32Molecular dynamics simulation
The molecular dynamics (MD) simulation was performed using Standard Dynamics Cascade module and Dynamics (Produc- tion) module in the Simulation package of DS 2.5 with CHARMm.
29The complexes were created a 7 A solvation shell around the protein. Each system was also neutralized by addi- tional sodium cations. Two cycles of minimization have been performed, each using different system condition. First, the minimization was performed with the atoms of protein restrained. Afterwards, the minimization was performed without restraint. The minimization was performed using both Steepest Descent
33and Conjugate Gradient
34with maximum up to 6000 cycles. The SHAKE algorithm was applied to fix all bonds involving hydrogen atoms throughout the MD simulation, and the long-range electrostatics were treated with PME method. The time step was set to 0.001 ps for all MD stages. After minimi- zation, the protein-ligand complex was gradually heated to target temperature of 310 K from 51 K over an interval of 50 ps. After heating procedure and a 200 ps equilibration phase, the production stage was performed for 20 ns using NVT canonical ensemble. The temperature coupling decay time for the Berend- sen thermal coupling method was 0.4 ps. Post processing of the trajectory was done using Analyze Trajectory module at Simu- lation package of DS 2.5.
Support vector machine and multiple linear regression
To predict the bioactivity of the TCM compounds, we built activity prediction models using both SVM approach and MLR approaches. These two approaches differ in that SVM utilizes a non-linear feature mapping technique whereas MLR uses linear evaluation. The SVM model in the present study was built using LibSVM.
35Multiple linear regressions were performed using MATLAB.
The 62 compounds used in building the predictive model were complied from three different sources
36–38and were divided randomly into training set and test. These compounds were drawn using ChemBioOffice 2008 and prepared in Discovery Studio 2.5 using the Prepare Ligand module to modify the compound ionization state to physiological ionization setting.
The activity (IC
50) data used in QSAR were converted to pIC
50.
The molecular descriptors were calculated using Calculate
Molecular Properties in Discovery Studio 2.5, which could
calculate 552 different descriptors. To select molecular descrip-
tors that best represent the training set molecules, the genetic
function approximation (GFA)
39was used. The GFA algorithm
is based on genetic algorithm, which simulates a population of
strings based on the concept of natural evolution. In genetic
ª The Royal Society of Chemistry 2011
algorithm, selection, crossover and mutation events are itera- tively simulated for a specified number of generations. By applying genetic algorithm, the GFA searches all possible QSAR models and then uses square correlation coefficient (R
2) to esti- mate the fitness of individual model. The cross validation test is applied during building model.
Support vector machine (SVM) is a novel supervised machine- learning method, widely used in bioinformatics.
40–43SVM mini- mizes an upper bound of the generalization error on Vapnik- Chernoverkis dimension. This convention of minimization is superior in comparison to the traditional empirical risk mini- mization (ERM), which only minimizes the training error. In addition to binary classification, SVM can be used in regression of continuous data.
44This is called support vector regression (SVR). In our cases, the activities (pIC
50) of inhibitors are continuous. Hence, in order to build a model that can accurately predict inhibitor activity, we used SVR instead of SVM.
The basic idea in SVR is to find the hyperplane that best predicts data distribution by mapping the input data set into a high-dimensional feature space using Kernel function. For a given training data (x
1, y
1), ., (x
l, y
l) ˛ R
dR where y
idenotes the output values of the input vectors y
iand the data can be expressed in the form of f(x) ¼ <w,x> + b, w ˛ R
d, b ˛ R, the SVR finds w by introducing 3-insensitive loss function. This SVR formulation is
w;b x;x*
min 1
2 kw
2k þ C X
li¼1
x
iþ x
*is:t:
8 >
<
> :
y
ihw; x
ii bÞ # 3 þ x
ihw; x
ii þ b y
i# 3 þ x
*ix
i; x
*i$ 0; i ¼ 1; :::; l
9 >
=
> ; (1)
This optimization problem can be further transformed to include nonlinear functions by introducing Lagrange multipliers and kernel k(x,x
0) ¼ hf(x),f(x’)i, which gives
max:
8 >
> >
<
> >
> : -
12P
li;j¼1
ða
i-a
iÞða
j-a
jÞkðx
i; x
jÞ
-3 P
li¼0
ða
i-a
iÞ þ P
lj¼0
y
iða
j-a
jÞþ 9 >
> >
=
> >
> ;
s:t: P
li¼0
ða
i-a
iÞ ¼ 0 and a
i-a
i˛½0; C
(2)
The prediction for new input, in terms of Kernel function, is given by
f ðxÞ ¼ X
li¼1
ða
ia
iÞkðx
i; x Þ þ b,s:t: 0 # a
i# C; 0 # a
i# C (3)
The performance of SVR is significantly dependent on the setting of parameters: C, 3, the kernel type (mapping function) and the corresponding kernel parameters. In LibSVM, the available kernel functions include linear, polynomial, radial basis function, sigmoid function and precomputed kernel. We con- structed the SVR model based on radial basis function, and for our case, the key parameters are C cost, epsilon and gamma. The C cost sets the parameter C in SVR. The epsilon parameter sets 3 in 3-insensitive loss function. The gamma parameter sets the g
value in kernel function. To find the optima values for C, epsilon, and gamma, we used the gridregression.py program included in the libsvm-2.91 package.
Results and discussion
Molecular docking
After virtual screening, the scoring function results of the top 10 compounds and the controls are listed in Table 1. The structures of the top two candidates (2-O-caffeoyl tartaric acid and mumefural) and the two controls (cAMP and RS-25344) are shown in Fig. 2. Scoring functions, Dock Score,
30was used to rank the docking results. 2-O-Caffeoyl tartaric acid and mume- fural both scored higher Dock Score than cAMP and RS-25344 and were, therefore, selected for further analysis.
Two-dimensional representation of TCM ligand docking conformations is shown in Fig. 3. In addition, the docking diagrams showed two divalent metal cations, Zn
2+and Mg
2+, which have been suggested as cofactors in cAMP hydrolysis. As illustrated in Fig. 3, both Zn
2+and Mg
2+ions acted as bridges for coordinating ligand interaction to Asp367 and Asp484. All TCM ligands and controls had electrostatic interactions to these two divalent metal cations suggesting the importance of this inter- action in ligand binding in PDE4D and the dependency of metal ions in ligand inhibitory activity.
As shown in docking diagrams (Fig. 3), the TCM ligand, 2-O- caffeoyl tartaric acid, and two controls had pi-pi interaction with
Table 1 Docking results of top TCM compounds, RS-25344, and cAMP
Name Dock Score PLP1 PLP2 PMF
2-O-Caffeoyl tartaric acid 1,013 39.69 34.42 170
Mumefural 1,009 12.47 13.82 156
Cyclic AMP
a974 32.35 25.43 133
2-O-Feruloyl tartaric acid 948 22.77 39.86 187
Kainic acid 837 4.26 3.87 123
Gallic acid 698 28.72 33.29 98
3,5-Dihydroxycinnamic acid 686 16.93 23.24 102
Caffeic acid 686 16.93 23.24 102
2,3-Dihydroxycinnamic acid 676 3.09 8.13 110
Cinnamic acid 672 14.65 18.94 106
Ferulic acid 672 24.28 29.42 110
RS-25344
a250 97.29 73.57 164
a
Control.
Fig. 2 The scaffolds of (a) 2-O-caffeoyl tartaric acid, (b) mumefural, (c) cAMP, (d) RS-25344.
ª The Royal Society of Chemistry 2011
residue Phe538. This interaction kept 2-O-caffeoyl tartaric acid close to key binding site residue Gln535, thereby hindering binding of natural substrate to PDE4D (Fig. 3(a)). Moreover, control RS-25344 had a hydrogen bond with Gln535, which further supported the importance of blocking substrate from interacting with this residue (Fig. 3(d)). The second ranking TCM compound, mumefural, did not share pi-pi interaction as observed in other compounds. However, mumefural had a hydrogen bonding with catalytic residue His 326 (Fig. 3(b)).
This hydrogen bonding interaction, which completely blocks catalytic process, suggests an alternate strategy of inhibiting PDE4D, in addition to mechanism observed with 2-caffeoyl tartaric acid.
Molecular dynamics simulation
After docking simulation, we performed molecular dynamics simulation to further analyze the interactions between protein and ligand. Root mean squared deviation (RMSD) was calcu- lated to study atomic fluctuations during MD simulation. Fig. 4 shows the whole molecule RMSD values for the control-PDE4D complexes and the two TCM ligand-PDE4D complexes. The control-PDE4D complexes stabilized after 10 ns at 1.4 A while the 2-O-caffeoyl tartaric acid complex stabilized after 14 ns. The mumefural-protein complex, in contrast, appears to remain fluctuating even after 20 ns of simulation run. However, comparison of binding site snapshots taken at different time period showed that the position of the binding site residues were
relatively unchanged (Fig. 5(a)) and that all receptor-ligand interaction were conserved during simulation run. For these reasons, we did not continue production run for mumefural- PDE4D complex after 20 ns.
As for ligand RMSD, all ligands, except 2-O-caffeoyl tartaric acid, stabilized after 3 ns (Fig. 4). Ligand 2-O-caffeoyl tartaric acid had dramatic atom fluctuations from 15 ns to 17 ns.
Fig. 3 Docking poses of (a) 2-O-caffeoyl tartaric acid, (b) mumefural, (c) cAMP, (d) RS-25344 in PDE4D cAMP binding site. Electrostatic interaction is shown as dashed line, and pi-pi interaction is shown as solid line.
Fig. 4 RMSD of PDE4D in complex with 2-O-caffeoyl tartaric acid, mumefural, cAMP, and RS-25344 respectively (top), and ligand RMSD of 2-O-caffeoyl tartaric acid, mumefural, cAMP, and RS-25344 respec- tively (bottom).
ª The Royal Society of Chemistry 2011
However, as shown in Fig. 5(b), this large atomic fluctuation was due to intrinsic rotation of benzene substructure and did not influence the existed electrostatic interactions and hydrogen bonds.
Shown in Fig. 6 are snapshots of different receptor-ligand complexes taken at the end of molecular dynamics simulation.
After 20 ns of simulation, only RS-25344 maintained the pi-pi interaction to Phe538 (Fig. 6(a)); all other receptor-ligand pi-pi interactions diminished (Video S1, Video S2). RS-25344 was co- crystallized with PDE4D and therefore, could induce protein conformation change prior to crystallography process to accommodate for binding.
In contrast to pi-pi interaction, electrostatic interactions of ligands to binding site residues, Asp367 and Asp484, through divalent metal cations, were still retained after MD simulation run (Fig. 6). Interactions of Zn
2+to Asp379 and Asp484 as well as interaction of Mg
2+to Asp484 were conserved throughout the MD simulation run (Video S1, Video S2). Furthermore, as shown in Fig. 7, with the exception of RS-25344 to Mg
2+,
electrostatic interactions of ligands to both cations were stabi- lized and have only minor fluctuations.
For hydrogen bond interactions, the average length and occupancy during the simulation run are summarized in Table 2.
The lengths of different hydrogen bonds with respect to time are shown in Fig. 8. Although from analyzing ligand docking poses, there was no interaction between cAMP or 2-O-cafeoyl tartaric acid to Gln535, this interaction appears during the MD simula- tion for these two compounds. With the exception of mumefural, all compounds had hydrogen bond interaction to Gln535 (Fig. 8) with an average distance of 3.0 A. Although mumefural did not have hydrogen bonding with Gln535, it established stable, continuous hydrogen bonding with the key active site residue His326 (Table 2). In addition to ligand hydrogen bonding to Fig. 5 The docking poses of PDE4D with (a) mumefural, (b) 2-O-caf-
feoyl tartaric acid at different time points of MD simulation.
Fig. 6 The docking poses for PDE4D with (a) 2-O-caffeoyl tartaric acid, (b) mumefural, (c) cAMP, (d) RS-25344 at 20 ns of MD. Yellow line indicates electrostatic interaction with a distance less than 2.5 A (except for (d) which the distance between RS-25344 and Zn
2+is 2.72 A). The dashed line indicates hydrogen bond interaction and the solid line represents pi-pi interaction.
Fig. 7 Distances of electrostatic bonds between Mg
2+, Zn
2+and (a) 2-O- caffeoyl tartaric acid, (b) mumefural, (c) cAMP and (d) RS-25344.
ª The Royal Society of Chemistry 2011
Gln535 and His326, several hydrogen bonding were observed between ligands and Tyr325, Asp367 and Asn375 (Table 2).
Fig. 8 shows that both mumefural and cAMP had fluctuated hydrogen bond interactions with Tyr325 (Video S1) which could hold ligand close to active site His326. Both controls formed stable hydrogen bond interactions with Tyr495. In addition, 2-O- cafeoyl tartaric acid formed strong hydrogen bond interaction with Asn375 after a short period of molecular dynamics simu- lation. These interactions were proposed to further stabilize the two TCM ligands in PDE4D binding pocket.
Genetic function approximation, multiple linear regression and support vector machine analysis
To select the most representative descriptors for building QSAR, genetic function approximation was used. The selected
descriptors were Dipole mag, Dipole Y, Jurs FNSA, Jurs PPSA3, Jurs RASA, Jurs RNCG, Jurs TASA, Jurs WPSA3, shadow X length and shadow Y length. The values of descriptors for the training set and the test set were shown in Supplementary Table S1.
These selected descriptors can be broadly divided into two groups: the electronic descriptors and the spatial descriptors. The spatial descriptors include Dipole mag (Dipole magnitude) and Dipole Y (Dipole moment in Y dimension), which are both dipole moment descriptors computed from partial atomic charges and atomic coordinates.
45In our case, the numerical values of the training set and the test were in the range of 0.55285 to 7.37706 for Dipole mag and in the range of 4.25548 to 6.024515 for Dipole Y. Most of the spatial descriptors are Jurs descriptors, which is dependant on atomic partial charges and solvent-accessible surface areas of individual atoms under investigation.
46The Jurs RASA and Jurs TASA are related to hydrophobic surface area whereas Jurs RNCG is related to negative charge. The other three Jurs descriptors, Jurs FNSA3, Jurs PPSA3 and Jurs WPSA3, are significantly dependent on partial charge states. Both Shadow Xlength and Shadow Ylength are shadow indices. These two descriptors stand for length of molecule projected on X and Y dimension respectively.
47With these ten descriptors selected by GFA, an MLR model was built using a training set of 51 compounds. A test set of 11 compounds was used to evaluate the built model. The following linear model was obtained:
predicted IC
50¼ 23:43 þ 0:1383 Dipole mag þ 0:0921
Dipole Y 31:708 Jurs FNSA3 0:2866 Jurs PPSA3 þ 28:543 Jurs RASA þ 20:607 Jurs RNCG 0:0330
Jurs TASA þ 0:691 Jurs WPSA3 þ 0:428
Shadow Xlength þ 0:528 Shadow Ylength
The same training set and test set were used to build and to test the SVM model. The statistical significances of the MLR and the SVM model are shown in Fig. 9. The training set square Table 2 H-bond interactions of PDE4D with 2-O-caffeoyl tartaric acid, mumefural, cAMP, and RS-25344
aLigand H-bond Ligand Atom Amino acid Max. distance Min. distance Average distance H-bond occupancy
2-O-Caffeoyl tartaric acid 1 O6 Asn375:HD22 5.94 1.70 2.24 91.36%
2 O28 Gln535:HE21 3.64 2.29 2.92 1.05%
3 O28 Gln535:HE22 3.99 2.08 3.03 6.24%
4 O31 Gln535:HE21 5.05 2.49 3.31 0.15%
5 O31 Gln535:HE22 4.10 2.14 2.83 8.84%
Mumefural 1 O13 TYR325:HH 5.42 1.75 3.30 37.06%
2 O17 TYR325:HH 4.34 1.92 3.20 8.39%
3 H10 HIS326:NE2 2.63 1.73 1.99 99.85%
4 H10 ASP367:OD2 3.11 2.20 2.68 11.24%
cAMP 1 O15 TYR325:HH 5.28 1.94 3.63 11.74%
2 N8 Tyr495:HH 3.29 2.16 2.75 10.84%
3 H29 Tyr495:OH 5.67 2.20 3.81 11.79%
4 H30 Tyr495:OH 4.54 2.15 2.96 34.17%
5 H29 Gln535:OE1 4.20 1.90 3.20 5.99%
6 H30 Gln535:OE1 4.04 1.79 2.86 29.72%
RS-25344 1 O21 Tyr495:HH 3.80 2.31 2.97 0.55%
2 O21 Gln535:HE22 3.56 2.01 2.53 46.15%
a