
4. Computational Details

4.3 Structure of the Model

Our fundamental idea is to use an auto-encoder to extract latent features from different molecular representations; in other words, to project each molecular representation into a more informative latent space.

Because we believe this latent space contains all the essential information about the molecule, we use these latent features to predict molecular properties. The model therefore consists of two parts: an auto-encoder that extracts the essential features, and a neural network that predicts logP from these latent features. We build two types of models, one using SMILES and one using the σ-profile as input, which we call the SMILES_logP [29] and σ_logP models, respectively. The two models differ in the input representation, the architecture of the auto-encoder, and the choice of loss function.

Because the SMILES representation expresses the sequential relationship and connectivity between atoms, we choose an RNN architecture, which is well suited to sequential data, as the auto-encoder in the SMILES_logP model. For the σ-profile, we simply choose a DNN architecture as the auto-encoder.
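To make this two-part structure concrete, the sketch below outlines a σ_logP-style model in PyTorch: a small DNN auto-encoder that compresses a σ-profile vector into latent features, plus a feed-forward head that predicts logP from those features. The layer widths, the latent dimension, and the 51-bin σ-profile length are illustrative assumptions rather than the settings used in this work; the SMILES_logP variant would replace the encoder and decoder with RNN layers over one-hot character sequences.

import torch
import torch.nn as nn

class SigmaLogPModel(nn.Module):
    """DNN auto-encoder for a sigma-profile plus a logP regression head.

    Layer sizes, latent dimension, and the 51-bin input length are
    illustrative assumptions, not the settings of this work.
    """
    def __init__(self, n_bins=51, latent_dim=16):
        super().__init__()
        # Encoder: sigma-profile -> latent features
        self.encoder = nn.Sequential(
            nn.Linear(n_bins, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: latent features -> reconstructed sigma-profile
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, n_bins),
        )
        # Property head: latent features -> logP
        self.logp_head = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent features
        x_recon = self.decoder(z)    # reconstruction (compared with the input)
        logp = self.logp_head(z)     # predicted logP
        return x_recon, logp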


Figure 12. SMILES_logP and σ_logP model.

When we encode SMILES as a one-hot vector and then use the auto-encoder to extract the latent features, the value of each element of the reconstructed vector represents a probability, so the reconstruction is a classification problem and the loss function we choose is cross entropy. In contrast, the value of each element of the σ-profile vector represents a real surface area, so we choose the mean square error as the loss function. In addition, predicting logP from the latent features is a regression problem in both models, so the mean square error is again the most suitable loss function. The optimizer we choose in this work is Adam, which is the most commonly used algorithm and gave the best performance.
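A minimal training-step sketch, continuing the hypothetical σ_logP module above: both the reconstruction loss (Loss_1) and the logP regression loss (Loss_2) are mean square errors minimised with Adam, and for the SMILES_logP variant the reconstruction term would be a cross entropy over the one-hot character probabilities instead. The learning rate, the equal weighting of the two losses, and the joint (rather than sequential) training of the two parts are assumptions made only for this illustration.

import torch
import torch.nn as nn

model = SigmaLogPModel()                          # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
recon_criterion = nn.MSELoss()                    # cross entropy in the SMILES_logP case
logp_criterion = nn.MSELoss()                     # logP regression is MSE in both models

def train_step(sigma_profiles, logp_targets):
    """One optimisation step; the 1:1 weighting of the two losses is an assumption."""
    optimizer.zero_grad()
    recon, logp_pred = model(sigma_profiles)
    loss_1 = recon_criterion(recon, sigma_profiles)               # auto-encoder loss (Loss_1)
    loss_2 = logp_criterion(logp_pred.squeeze(-1), logp_targets)  # property loss (Loss_2)
    loss = loss_1 + loss_2
    loss.backward()
    optimizer.step()
    return loss_1.item(), loss_2.item()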


5. Results and Discussions

First, we apply the SMILES_logP model to both the GDB-11 and Sigma databases to examine how the molecular-size distributions of the training and test sets affect the neural network. We also use the ensemble method to address the overfitting problem. Finally, we apply both the SMILES_logP and σ_logP models to the Sigma database to compare the performance of the two molecular representations.

In the following tables, Loss_1 is the average loss of the auto-encoder when extracting the latent features, and Loss_2 is the average loss of the neural network predicting logP from the latent features during training. As mentioned above, we choose cross entropy for the SMILES auto-encoder but MSE for the σ-profile auto-encoder. For performance on the test data, we report the root mean square error (RMSE) and the correlation coefficient R² as metrics; the smaller the RMSE and the closer R² is to 1, the better.

\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^{2}} \quad (5)

R^{2} = \left(\frac{N\sum y_i\hat{y}_i-\left(\sum y_i\right)\left(\sum \hat{y}_i\right)}{\sqrt{\left[N\sum y_i^{2}-\left(\sum y_i\right)^{2}\right]\left[N\sum \hat{y}_i^{2}-\left(\sum \hat{y}_i\right)^{2}\right]}}\right)^{2} \quad (6)

where y is the predicted value and ŷ is the expected value.
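For reference, a small NumPy sketch computing the two metrics exactly as in Eqs. (5) and (6):

import numpy as np

def rmse(y, y_hat):
    """Root mean square error, Eq. (5)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r_squared(y, y_hat):
    """Squared Pearson correlation coefficient, Eq. (6)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    n = len(y)
    num = n * np.sum(y * y_hat) - np.sum(y) * np.sum(y_hat)
    den = np.sqrt((n * np.sum(y ** 2) - np.sum(y) ** 2) *
                  (n * np.sum(y_hat ** 2) - np.sum(y_hat) ** 2))
    return (num / den) ** 2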


Table 2. Results of SMILES_logP model on GDB-11 database

Database    Model         Loss_1 (cross entropy)    Loss_2 (MSE)    RMSE      R²
GDB-11      SMILES_logP   0.0018                    0.1389          0.9461    0.4660

Figure 13. SMILES_logP model results on test data of GDB-11 database.

Table 3. Results of SMILES_logP model on GDB-11 database (ensemble)

Database    Model                    Loss_1 (cross entropy)    Loss_2 (MSE)    RMSE      R²
GDB-11      SMILES_logP (ensemble)   –                         –               0.4076    0.8433

Figure 14. SMILES_logP model results on test data of GDB-11 database (ensemble).


Table 4. Results of SMILES_logP model on Sigma database

Database    Model         Loss_1 (cross entropy)    Loss_2 (MSE)    RMSE      R²
Sigma       SMILES_logP   0.0340                    0.3478          0.5741    0.7989

Figure 15. SMILES_logP model results on test data of Sigma database.

Table 5. Results of σ_logP model on Sigma database

Database    Model     Loss_1 (MSE)    Loss_2 (MSE)    RMSE      R²
Sigma       σ_logP    3.4306          0.2709          0.5177    0.8362

Figure 16. σ_logP model results on test data of Sigma database.


5.1 Different Distribution of Molecular Size

From Figure 13, we find that the trained model performs differently when predicting molecules of different sizes. For molecules with 8 heavy atoms, the predicted and expected values lie roughly on the diagonal, while molecules with 10 and 11 heavy atoms show large deviations. The reason the trained model performs best on molecules with 8 heavy atoms is the training set we chose: we deliberately let the training set contain only molecules with 8 heavy atoms. This verifies that the machine really learns details and features that belong only to molecules of that specific size. When we try to generalize the model to molecules of different sizes, it fails, and the greater the difference in molecular size, the worse the performance.

On the other hand, from Figure 15, we find that the trained model performs roughly the same when predicting molecules of different sizes, apart from a few outliers, because we made the molecular-size distribution of the training set broader and more diverse. We also checked that the molecular-size distributions of the training and test sets are roughly the same. Therefore, the machine can learn more diverse and general features that belong to molecules of all sizes, and we obtain a more general model.
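Such a distribution check can be done before training; a minimal sketch using RDKit is shown below (the choice of RDKit, and the hypothetical train_smiles and test_smiles lists, are assumptions; any toolkit that counts heavy atoms would do).

from collections import Counter
from rdkit import Chem  # toolkit choice is an assumption

def size_distribution(smiles_list):
    """Count how many molecules have each number of heavy atoms."""
    counts = Counter()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            counts[mol.GetNumHeavyAtoms()] += 1
    return dict(sorted(counts.items()))

# Compare the two sets before training (hypothetical lists):
# print(size_distribution(train_smiles))
# print(size_distribution(test_smiles))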

5.2 Overfitting and the Ensemble Method

When the trained model cannot generalize to molecules of different sizes, overfitting has occurred: the machine has learned specific and limited features rather than diverse and general ones. Fortunately, we can use the ensemble method to deal with this problem. We train four different models on the same data; the predictions of the individual models have a small bias but a large variance. We then average the predicted values of the models to obtain the final result.

Theoretically, this averaging reduces the large variance and yields better performance. From Figure 14, we find that the ensemble method worked and the overfitting problem was successfully resolved.
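A minimal sketch of such an ensemble, assuming the four members share the same architecture and training data and differ only in their random initialisation (make_model and train_fn are hypothetical helpers):

import torch

def train_ensemble(make_model, train_fn, data, n_members=4):
    """Train n_members independent models on the same data."""
    models = []
    for _ in range(n_members):
        model = make_model()      # fresh random initialisation for each member
        train_fn(model, data)     # same training data for every member
        models.append(model)
    return models

def ensemble_predict(models, x):
    """Average the members' predictions to reduce variance."""
    with torch.no_grad():
        # index 1 selects the logP output of the two-output model sketched earlier
        preds = torch.stack([m(x)[1] for m in models])
    return preds.mean(dim=0)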

5.3 Different Type of Molecular Representations

We use two different types of auto-encoder on the Sigma database, one taking SMILES as input and the other taking the σ-profile, to examine which molecular representation better expresses the molecule, in particular the electronic interactions that determine most of its properties. From Figures 15 and 16, the results show that selecting the σ-profile as input gives better performance than SMILES. The reason is that the σ-profile is obtained from a preliminary quantum calculation: the electronic interactions of a molecule can be computed from the types of atoms and the connectivity between them. SMILES expresses only the atom types and connectivity, whereas the σ-profile expresses the electronic interactions, so the σ-profile represents a molecule better than SMILES. From the perspective of machine learning, the σ-profile is in fact the result of applying feature engineering by human intelligence to the information in SMILES: the process discards redundant information and yields a more representative expression.


6. Conclusion

From the above results, we conclude that the auto-encoder is a useful way to extract latent features from different molecular representations, and that using these latent features to predict molecular properties is feasible. Furthermore, we believe the latent space contains all the necessary information about the molecule, so that multiple properties can be transformed bidirectionally between one another.

We also verify that the machine learns from what we give it; therefore, how the database is split is important. The model does not work if the training set contains only molecules of a specific size.

Before we actually use the model to predict new molecules, we cannot know their size distribution; in most cases it is assumed to be similar to that of the test set. Therefore, we must make sure that the molecular-size distribution of the training set matches that of the test set. If the training set contains only specific molecules, the machine will learn limited features; as a consequence, we cannot obtain a general model and have no way to predict molecules of different sizes, so the trained model is useless.

If there are too many iterations while optimizing the parameters to minimize the loss function, the model may focus on specific details that belong only to the training data. This can lead to an over-complex model and cause the overfitting problem; in other words, in the trade-off between optimization and generalization we obtain a model that cannot generalize. When this unfortunately happens, the ensemble method is a powerful way to deal with it. The reason the method works is that overfitting is caused by an over-complex model, which has low bias but high variance; averaging reduces the variance, and we can expect better performance.

Moreover, using the σ-profile as the molecular representation gives better performance than SMILES. This is because the σ-profile describes the electrostatic potential of the molecule obtained from a preliminary quantum calculation, while SMILES expresses only the types of atoms that make up the molecule and the connectivity between them. Most properties of a molecule are determined by its electronic interactions, so the σ-profile is more physically meaningful and representative than SMILES. Feature engineering means discarding redundant information and acquiring a more representative expression, and it can be done either by human or by artificial intelligence. The preliminary quantum calculation is feature engineering by human intelligence, and the auto-encoder is feature engineering by artificial intelligence. The results show that if we first do the feature engineering with human intelligence, we can expect better performance.


7. Future Work

The main concern and criticism about machine learning lie in the so-called black box: we cannot clearly indicate what the machine has learned or why it makes a particular prediction. We believe this problem can be addressed by further investigating the latent space, which, as mentioned above, contains and implies all the necessary and essential information about the molecules. We borrow a concept from natural language processing: when translating between languages, one technique uses an RNN to project each word into a high-dimensional space, the so-called semantic space, in which corresponding words in different languages are projected to the same point. Inspired by language translation, we therefore propose a so-called molecular translation. We believe there exists a high-dimensional space that contains all the necessary information about a molecule, analogous, for instance, to the wave function in quantum mechanics or the partition function in statistical mechanics.

We believe that all the properties of a molecule or material can be derived from this space, which we call the Schrödinger space. In this paper, we have used auto-encoders to project SMILES and the σ-profile into the Schrödinger space. If we can show that similar molecules end up close to each other in the Schrödinger space no matter which representations or properties the auto-encoder uses to extract features, we can confirm that the machine has really learned some underlying physical pattern and that the assumption of a Schrödinger space is feasible.
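One way to test this assumption would be to check whether a molecule's nearest neighbours in one latent space are also its nearest neighbours in the other; a hedged sketch using scikit-learn follows (the library choice and the value of k are assumptions):

import numpy as np
from sklearn.neighbors import NearestNeighbors  # library choice is an assumption

def neighbourhood_overlap(latent_a, latent_b, k=10):
    """Average overlap of k-nearest-neighbour sets between two latent spaces.

    latent_a, latent_b: (n_molecules, latent_dim) embeddings of the same
    molecules from two different auto-encoders (e.g. SMILES vs sigma-profile).
    A value near 1 suggests both spaces place similar molecules close together.
    """
    nn_a = NearestNeighbors(n_neighbors=k + 1).fit(latent_a)
    nn_b = NearestNeighbors(n_neighbors=k + 1).fit(latent_b)
    idx_a = nn_a.kneighbors(latent_a, return_distance=False)[:, 1:]  # drop self
    idx_b = nn_b.kneighbors(latent_b, return_distance=False)[:, 1:]
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(idx_a, idx_b)]
    return float(np.mean(overlaps))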

The black-box criticism could then be answered. Furthermore, we could use multiple properties and representations simultaneously to obtain a more accurate Schrödinger space, and thus convert effectively and bidirectionally between multiple molecular properties.

Figure 17. The concept of language and molecular translation.


8. References

[1] Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., & Feuston, B. P. (2003). Random forest: A classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6), 1947-1958.

[2] Dudek, A. Z., Arodz, T., & Galvez, J. (2006). Computational methods in developing quantitative structure-activity relationships (QSAR): A review. Combinatorial Chemistry & High Throughput Screening, 9(3), 213-228.

[3] Varnek, A., & Baskin, I. I. (2011). Chemoinformatics as a Theoretical Chemistry Discipline. Molecular Informatics, 30(1), 20-32.

[4] Varnek, A., & Baskin, I. (2012). Machine Learning Methods for Property Prediction in Chemoinformatics: Quo Vadis? Journal of Chemical Information and Modeling, 52(6), 1413-1437.

[5] Mitchell, J. B. O. (2014). Machine learning methods in chemoinformatics. Wiley Interdisciplinary Reviews-Computational Molecular Science, 4(5), 468-481.

[6] Ma, J. S., Sheridan, R. P., Liaw, A., Dahl, G. E., & Svetnik, V. (2015). Deep Neural Nets as a Method for Quantitative Structure-Activity Relationships. Journal of Chemical Information and Modeling, 55(2), 263-274.

[7] Duvenaud, D. K., et al. (2015). Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems (Vol. 28, pp. 2224-2232).

[8] Rogers, D., & Hahn, M. (2010). Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling, 50(5), 742-754.

[9] Wu, Z. Q., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., ... Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2), 513-530.

[10] Jeon, W., & Kim, D. (2019). FP2VEC: a new molecular featurizer for learning molecular properties. Bioinformatics, 35(23), 4979-4985.

[11] Turney, P. D., & Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 141-188.

[12] Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2), 245-271.

[13] Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O., & Walsh, A. (2018). Machine learning for molecular and materials science. Nature, 559(7715), 547-555.

[14] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

[15] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.


[16] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.

[17] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

[18] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121-2159.

[19] Dietterich, T. G. (2000). Ensemble methods in machine learning. In J. Kittler & F. Roli (Eds.), Multiple Classifier Systems (Vol. 1857, pp. 1-15).

[20] Zhou, Z. H., Wu, J. X., & Tang, W. (2002). Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137(1-2), 239-263.

[21] Williams, R. J., & Zipser, D. (1989). A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1(2), 270-280.

[22] Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. Ieee Transactions on Signal Processing, 45(11), 2673-2681.

[23] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

[24] Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.

[25] Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1), 31-36.

[26] Klamt, A., & Schuurmann, G. (1993). COSMO: a new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. Journal of the Chemical Society, Perkin Transactions 2, (5), 799-805.

[27] Fink, T., Bruggesser, H., & Reymond, J. L. (2005). Virtual exploration of the small-molecule chemical universe below 160 daltons. Angewandte Chemie-International Edition, 44(10), 1504-1508.

[28] Fink, T., & Reymond, J. L. (2007). Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physico-chemical properties, compound classes and drug discovery. Journal of Chemical Information and Modeling, 47, 342-353.

[29] Sattarov, B., Baskin, I. I., Horvath, D., Marcou, G., Bjerrum, E. J., & Varnek, A. (2019). De Novo Molecular Design by Combining Deep Autoencoder Recurrent Neural Networks with Generative Topographic Mapping. Journal of Chemical Information and Modeling, 59(3), 1182-1196.
