Chapter 4 Discussion
4.3 Limitations
Rare and imbalanced data are among the biggest challenges for machine learning.
Here we proposed a method to enlarge the training data and showed that the artificial data could be used for machine learning. However, we could not confirm whether the OA data actually follow a normal distribution. If the true data distribution does not fit our hypothesis, the augmented training dataset may not be suitable for training the model. Ideally, enough real data would be collected to train the model.
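The following is a minimal sketch of augmentation under the normality hypothesis, not the exact procedure used in this study. The function names, the per-compound independence, and the example values are assumptions made for illustration. Each feature is sampled from a normal distribution fitted to the real samples, and a simple skewness statistic gives a rough indication of whether the hypothesis is plausible.

```python
import random
import statistics


def augment_normal(samples, n_new, seed=0):
    """Draw n_new synthetic vectors, one N(mean, std) per feature.

    Assumes each feature (compound abundance) is independent and
    normally distributed, which is exactly the hypothesis in question.
    """
    rng = random.Random(seed)
    columns = list(zip(*samples))  # one tuple of values per feature
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    synthetic = []
    for _ in range(n_new):
        # Clip at zero because an abundance cannot be negative.
        synthetic.append([max(0.0, rng.gauss(m, s)) for m, s in zip(means, stds)])
    return synthetic


def sample_skewness(values):
    """Moment-based skewness; close to 0 for symmetric data.

    Requires at least two distinct values (non-zero spread).
    """
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    n = len(values)
    return sum((v - m) ** 3 for v in values) / (n * s ** 3)


# Hypothetical abundances of two compounds in four real samples.
real = [[10.0, 0.5], [12.0, 0.7], [11.0, 0.6], [13.0, 0.4]]
synthetic = augment_normal(real, n_new=5)
```

A skewness far from zero for a compound would suggest that a normal model is a poor fit for that compound and that the synthetic samples may misrepresent it.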
Currently, our model supports binary classification for disease discrimination and multiclass single-label classification for disease categorization. However, some complicated cases, such as the chromatogram shown below, which combines MMA and MDS, are beyond our model's capability, because it cannot recognize two diseases in one sample.
Figure 19 Complicated OA case
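The limitation can be illustrated with a small sketch. It assumes a network that outputs one raw score (logit) per disease; the class list and the logit values are hypothetical. A softmax output with argmax always returns exactly one label, whereas independent sigmoid outputs with a threshold (a common multi-label formulation, not the one used in this study) can return several.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent probability for a single class."""
    return 1.0 / (1.0 + math.exp(-z))


def single_label(logits, classes):
    """Softmax + argmax: exactly one disease per sample."""
    probs = softmax(logits)
    return classes[probs.index(max(probs))]


def multi_label(logits, classes, threshold=0.5):
    """Sigmoid + threshold: zero, one, or several diseases per sample."""
    return [c for c, z in zip(classes, logits) if sigmoid(z) >= threshold]


classes = ["MMA", "MDS", "OTHER"]  # "OTHER" is a placeholder class
logits = [2.0, 1.5, -3.0]          # hypothetical network outputs

print(single_label(logits, classes))  # MMA
print(multi_label(logits, classes))   # ['MMA', 'MDS']
```

Training such a multi-label output would also require samples annotated with more than one disease, which are even rarer than single-disease cases.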
A chromatogram is like a fingerprint: no two chromatograms are identical, even when the same urine sample is run on the same column in the same GC-MS instrument, which makes machine learning difficult. In addition, organic acidemias are rare diseases with very limited data, which makes this study more challenging. Although individual cases of organic acidemias are rare, overall they account for a substantial number of inherited metabolic disorders.
Organic acidemias and aminoacidopathies include a variety of inborn errors of metabolism that are caused by defects in the intermediary metabolic pathways of carbohydrates, amino acids, and fatty acid oxidation. These defects can lead to the abnormal accumulation of organic acids and amino acids in multiple organs, including the brain. Early diagnosis is mandatory to initiate therapy and prevent permanent long-term neurologic impairments or death. All these facts make this study difficult and challenging, but worthwhile and important.
4.4 Future works
The results presented in the previous section reveal that, among the three machine learning models used to classify 13 kinds of OA, the DNN achieved the highest F1 score (0.90) and micro-averaged AUC (0.98). It showed the most promise in detecting OA patterns in urine samples, demonstrating that OA compound patterns can be learned effectively by deep neural networks from raw GC-MS data.
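For reference, the F1 score of a class is the harmonic mean of its precision and recall, F1 = 2TP / (2TP + FP + FN). The sketch below computes per-class F1 and its unweighted (macro) average from label lists. The averaging scheme and the example labels are assumptions for illustration; they are not taken from the experiments.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical labels for six samples.
y_true = ["MMA", "MMA", "PA", "PA", "normal", "normal"]
y_pred = ["MMA", "PA", "PA", "PA", "normal", "normal"]

f1 = per_class_f1(y_true, y_pred)
```

With imbalanced classes, a macro average weights every disease equally, so a rare class that is classified poorly lowers the score as much as a common one.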
Because the compounds in the DNN dataset are limited to 34 kinds, AADC, AKU, DA and MSUD were excluded, since their target compounds are not among the 34 listed compounds. We may add their target compounds to the GC-MS test process in the future so that more OAs can be tested at one time.
The current augmentation method for the CNN only shifts, shears or zooms the original images, and it does not work well. We may try SMOTE (Synthetic Minority Over-sampling Technique) or other augmentation algorithms in the future.
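The core idea of SMOTE is to create a synthetic minority sample by interpolating between a real minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified variant written for illustration: it picks the base sample at random rather than iterating over every minority sample, and the feature vectors are hypothetical.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority vectors.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base sample.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist2(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical feature vectors of a rare disease class.
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 3.0]]
new_points = smote(minority, n_new=10)
```

Because each synthetic point lies on a segment between two real samples, this approach operates on feature vectors such as the DNN compound inputs; whether interpolation is meaningful for chromatogram images would have to be verified.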
Enhancing our computing resources is also feasible, because limits on both time and computing power restricted our study of some complicated network architectures. We have applied for free resources from NCHC (National Center for High-Performance Computing) for further study.
4.5 Conclusion
These results are encouraging, but become useful only when a high specificity, i.e. low false positives, is also observed in the confusion matrix (Skarysz, Alkhalifah et al. 2018).
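Specificity for each class can be read from a multiclass confusion matrix as TN / (TN + FP), treating that class as positive and all others as negative. The following sketch shows the calculation; the matrix values are hypothetical and are not results from this study.

```python
def per_class_specificity(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    result = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                 # true class k, predicted as another
        fp = sum(row[k] for row in cm) - tp  # another class, predicted as k
        tn = total - tp - fn - fp
        result.append(tn / (tn + fp) if (tn + fp) else float("nan"))
    return result


# Hypothetical 3-class confusion matrix (rows: true, columns: predicted).
cm = [
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
]
specificity = per_class_specificity(cm)
```

For this example matrix the three specificities are 0.90, 0.95 and 0.90.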
Machine learning methods, especially neural networks, were applied to learn and detect OA compounds directly from raw GC-MS data. Due to the high variability, noise, and high dimensionality of GC-MS data, the application of machine learning presents considerable challenges (Skarysz, Alkhalifah et al. 2018). As far as we know, this is the first attempt to use neural networks to discriminate and classify organic acidemia patterns from raw GC-MS data. The complex and noisy patterns present in GC-MS data, derived from urine samples collected from clinical cases in the NTUH database, were used to train a convolutional neural network, a deep neural network, a support vector machine and a random forest classifier. The deep neural network achieved the best performance. Additionally, the proposed methodology can be used to speed up diagnostic processes. The proposed approach has the potential to contribute significantly to the development of a diagnostic platform that detects various diseases quickly, efficiently, and reliably.
Chapter 5 References
Baumgartner, M. R., et al. (2014). "Proposed guidelines for the diagnosis and management of methylmalonic and propionic acidemia." Orphanet J Rare Dis 9: 130.
Bertini, I., et al. (2009). "The metabonomic signature of celiac disease." J Proteome Res 8(1): 170-177.
Chang, C.-C. and C.-J. Lin (2011). "LIBSVM: A library for support vector machines." ACM Transactions on Intelligent Systems and Technology 2(3): 1-27.
Cortes, C. and V. Vapnik (1995). "Support-Vector Networks." Machine Learning 20(3): 273-297.
Cuperlovic-Culf, M. (2018). "Machine Learning Methods for Analysis of Metabolic Data and Metabolic Pathway Modeling." Metabolites 8(1).
Date, Y. and J. Kikuchi (2018). "Application of a Deep Neural Network to Metabolomics Studies and Its Performance in Determining Important Variables." Anal Chem 90(3): 1805-1810.
Dionisi-Vici, C., et al. (2006). "'Classical' organic acidurias, propionic aciduria, methylmalonic aciduria and isovaleric aciduria: long-term outcome and effects of expanded newborn screening using tandem mass spectrometry." J Inherit Metab Dis 29(2-3): 383-389.
Fathi, F., et al. (2014). "1H NMR based metabolic profiling in Crohn's disease by random forest methodology." Magn Reson Chem 52(7): 370-376.
Gomes, H. M., et al. (2017). "Adaptive random forests for evolving data stream classification." Machine Learning 106(9): 1469-1495.
Hiller, K., et al. (2009). "MetaboliteDetector: comprehensive analysis tool for targeted and nontargeted GC/MS based metabolome analysis." Anal Chem 81(9): 3429-3439.
Kimura, M., et al. (1999). "Automated metabolic profiling and interpretation of GC/MS data for organic acidemia screening: a personal computer-based system." Tohoku J Exp Med 188(4): 317-334.
Mahadevan, S., et al. (2008). "Analysis of metabolomic data using support vector machines." Anal Chem 80(19): 7562-7570.
McGarry, K., et al. (2008). Exploratory Data Analysis for Investigating GC-MS Biomarkers, Berlin, Heidelberg, Springer Berlin Heidelberg.
Naccarato, A., et al. (2014). "A fast and simple solid phase microextraction coupled with gas chromatography-triple quadrupole mass spectrometry method for the assay of urinary markers of glutaric acidemias." J Chromatogr A 1372c: 253-259.
Ravera, E., et al. (2014). "DNP-Enhanced MAS NMR of Bovine Serum Albumin Sediments and Solutions." The Journal of Physical Chemistry B 118(11): 2957-2965.
Simonyan, K. and A. Zisserman (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition.
Skarysz, A., et al. (2018). Convolutional neural networks for automated targeted analysis of raw gas chromatography-mass spectrometry data. 2018 International Joint Conference on Neural Networks (IJCNN).
Spiekerkoetter, U., et al. (2010). "Current issues regarding treatment of mitochondrial fatty acid oxidation disorders." J Inherit Metab Dis 33(5): 555-561.
Stanley, C. A., et al. (2006). Disorders of Mitochondrial Fatty Acid Oxidation and Related Metabolic Pathways. Inborn Metabolic Diseases: Diagnosis and Treatment. J. Fernandes, J.-M. Saudubray, G. van den Berghe and J. H. Walter. Berlin, Heidelberg, Springer Berlin Heidelberg: 175-190.
Tanaka, K., et al. (1980). "Gas-chromatographic method of analysis for urinary organic acids. II. Description of the procedure, and its application to diagnosis of patients with organic acidurias." Clin Chem 26(13): 1847-1853.
Tsai, I. J., et al. (2014). "Efficacy and safety of intermittent hemodialysis in infants and young children with inborn errors of metabolism." Pediatr Nephrol 29(1): 111-116.
Vapnik, V., et al. (1996). Support vector method for function approximation, regression estimation and signal processing. Proceedings of the 9th International Conference on Neural Information Processing Systems. Denver, Colorado, MIT Press: 281-287.
Wendel, U. and H. Ogier de Baulny (2006). Branched-Chain Organic Acidurias/Acidemias. Inborn Metabolic Diseases: Diagnosis and Treatment. J. Fernandes, J.-M. Saudubray, G. van den Berghe and J. H. Walter. Berlin, Heidelberg, Springer Berlin Heidelberg: 245-262.
Zhou, B., et al. (2010). "SVM-based spectral matching for metabolite identification." Conf Proc IEEE Eng Med Biol Soc 2010: 756-759.
FROM THE INTERNET
Figure 2. The demonstration of Support Vector Machine (http://i.imgur.com/WuxyO.png)
Figure 3. The demonstration of Random Forest (https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d)
Figure 4. The demonstration of Deep Neural Networks (http://neuralnetworksanddeeplearning.com/chap6.html)
Figure 5. The demonstration of Convolution Neural Networks (https://blogs.sap.com/2015/01/14/image-classification-with-convolutional-neural-networks-my-attempt-at-the-ndsb-kaggle-competition/)
MANUAL
Figure 6. A concept flowchart of process. Thermo Fisher Scientific: GC-MS 概論 (Introduction to GC-MS), p.8
Figure 7. GC-MS data handling. Thermo Fisher Scientific: GC-MS 概論 (Introduction to GC-MS), p.2
Figure 8. GC-MS retention time, m/z and abundance. Thermo Fisher Scientific: GC-MS 概論 (Introduction to GC-MS), p.43