Chapter 1 Introduction
1.2 Difficulties of the interpretation for organic acidemia
The diagnosis of OAs is usually based on the identification of abnormal metabolic compounds by GC-MS analysis of urine (Naccarato, Gionfriddo et al. 2014). Compounds are identified from their retention time (RT) and mass spectrum (Figure 1). In general, each urine sample contains 10 to 30 compounds that need to be identified, and labeling and checking these compounds is laborious. In addition, the diagnosis of a specific OA relies on a combination of identified compounds, and grouping them takes additional time. Because of the complexity of this work, applying artificial intelligence (AI) in this field would be very helpful. Moreover, because the test and the diseases are so specific, the task is well suited to a dedicated machine learning model.
Figure 1 A chromatogram of a urine sample detected by GC-MS system.
1.3 Machine learning tools
This section covers an overview of the machine learning methods that we applied to GC-MS raw data. We assumed that these methods could be employed to learn OA patterns directly from the raw data. In this study, we propose the use of convolutional neural networks (CNNs), support vector machines (SVMs), random forest (RF) and deep neural networks (DNNs) to learn patterns directly from the dataset.
1.3.1 Support vector machines (SVMs)
Support vector machines (SVMs) are a popular machine learning method that has been widely and successfully used in classification tasks (Cortes and Vapnik 1995, Vapnik, Golowich et al. 1996, Chang and Lin 2011). During optimization, an SVM searches a high-dimensional feature space for the hyperplane that separates the instances of different classes with the maximal margin. One limitation of SVMs is the difficulty of choosing the best kernel, because kernel performance depends on the type of data. Nevertheless, given their success, SVMs were also selected so that their performance could be compared with that of CNNs, RF and DNNs on GC-MS data (Skarysz, Alkhalifah et al. 2018).
Figure 2 The demonstration of Support Vector Machine (http://i.imgur.com/WuxyO.png)
1.3.2 Random forest (RF)
Random forest (RF) is currently one of the most widely used machine learning algorithms (Gomes, Bifet et al. 2017) because of its simplicity and because it can be used for both classification and regression tasks. It is also a flexible, easy-to-use algorithm that often produces good results even without hyper-parameter tuning.
Figure 3 The demonstration of Random Forest
(https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d)
1.3.3 Deep neural networks (DNNs)
Traditional neural networks (NNs) are widely used to learn a mapping from input to output for both classification and regression tasks. A deep neural network consists of an input layer of neurons, several hidden layers of neurons, and a final layer of output neurons. In this respect, a DNN has a much simpler structure than a CNN, but it is often effective on simple problems. DNNs are known for their ability to learn patterns, and for this reason they were chosen here as a method for comparison with CNNs. Given the popularity of DNNs in machine learning, we omit further general notions and refer to the literature for an overview of neural networks for classification (Skarysz, Alkhalifah et al. 2018).
Figure 4 The demonstration of Deep Neural Networks (http://neuralnetworksanddeeplearning.com/chap6.html)
1.3.4 Convolutional neural networks (CNNs)
Deep learning techniques, especially convolutional neural networks (CNNs), have demonstrated excellent performance in image recognition and classification tasks. CNNs can autonomously learn useful features directly from low-level data, e.g. pixels, and construct high-level features without human intervention. CNNs can also exploit geometrical properties of the data and are less affected by noise than other techniques. Additionally, the growing use of GC-MS as a diagnostic technology will increase the number of available datasets, and a large amount of data is known to benefit the training of deep neural networks. GPU computing and dedicated hardware, which have developed rapidly in recent years, can help process the large amount of data collected through GC-MS (Skarysz, Alkhalifah et al. 2018).
Figure 5 The demonstration of Convolutional Neural Networks
(https://blogs.sap.com/2015/01/14/image-classification-with-convolutional-neural-networks-my-attempt-at-the-ndsb-kaggle-competition/)
Despite the popularity of the methods described above and their achievements in various fields, their applications in the area of metabolomics are limited and almost exclusively concern preprocessed data. (Skarysz, Alkhalifah et al. 2018)
1.4 Application of deep learning in metabolomics
The diagnosis of OA can be regarded as the detection of an abnormal metabolic profile. Metabolomics has been identified as one of the fields of chemistry and biology in need of machine learning (Date and Kikuchi 2018). Machine learning approaches have been successfully applied to nuclear magnetic resonance (NMR)-based metabolomics studies of several diseases, such as Streptococcus pneumoniae infection, devil facial tumor disease, chronic obstructive pulmonary disease, Crohn's disease, and celiac disease (Mahadevan, Shah et al. 2008, Bertini, Calabro et al. 2009, Fathi, Majari-Kasmaee et al. 2014, Ravera, Corzilius et al. 2014). Meanwhile, different machine and deep learning approaches and algorithms have been used to improve the accuracy of determination, such as regression algorithms, Bayesian algorithms, dimensionality reduction algorithms, knowledge discovery by accuracy maximization (KODAMA) and deep neural networks (DNNs) (Cuperlovic-Culf 2018, Date and Kikuchi 2018). There has been an attempt to explore the feasibility of automated GC-MS biomarker detection, but its sample size was only 22 normal and 13 abnormal samples (McGarry, Bartlett et al. 2008). Since OA diagnosis is part of standard clinical practice, a module that assists the interpretation of OA would be very useful.
1.5 Aim
In this study, we set up a model to assist the interpretation of OA via machine learning, to facilitate the application of artificial intelligence in health.
Chapter 2 Materials and Methods
This chapter describes how we processed the data and materials. The overall concept is shown in Figure 6.
Figure 6 A concept flowchart of process (Source: ThermoFisher manual)
2.1 GC-MS data
GC-MS is a widely applied technique in many branches of science and technology and is the gold standard for biomarker discovery. GC-MS was performed using the standard method. To 1 mL of urine (spiked with standards at different concentrations), 200-250 nmol of each standard and then 20 μL of internal standard (10 mM tricarballylic acid) were added. Next, 100 μL of 5% hydroxylamine HCl (pH 7) was added for derivatization at room temperature for 20 minutes to convert keto groups to oxime groups, after which 2 mL of deionized water was added. Solid phase extraction was performed on a Zymark extractor using Waters Sep-Pak 3 (500 mg) Accell™ Plus QMA cartridges. The column was conditioned with 3 mL of ethanol and 3 mL of deionized water, followed by the prepared urine. The column was then washed with 6 mL of 20% ethanol and 1 mL of hexane, and finally eluted with 2 mL of ethanol containing 5% formic acid. The flow rate was 2-3 mL/min. Then 20 μL of tetracosane was added to the collected eluent, which was dried under N2 at 40 °C. To the dried sample, 50 μL of BSTFA + 1% TMCS was added and kept at 80 °C for 20 minutes, and the mixture was then diluted 10-fold with acetonitrile for injection. A 1 μL aliquot of the prepared sample was injected into the GC-MS (Thermo Ultra-ISQ) through a 1079 injector kept at 280 °C in splitless mode (2 min); the temperature was then increased at 6 °C/min up to 200 °C and at 12 °C/min up to 295 °C, where it was held for 6 minutes. The total run time was 36 minutes. The MS conditions were: transfer line 280 °C, ion source 280 °C, total ion chromatogram (TIC) scanning over m/z 50-550, and selected ion monitoring (SIM) mode for unknown standards.
Figure 7 GC-MS data handling (Source: ThermoFisher manual)
Data interpretation was performed with Xcalibur and TraceFinder by examining the retention time (RT) and mass spectrum of each compound.
Figure 8 GC-MS retention time, m/z and abundance (Source: ThermoFisher manual)
2.1.1 Data format
GC-MS data can be analysed in either a non-targeted or targeted approach.
Non-targeted analysis is the study of all detected metabolic compounds and their variability to discover potential biomarkers associated with a specific organic acidemia. In targeted analysis, a defined panel of metabolic compounds (in this study, 34 identified compounds) is examined to detect compounds of interest, which are called organic acidemia biomarkers (Hiller, Hangebrauk et al. 2009).
In this study, the chromatogram produced by GC-MS was used for the non-targeted approach (Figure 9). The relative abundance on the y-axis, called the total ion chromatogram (TIC), is the sum of the intensities, relative to the internal standard, across all m/z values measured at the same time, i.e. at a specific retention time point (x-axis). Each peak generally represents one specific metabolic compound, although peaks occasionally overlap.
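The TIC definition above can be illustrated with a small sketch. It assumes the raw data have already been exported as a two-dimensional array with one row per scan (retention time point) and one column per m/z channel. The array values, the function names and the use of the largest TIC value as a stand-in for the internal standard signal are all hypothetical choices made for illustration, not the procedure of the GC-MS software.

```python
import numpy as np


def total_ion_chromatogram(intensities: np.ndarray) -> np.ndarray:
    """Sum the intensities over all m/z channels for every scan."""
    return intensities.sum(axis=1)


def relative_abundance(tic: np.ndarray, reference: float) -> np.ndarray:
    """Express the TIC as a percentage of a reference signal."""
    return 100.0 * tic / reference


# Hypothetical toy data: 3 scans (rows) x 3 m/z channels (columns).
intensities = np.array([
    [10.0, 0.0, 5.0],
    [200.0, 50.0, 150.0],
    [20.0, 10.0, 0.0],
])

tic = total_ion_chromatogram(intensities)
# Assumption: the largest TIC value stands in for the internal standard signal.
rel = relative_abundance(tic, reference=tic.max())
```

Plotting `rel` against the retention time of each scan would give a curve of the kind shown in the chromatogram figures.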
Figure 9 Chromatogram of a normal urine sample detected by GC-MS system
In addition to the chromatogram, the GC-MS software (here, ThermoFisher TraceFinder) provides response ratio data for the 34 identified compounds in each sample. We used these data for targeted analysis.
2.1.2 Data processing
Data processing is critical for machine learning, especially when the data contain abundant information. GC-MS can produce high-dimensional, noisy data; for example, a single sample can contain over 9 million high-resolution variables (Skarysz, Alkhalifah et al. 2018). For clinical interpretation, the chromatogram is usually cropped from the GC-MS software and annotated with the compounds identified in its peaks, which is a highly subjective and laborious task. In this study, we instead used the chromatogram cropped directly from the GC-MS software without annotation, which saved considerable time and manpower. These chromatogram images were used to train the CNNs.
Nevertheless, chromatograms, although noisy and complicated, present unique features that distinguish them. The original idea of this study was that such features can be learned directly from raw GC-MS data using advanced machine learning techniques such as CNNs, thereby bypassing the highly complex preprocessing step. The rest of the paper covers this idea.
For the SVM, RF and DNN models, the list of metabolic compounds and their response ratios was the input for further multivariate statistical analysis, whose objective was to identify the metabolic compounds and patterns that discriminate between different OAs.
The number of steps and the complexity of the process described above may lead to some challenges, for example the choice of the best algorithm for baseline identification. Furthermore, preprocessing has the potential to introduce errors and variability. However, both the chromatograms and the compound lists come from the same GC-MS raw data; they are simply different representations of, or contain different information about, the same urine sample.
2.2 Data preparation for machine learning
This section describes how the data were formatted and transformed for the CNNs and for the SVMs, RF and DNNs. Initially we tried to collect clinical urine samples and run GC-MS ourselves to obtain data for this study, but it was hard to obtain enough samples. We therefore changed our strategy and collected clinical reports with a diagnosis of OA from the National Taiwan University Hospital (NTUH) database. We then traced back the corresponding GC-MS raw files and used them to generate the chromatograms and response ratio data. This study was approved by the IRB (No. NTUH 201808009RINC).
2.2.1 Input format for CNNs
For the CNN image dataset, due to the specific structure of the data, the pattern of each metabolic compound is contained in only a small portion of the chromatogram, corresponding to a specific range of retention time. Using current methods and under the supervision of experts, the exact retention time of each target compound and its classification were determined. This process generated a CNN image dataset of raw data labelled by OA. To link processed and raw data, a dataset structure was created with the following folders: OA, OA_samples, Train, Validation and Test. OA contains all clinical organic acidemia urine samples; OA_samples sorts the samples into one subfolder per OA, e.g. 3MCC; Train, Validation and Test hold the machine learning datasets.
Machine learning is computationally demanding. Because of the limited computing power of our notebook computer, the image resolution was restricted to 366x966 RGB or 300x900 RGB, depending on the CNN configuration.
2.2.2 Input format and transformation for SVM, RF and DNNs
For the SVM, RF and DNN dataset, we collected the response ratios of 34 identified compounds from the GC-MS raw file of each urine sample across the different organic acidemias.
Organic acidemia                          Target compounds
Dicarboxylic Aciduria (DA)                Adipic acid, Suberic acid and Sebacic acid, Octanedioic acid, 7-OH octanoic acid, 3-OH-sebacic acid
Ethylmalonic Aciduria (EA)                Ethylmalonic acid, Methylsuccinic acid
Glutaric Aciduria type 1 (GA1)            Glutaric acid and 3-OH-glutarate
Glutaric Aciduria type 2 (GA2)            2-OH-glutaric acid, Glutaric acid, 3-OH-butyric acid, Ethylmalonic acid, Dicarboxylic acids (adipic acid)
Isovaleric Acidemia (IVA)                 Isovalerylglycine
Ketonuria                                 3-OH-butyric acid, 2-OH-butyric acid, 3-OH-isovaleric acid
Lactic Aciduria (LA)                      Lactate and Pyruvate, 2-OH-butyric acid, 3-OH-butyric acid
Lactic Aciduria Ketonuria (LAK)           Lactate, 3-OH-butyric acid
Multiple Carboxylase Deficiency (MCD)     3-OH-isovaleric acid, 3-methylcrotonylglycine, 3-OH-propionic acid, Methylcitrate
Methylmalonic Acidemia (MMA)              Methylmalonic acid, Methylcitrate, 3-OH-propionate
Maple Syrup Urine Disease (MSUD)          2-OH-isovaleric acid, 2-OH-3-methylvaleric acid
Propionic Acidemia (PA)                   3-OH-propionic acid and Methylcitrate
Table 1 Organic acidemias and related target compounds
These data were then transformed into a 34-column matrix as the input for SVM, RF and DNNs (Table 2).
Table 2 Part of the 34 column matrix of the input dataset for SVM/RF/DNNs
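The transformation into a fixed-width matrix can be sketched as follows. This is a minimal illustration, assuming each sample arrives as a mapping from compound name to response ratio; the short compound list stands in for the 34 compounds of the study, and filling a compound that is absent from a sample with 0.0 is an assumption rather than the documented behaviour of the export.

```python
import numpy as np

# Placeholder panel: the study used 34 identified compounds.
COMPOUNDS = ["Adipic acid", "Suberic acid", "Glutaric acid", "Methylcitrate"]


def to_feature_matrix(samples, compounds):
    """Build a (n_samples, n_compounds) matrix of response ratios.

    Each sample is a dict {compound name: response ratio}. Compounds
    missing from a sample are filled with 0.0 (an assumption).
    """
    matrix = np.zeros((len(samples), len(compounds)))
    for i, ratios in enumerate(samples):
        for j, name in enumerate(compounds):
            matrix[i, j] = ratios.get(name, 0.0)
    return matrix


# Hypothetical response ratios for two urine samples.
samples = [
    {"Adipic acid": 0.8, "Suberic acid": 0.3},
    {"Glutaric acid": 2.1, "Methylcitrate": 0.05},
]
matrix = to_feature_matrix(samples, COMPOUNDS)
```

Keeping the column order fixed by one shared compound list ensures that every row of the matrix is comparable across samples.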
2.3 In-house organic acidemia GC-MS raw data
2.3.1 GC-MS raw files in NTUH database
The data used in this study were obtained from 690 diagnosed patients (727 samples in total) with different types of OAs and normal cases (no specific finding) in the NTUH database. The target compounds varied because different OAs have different characteristic compounds, as listed in Table 1. The SVM, RF and DNN dataset contained the response ratios of 34 identified compounds, as listed in the first row of Table 2.
The CNN image dataset contained 721 chromatogram images from the 690 diagnosed patients with different types of OAs and normal cases (no specific finding). Figure 10 shows an example of a chromatogram of a 3MCC urine sample detected by the GC-MS system.
Figure 10 Chromatogram of a 3MCC urine sample detected by GC-MS system
2.3.2 Data augmentation
To increase the robustness of the training and compensate for the limited amount of data, data augmentation was applied (Skarysz, Alkhalifah et al. 2018). We used the Keras data generator “ImageDataGenerator” to increase the number of training images for the CNNs by random transformation of the original images within the ranges width_shift_range=0.2, height_shift_range=0.2, shear_range=0.2 and zoom_range=0.2. Such an augmentation changes the absolute values of images without changing their proportion, i.e. the pattern.
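A configuration along these lines might look like the sketch below. The four range values come from the text; the `rescale` factor, the reading of 300x900 as (height, width), the directory layout with one subfolder per class and the helper name `make_train_generator` are assumptions. `ImageDataGenerator` belongs to the Keras API of the period and is marked as deprecated in recent TensorFlow releases, so the import is kept inside the function.

```python
# Augmentation ranges as stated in the text.
AUGMENTATION_PARAMS = dict(
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
)


def make_train_generator(train_dir, target_size=(300, 900), batch_size=7):
    """Return a generator that yields randomly transformed training images.

    Assumes train_dir contains one subfolder per class. The rescale
    factor and target_size orientation are illustrative assumptions.
    """
    # Imported lazily so that the configuration can be inspected
    # without TensorFlow being installed.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(rescale=1.0 / 255, **AUGMENTATION_PARAMS)
    return datagen.flow_from_directory(
        train_dir,
        target_size=target_size,
        batch_size=batch_size,
        class_mode="categorical",
    )
```

Each batch drawn from such a generator applies a fresh random shift, shear and zoom, which is why repeated passes over the same files do not produce identical inputs.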
This augmentation step was repeated to obtain enough additional images for training the CNNs, so that the network never saw exactly the same input twice. However, the inputs it saw were still heavily intercorrelated because they came from a small number of original images: augmentation cannot produce new information, only remix existing information.
For the SVM, RF and DNN dataset, the data also had to be augmented for training because of the limited number of samples, especially for some diseases such as AADC, GA1 and PA. We calculated the means and standard deviations of the original samples and assumed that the values were normally distributed. We then randomly generated the training dataset using 0.25 SD (standard deviation). The training dataset was therefore generated artificially and does not consist of true data from the GC-MS raw files in the NTUH database.
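One possible reading of this procedure is sketched below: for each disease class, synthetic rows are drawn from a normal distribution centred on the class mean, with the per-compound standard deviation scaled by 0.25. The scaling interpretation, the clipping of negative values to zero, the fixed random seed and the function name `synthesize_class` are assumptions made for illustration; the text does not specify these details.

```python
import numpy as np


def synthesize_class(samples, n_new, sd_scale=0.25, seed=0):
    """Draw n_new synthetic rows around the mean of the original samples.

    samples: array of shape (n_original, n_compounds), n_original >= 2.
    The spread is sd_scale times the sample standard deviation
    (one interpretation of "0.25 SD").
    """
    rng = np.random.default_rng(seed)
    mean = samples.mean(axis=0)
    sd = samples.std(axis=0, ddof=1)
    synthetic = rng.normal(loc=mean, scale=sd_scale * sd,
                           size=(n_new, samples.shape[1]))
    # Assumption: response ratios cannot be negative.
    return np.clip(synthetic, 0.0, None)


# Hypothetical response ratios for three original samples of one class.
original = np.array([
    [1.0, 0.2, 5.0],
    [1.2, 0.1, 4.0],
    [0.8, 0.3, 6.0],
])
synthetic = synthesize_class(original, n_new=1000)
```

Because every synthetic row is derived from the same few originals, the generated set inherits their mean and cannot add information that the originals lack.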
2.3.3 Training and testing sets
The CNN image dataset was randomly divided into training, validation and testing sets in a proportion of approximately 3:1:1. The training set contained 441 samples, while the validation and testing sets contained 140 samples each.
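A random split with these set sizes can be sketched in a few lines. The file names, the seed and the function name `split_dataset` are hypothetical, and the sketch shuffles the whole dataset at once, whereas the per-category counts in Table 3 suggest the split may have been made within each category.

```python
import random


def split_dataset(items, n_train, n_val, seed=0):
    """Shuffle items and cut them into training, validation and testing sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names for the 721 chromatogram images.
images = [f"chromatogram_{i:03d}.png" for i in range(721)]
train, val, test = split_dataset(images, n_train=441, n_val=140)
```

Fixing the seed makes the partition reproducible, which matters when several models are compared on the same testing set.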
The SVM, RF and DNN models used the artificially generated data as the training dataset and the original sample data as the testing dataset.
No.  Category   CNNs training  CNNs validation  CNNs testing  SVM/RF/DNNs training  SVM/RF/DNNs testing
8    Ketonuria  9              2                2             10088                 14
9    KDA        14             4                4             10088                 25
10   LA         11             3                3             10088                 20
11   LAK        3              1                1             10088                 8
12   MCD        7              2                2             10088                 6
13   MMA        16             5                5             10088                 21
14   MSUD       5              1                1             10088                 5
15   MDS        45             14               14            10088                 85
16   PA         6              1                1             10088                 6
17   normal     267            90               90            10088                 475
     Total      441            140              140           181584                727
Table 3 Numbers of samples in the dataset for CNNs and SVM/RF/DNNs
The statistics of the SVM/RF/DNN training and testing datasets are listed in Table 4 and Table 5.
Table 4 Training set for SVM/RF/DNNs
Table 5 Testing set for SVM/RF/DNNs
The unequal group sizes (imbalanced dataset) result from the fact that each of the considered OAs has a different prevalence.
2.4 Methods
We used different machine learning models to perform disease discrimination and classification. The confusion matrix and AUC were used to evaluate each machine learning method, and accuracy, precision, recall and F1 score were then used to compare the performance of the different models.
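The relationship between the confusion matrix and the derived scores can be made concrete with a short sketch. It computes per-class precision, recall and F1 plus overall accuracy from integer class labels; the toy labels and function names are hypothetical, AUC is not covered, and how the per-class scores were averaged in the study is not stated, so no averaging is shown.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def scores_from_confusion(cm):
    """Return accuracy and per-class precision, recall and F1."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # column sums: samples predicted as each class
    actual = cm.sum(axis=1)     # row sums: samples truly in each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1


# Hypothetical labels for six test samples and three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy, precision, recall, f1 = scores_from_confusion(cm)
```

With an imbalanced dataset, per-class recall is often more informative than accuracy, because a large normal class can dominate the overall figure.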
In this study, we used Python as our programming language. The standard Keras utilities with the TensorFlow backend were used to build the different machine learning models. We also used the Nvidia CUDA toolkit for GPU acceleration.
Figure 11 Concept of CNNs model
2.4.1 Implementation for convolutional neural networks
In a GC-MS chromatogram, a geometrical correlation occurs along the retention time dimension. Along this dimension, the response of different m/z values increases and decreases, creating peaks as the compounds exit the column. We therefore tested two types of filters: two-dimensional filters, and specific one-dimensional filters along the RT axis only. For two-dimensional filters, sizes were set to (3,3) and (2,2) for the convolution and pooling layers, respectively. For one-dimensional filters, sizes were set to (3,1) and (2,1) (Skarysz, Alkhalifah et al. 2018). The convolutional neural network architecture stacked multiple convolutional layers with ReLU activation before each pooling layer (Figure 12). Several variants of the architecture were tested in preliminary experiments; each was based on a similar layered block consisting of four to seven convolutional layers, a pooling layer and a dropout layer with rate 0.5. The tested architectures were built of four, five and six such blocks, respectively, followed by a fully connected layer with ReLU activation, a dropout layer with rate 0.5 and a fully connected layer with softmax activation. The batch size, steps per epoch and number of epochs were set to 7, 63 and 100, respectively. In these preliminary tests the three architectures performed similarly. The best-performing network turned out to be the smallest, with two convolutional layers. This architecture was selected for the experiments in the rest of the paper. More thorough investigation of architecture types and their parameters is an interesting direction for future research.
Figure 12 An example of CNNs blocks
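The block structure described in this section can be sketched with the Keras Sequential API. The kernel and pooling sizes, the dropout rate of 0.5, the ReLU and softmax activations and the batch size of 7 come from the text. The number of filters, the width of the dense layer, the optimizer, the loss, the input orientation and the default of 18 output classes are assumptions, and `build_cnn` and `steps_per_epoch` are hypothetical helper names. The helper also shows that 441 training images with a batch size of 7 give the stated 63 steps per epoch.

```python
import math


def steps_per_epoch(n_train_images, batch_size):
    """Number of batches needed to see every training image once."""
    return math.ceil(n_train_images / batch_size)


def build_cnn(input_shape=(300, 900, 3), n_classes=18, n_blocks=4,
              convs_per_block=2, kernel=(3, 3), pool=(2, 2), filters=32):
    """Stack blocks of [Conv2D x convs_per_block, MaxPooling2D, Dropout(0.5)]."""
    # Imported lazily so the helper above works without TensorFlow.
    from tensorflow.keras import layers, models

    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for _ in range(n_blocks):
        for _ in range(convs_per_block):
            model.add(layers.Conv2D(filters, kernel, padding="same",
                                    activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=pool))
        model.add(layers.Dropout(0.5))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))  # width is an assumption
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(n_classes, activation="softmax"))
    # Optimizer and loss are assumptions; the text does not name them.
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


steps = steps_per_epoch(441, 7)
```

Passing `kernel=(3, 1)` and `pool=(2, 1)` would give the one-dimensional variant; which image axis corresponds to retention time depends on how the chromatogram images are loaded.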
2.4.2 Implementation for the support vector machine
We used the scikit-learn module to build the support vector machine. Scikit-learn is a Python module for machine learning built on top of SciPy. From the sklearn.svm module, we used the SVC class with the parameters C=1.0, cache_size=200, class_weight=None, coef0=0.0, probability=False, shrinking=True, tol=0.001, decision_function_shape='ovr', degree=3, gamma='auto_deprecated', kernel='rbf', max_iter=-1, random_state=None and verbose=False.
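A comparable classifier can be set up as in the sketch below. The value gamma='auto_deprecated' was the default placeholder in the scikit-learn releases of the time, where it behaved like gamma='auto' (1 / n_features); current releases no longer accept it, so 'auto' is used here as the closest equivalent. Only the main parameters from the list are passed explicitly, the rest are left at library defaults, and the two-cluster toy data are entirely hypothetical.

```python
import numpy as np
from sklearn.svm import SVC


def build_svm():
    """RBF-kernel SVC with the main parameter values listed in the text."""
    return SVC(
        C=1.0,
        kernel="rbf",
        degree=3,
        gamma="auto",  # stands in for the legacy 'auto_deprecated'
        coef0=0.0,
        tol=0.001,
        cache_size=200,
        max_iter=-1,
        decision_function_shape="ovr",
    )


# Hypothetical toy data: two well-separated classes with 34 features each.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 34)),
    rng.normal(1.0, 0.1, size=(20, 34)),
])
y_train = np.array([0] * 20 + [1] * 20)

svm = build_svm().fit(X_train, y_train)
svm_pred = svm.predict(np.vstack([np.zeros((1, 34)), np.ones((1, 34))]))
```

Because the RBF kernel depends on distances between samples, response ratios on very different scales may need standardization before fitting; whether this was done in the study is not stated.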
Figure 13 Concept of SVM, RF and DNNs model
2.4.3 Implementation for the random forest
We used the scikit-learn module to build the random forest model. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. From the scikit-learn ensemble module, we used the RandomForestClassifier class with the parameters n_estimators=100 and n_jobs=-1; the latter runs the trees in parallel on all available CPU cores (two on our machine).
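The configuration described above can be sketched as follows. The values n_estimators=100 and n_jobs=-1 come from the text; the fixed random_state is an assumption added for reproducibility, and the two-cluster toy data with 34 features are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: two well-separated classes with 34 features each.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 34)),
    rng.normal(1.0, 0.1, size=(20, 34)),
])
y_toy = np.array([0] * 20 + [1] * 20)

forest = RandomForestClassifier(
    n_estimators=100,  # number of decision trees
    n_jobs=-1,         # use all available CPU cores
    random_state=0,    # assumption: fixed seed for reproducibility
)
forest.fit(X_toy, y_toy)
forest_pred = forest.predict(np.vstack([np.zeros((1, 34)), np.ones((1, 34))]))
```

After fitting, `forest.feature_importances_` gives one value per input column, which could indicate which compounds contribute most to the classification.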
2.4.4 Implementation for the deep neural networks
For the DNN implementation in this study, we used the Keras modules models, layers and optimizers, together with classes such as Sequential, Dense, Activation, Dropout, SGD and RMSprop from these modules, with TensorFlow as the backend. The DNNs used the sigmoid activation function in the hidden layers and the softmax activation function in the output layer. The size of the hidden layer was set to 2 or 1, and the network was trained with backpropagation using a learning rate of lr=0.001 with RMSprop or lr=0.01 with SGD.
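A network of this kind might be assembled as in the sketch below. The sigmoid and softmax activations and the two optimizer and learning rate pairings come from the text. The sketch interprets the setting of 1 or 2 as the number of hidden layers, which is an assumption, as are the 64 units per hidden layer, the 18 output classes, the loss function and the helper name `build_dnn`. Dropout is mentioned in the text but its placement is not, so it is left out.

```python
# Optimizer and learning rate pairings as stated in the text.
LEARNING_RATES = {"rmsprop": 0.001, "sgd": 0.01}


def build_dnn(n_features=34, n_classes=18, n_hidden_layers=2,
              hidden_units=64, optimizer_name="rmsprop"):
    """Fully connected network with sigmoid hidden layers and softmax output."""
    # Imported lazily so the settings above can be read without TensorFlow.
    from tensorflow.keras import layers, models, optimizers

    model = models.Sequential()
    model.add(layers.Input(shape=(n_features,)))
    for _ in range(n_hidden_layers):
        model.add(layers.Dense(hidden_units, activation="sigmoid"))
    model.add(layers.Dense(n_classes, activation="softmax"))

    lr = LEARNING_RATES[optimizer_name]
    if optimizer_name == "rmsprop":
        optimizer = optimizers.RMSprop(learning_rate=lr)
    else:
        optimizer = optimizers.SGD(learning_rate=lr)

    # Assumption: integer class labels, hence the sparse loss.
    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Recent Keras releases name the argument `learning_rate`; the `lr` spelling used in the text is the older alias for the same setting.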