Thesis Organization - Using Deep Learning for Adverse Drug Reaction Signal Detection

Chapter 1 Introduction

1.3 Thesis Organization

The rest of this thesis is organized as follows.

Chapter 2 presents background knowledge related to this work and summarizes state-of-the-art drug-related research. Section 2.1 describes FAERS, the most well-known spontaneous reporting system. Section 2.2 describes existing methods for detecting adverse drug reactions. Section 2.3 introduces deep learning and shows the structure of the convolutional neural network. Section 2.4 summarizes related work.

Chapter 3 presents the complete process and framework of our CNN-based ADR detection method. Section 3.1 elaborates the preprocessing that transforms the FAERS data into a training set suitable for the CNN model. Section 3.2 describes how we feed data into the models and the structure of the CNN-based model.

Chapter 4 presents our experiments and results. Section 4.1 describes the experimental environment and the data we used. In Section 4.2, we present the experiment to examine the effect of perplexing negatives. In Section 4.3, we show different models and their results on different test sets. Section 4.4 describes how to decide the structure of the model through experiments. Section 4.5 presents the performance results of our proposed model on ADR detection.

Finally, we state conclusions and future work in Chapter 5.

Chapter 2 Background Knowledge and Related Work

In this chapter, we present background knowledge for our study, including the FAERS system, ADR detection methods, deep learning, and related work.

2.1 FDA Adverse Event Reporting System (FAERS)

FAERS [37] is a system designed to support FDA's post-marketing safety surveillance program for drug and therapeutic biologic products. The system receives adverse event reports, medication error reports, and product quality complaints that result in adverse events, directly from healthcare professionals and consumers.

The data collected by FAERS have been released to the public every quarter since 2004.

The data schema is composed of seven related tables as organized in Table 2.1.

Table 2.1 Table summary of FAERS database

Table Name | Content
Demo | Records demographic information such as the patient's age, weight, gender, etc., and administrative information such as the time and result of the event.
Drug | Records detailed information about the drug, including the name of the drug, the dose, the route, etc.
Reac | Records adverse reactions, coded as preferred terms in MedDRA [39].
Outc | Records patient outcome codes for the event.
RPSR | Records the reporting source code for the event.
THER | For each drug in each event, records the drug therapy start and end dates.
INDI | Records indications, coded as preferred terms in MedDRA.

On average, more than 100,000 reports are collected by FAERS each quarter. There are some limitations on the use of FAERS. For example, the causal relationship between a drug and an adverse reaction is uncertain, and a report may not contain enough information to show all the factors of an adverse event.

2.2 ADR Detection

ADR detection refers to the task of measuring the causal relationship between a drug and an adverse reaction (from spontaneous reporting system data such as FAERS) so as to identify potential ADR signals (pairs of drugs and reactions) for further evaluation.

In the literature, many ADR detection methods have been proposed. Most of them are statistics-based approaches and can be classified into two categories: simple disproportionality methods and Bayesian methods. Both require counting the occurrences of the patterns represented in the contingency table shown in Table 2.2, where a is the number of cases in which symptom S occurs after taking drug D, b is the number of cases in which symptoms other than S occur after taking drug D, c is the number of cases in which symptom S occurs after taking drugs other than D, and d is the number of cases in which symptoms other than S occur after taking drugs other than D.

Table 2.2 Contingency table for ADR detection

 | Symptom S | Other symptoms | Total
Drug D | a | b | a+b
Other drugs | c | d | c+d
Total | a+c | b+d | a+b+c+d
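As a minimal sketch of how the four counts could be tallied, assume each report has already been reduced to a single (drug, symptom) pair. This layout and the function name `contingency_counts` are assumptions made for illustration; real FAERS reports list several drugs and reactions each.

```python
from typing import Iterable, Tuple


def contingency_counts(reports: Iterable[Tuple[str, str]], drug: str, symptom: str):
    """Return (a, b, c, d) of the 2x2 contingency table for `drug` and `symptom`.

    `reports` is a hypothetical iterable of (drug, symptom) pairs; it is an
    illustrative layout, not the FAERS schema.
    """
    a = b = c = d = 0
    for rep_drug, rep_symptom in reports:
        if rep_drug == drug and rep_symptom == symptom:
            a += 1  # drug D with symptom S
        elif rep_drug == drug:
            b += 1  # drug D with another symptom
        elif rep_symptom == symptom:
            c += 1  # another drug with symptom S
        else:
            d += 1  # another drug with another symptom
    return a, b, c, d


# Made-up reports, for illustration only.
reports = [
    ("D", "S"), ("D", "S"), ("D", "X"),
    ("E", "S"), ("E", "X"), ("F", "Y"),
]
counts = contingency_counts(reports, "D", "S")
print(counts)
```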

Typical disproportionality measurements include Proportional Reporting Ratio (PRR), Reporting Odds Ratio (ROR), chi-square (χ²), RR [23], and IC [14]. Definitions of these measurements are summarized in Table 2.3.

Table 2.3 Formula of typical disproportionality measures.

Measure Formula

The class of simple disproportionality methods computes one or more of these measures to determine whether a drug-ADR pair is a potential signal for further inspection. Examples include the PRR hybrid method used by the EU and UK and the ROR method used by the Netherlands Pharmacovigilance Foundation, which are summarized in Table 2.4.

Table 2.4 A summary of prevailing disproportionality methods

Method | Organization | Criterion
PRR95 [22][25][35] | UK Yellow Card |
MHRA [22] | UK Medicines and Healthcare products Regulatory Agency (MHRA) | PRR ≥ 2, χ² ≥ 4, a ≥ 3
ROR95 [22][25] | Netherlands Pharmacovigilance |
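To make the MHRA criterion concrete, the following sketch computes PRR, ROR, and χ² from the counts a, b, c, d and applies the thresholds PRR ≥ 2, χ² ≥ 4, a ≥ 3. The formulas are the commonly used textbook definitions and are an assumption here: the χ² is Pearson's statistic without continuity correction, which may differ from the exact form used in the thesis. All function names are illustrative.

```python
def prr(a, b, c, d):
    """Proportional Reporting Ratio: (a / (a + b)) / (c / (c + d)).

    Assumes c > 0; no zero-count handling is attempted in this sketch.
    """
    return (a / (a + b)) / (c / (c + d))


def ror(a, b, c, d):
    """Reporting Odds Ratio: (a * d) / (b * c). Assumes b > 0 and c > 0."""
    return (a * d) / (b * c)


def chi_square(a, b, c, d):
    """Pearson chi-square of the 2x2 table, without continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def mhra_signal(a, b, c, d):
    """Apply the criterion PRR >= 2, chi-square >= 4, a >= 3."""
    return prr(a, b, c, d) >= 2 and chi_square(a, b, c, d) >= 4 and a >= 3


# Made-up counts, for illustration only.
print(prr(20, 80, 100, 1800), ror(20, 80, 100, 1800))
print(mhra_signal(20, 80, 100, 1800))  # enough cases and strong disproportion
print(mhra_signal(2, 8, 100, 1800))    # same ratio, but fewer than 3 cases
```

The second call shows why the case-count condition matters: the ratio is unchanged, but two reports are too few to be flagged.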

The class of Bayesian methods is derived from Bayes' theorem and measures the ratio of observed drug-ADR counts to their expected values. The best-known methods are BCPNN (Bayesian Confidence Propagation Neural Network) [15] and MGPS (Multi-item Gamma-Poisson Shrinker) [25]. BCPNN, developed and used by the WHO Uppsala Monitoring Centre (UMC), adopts IC as the measure [1][2]. MGPS [34], developed by DuMouchel and Pregibon (2001) and used by the US FDA, adopts RR as the measure. A summary of these methods is shown in Table 2.5. However, all of these methods rely on statistics to infer the causal relationship between drugs and adverse reactions; they cannot learn from the features of the data.

Table 2.5 A summary of Bayesian methods

Method | Organization | Criterion
BCPNN | WHO Uppsala Monitoring Centre |

Machine learning is a branch of artificial intelligence in which computers automatically analyze data and learn rules from it. Many classic machine learning methods have been proposed, such as PCA (principal component analysis) [21], SVM (support vector machine) [5], K-means clustering [19], and linear regression [27]. Machine learning methods have been applied to different problems, such as classification, clustering, regression, and dimension reduction.

Deep learning is a branch of machine learning proposed in recent years. Deep learning methods are built on the artificial neural network (ANN), a multi-layer structure that is designed to mimic the way the brain works and represents multiple levels of thinking. Problems such as feature extraction and classification can be solved directly in this multi-layer structure through end-to-end training, which means that we only need to provide the input and the output.

The convolutional neural network (CNN) is one kind of deep learning model. LeCun et al. [16] proposed the LeNet architecture in 1998, which is one of the earliest convolutional neural networks. AlexNet is another well-known CNN model, which exhibited excellent performance on image recognition in the 2012 ImageNet Large Scale Visual Recognition Challenge [26][38], an image classification competition. The convolutional neural network model demonstrates a very good ability to handle classification problems.

Like most deep learning methods, the convolutional neural network is designed to extract and analyze features; an example of its general structure is shown in Figure 2.1. We can divide the structure into two parts. The first part extracts the features of the data and consists of convolutional layers and pooling layers. Usually, the input is represented as a matrix. The convolutional layers use filters to map the features of the input data onto a smaller matrix. Specifically, the filter slides over the data, covering one area of the data after each slide. The more similar the data in this area is to the feature the filter represents, the larger the output. An example is shown in Figure 2.2. A convolutional layer can have multiple filters. More filters can improve identification, but they increase the computational cost and the risk of over-fitting.

Figure 2.1 An example structure of CNN.

Figure 2.2 An example of filter and feature extraction.
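The sliding-filter operation can be sketched in a few lines of plain Python. This is an illustrative example with a made-up 4×4 input and a 2×2 diagonal filter, using stride 1 and no padding (both assumptions); like most deep learning frameworks, it computes a cross-correlation, i.e., the filter is not flipped.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Sum of element-wise products between the filter and the covered area.
            total = 0
            for u in range(kh):
                for v in range(kw):
                    total += image[i + u][j + v] * kernel[u][v]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Made-up input containing a diagonal line, and a filter that looks for diagonals.
image = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
diagonal_filter = [
    [1, 0],
    [0, 1],
]
feature_map = convolve2d_valid(image, diagonal_filter)
for row in feature_map:
    print(row)
```

The output is largest exactly where the covered area matches the diagonal pattern of the filter, and the 4×4 input is mapped onto a smaller 3×3 matrix.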

The main purpose of the pooling layer is to mitigate the over-fitting caused by the convolutional layer. It also reduces the output size and helps retain the required features. There are many ways to implement pooling, e.g., max pooling and mean pooling. Max pooling compares the values in each area and keeps the largest one, which better preserves texture features. An example is shown in Figure 2.3. Mean pooling replaces the values in each area with the regional mean to preserve the overall feature. The convolutional layers and pooling layers output multiple feature maps.

Figure 2.3 An example of max pooling.
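Both pooling variants can be expressed with one helper that applies a reducing function to each window. The sketch below assumes non-overlapping 2×2 windows and a made-up 4×4 feature map; the names `pool2d` and `mean` are illustrative.

```python
def pool2d(feature_map, size, reducer):
    """Apply `reducer` to each non-overlapping size x size window of `feature_map`."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + u][j + v] for u in range(size) for v in range(size)]
            row.append(reducer(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Made-up feature map, for illustration only.
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 4, 8],
    [2, 0, 6, 2],
]
max_pooled = pool2d(feature_map, 2, max)    # keeps the largest value per window
mean_pooled = pool2d(feature_map, 2, mean)  # keeps the regional mean per window
print(max_pooled)
print(mean_pooled)
```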

The second part of the CNN model analyzes the features to answer the question posed by the user. The most commonly adopted structure connects the feature maps to multiple layers of neurons called fully connected layers. Their output is then passed through an activation function, which highlights the features of the data that the user needs. After these steps, the data become multiple values, each corresponding to a classification result. By comparing these values, we can determine which class the data belongs to.
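As a rough illustration of this second part, the sketch below passes a flattened feature vector through one fully connected layer and then a softmax activation, and picks the class with the largest value. The choice of softmax, the single layer, and all weights are assumptions made for the example, not the configuration used in this thesis.

```python
import math


def dense(features, weights, biases):
    """One fully connected layer: a weighted sum per output neuron plus a bias."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw scores into values that sum to 1 (shifted by the max for stability)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up numbers, for illustration only.
features = [0.5, 1.0, -1.0]                    # flattened feature maps
weights = [[1.0, 2.0, 0.0], [0.0, -1.0, 1.0]]  # one row per class
biases = [0.0, 0.5]

scores = dense(features, weights, biases)
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
print(scores, probabilities, predicted_class)
```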

2.4 Related Work

In recent years, many studies have analyzed drug-related problems with machine learning algorithms. Jamal et al. [11] proposed a machine learning method to predict the adverse effects of drugs from the biological, chemical, and phenotypic properties of drugs provided by the SIDER, DrugBank, and PubChem [40] databases. Superior predictive outcomes were obtained for 22 neurologically relevant symptoms (e.g., autonomic neuropathy, psychosis). This research demonstrated the association between the properties of drugs and symptoms across different academic disciplines.

Burbidge et al. [3] applied machine learning methods to drug design and used SVM to analyze quantitative structure-activity relationships (QSAR). QSAR is an approach used in pharmacy and chemistry to analyze the relationship between the structure of a molecule and the activity or properties of the substance. Duan et al. [6] proposed an ensemble method combining the likelihood ratio, BCPNN, and a Bayesian neural network.

There are also some studies based on deep learning. Unterthiner et al. [28] proposed a system called DeepTox that predicts the toxicity of drugs by feeding the chemical properties of the drug into a 5-layer deep neural network. Hughes et al. [10] studied the relationship between bonds in proteins and drug toxicity using deep convolutional neural networks. Wen et al. [31] proposed a method for predicting drug-target interaction (DTI), which helps to infer indications and adverse reactions during drug discovery. They first screened drugs from the DrugBank database and then built their model from the data documented in the database. Their research helps to update the DTI list. Xu et al. [32] converted the structure of drugs to the SMILES form to predict drug-induced liver injury (DILI); the study achieved an accuracy of 86.9%. Park and Kellis [20] proposed feeding DNA and RNA binding data into a deep learning network to calculate molecular affinity, which can be used to design drugs. Yao et al. [33] surveyed deep learning in healthcare applications. The survey suggests some research directions and demonstrates the learning ability of convolutional neural networks.

We summarize the state-of-the-art research in Table 2.6.

Table 2.6 The state-of-the-art research

Model | Target | Dataset | Method
PRR95 [22][25][35] | Signal detection | FAERS | statistics
ROR95 [22][25] | Signal detection | FAERS | statistics
MHRA [22] | Signal detection | FAERS | statistics
BCPNN [1][2][15] | Signal detection | FAERS | statistics, ML
MGPS [25][34] | Signal detection | FAERS | statistics, ML
Ensemble [6] | Causal relationship | OMOP | statistics, ML
DeepTox [28] | Toxicity prediction | Tox21 | DL
Park and Kellis [20] | Drug discovery | DeepBind | DL (CNN)

In summary, previous deep learning work mostly analyzes the microscopic properties of drugs. No work has considered the different levels of the Anatomical Therapeutic Chemical Classification System (ATC) [42], shown in Table 2.7, which classifies drugs from different perspectives. In this thesis, we explore the rich information provided by a drug's ATC code to predict specific adverse reactions.

Table 2.7 The meaning of ATC code

The first level | A letter, representing the anatomical classification.
The second level | Two digits, representing the therapeutic classification.
The third level | A letter, representing the pharmacological classification.
The fourth level | A letter, representing the chemical classification.
The fifth level | Two digits, representing the chemical substance.
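Because each level occupies a fixed position in the code, a full seven-character ATC code can be split into its five levels by slicing. The following is a small sketch; the helper name `split_atc` is illustrative, and A10BA02 (the ATC code of metformin) is used only as an example.

```python
def split_atc(code):
    """Split a full 7-character ATC code into its five levels."""
    if len(code) != 7:
        raise ValueError("expected a full 7-character ATC code")
    return (
        code[0],    # first level: one letter, anatomical
        code[1:3],  # second level: two digits, therapeutic
        code[3],    # third level: one letter, pharmacological
        code[4],    # fourth level: one letter, chemical
        code[5:7],  # fifth level: two digits, chemical substance
    )


levels = split_atc("A10BA02")
print(levels)
```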

Chapter 3 The Proposed Deep Learning ADR Prediction Framework

In this chapter, we will describe the proposed deep learning-based framework for ADR prediction. Figure 3.1 depicts the framework structure. Each component is detailed in the following sections.

Figure 3.1 Proposed deep learning ADR detection framework

3.1 Preprocessing

Rather than using the original FAERS data, we use the ADR contingency cubes built for our iADRs analysis system [17][18]. From the ADR cubes we chose the most detailed subcube, composed of Age, Gender, Drug name, Symptom, and the four numerical values a, b, c, and d that record the values in the corresponding 2×2 contingency table for ADR detection. Note that Age is discretized into 10 levels and the drug is represented by its ATC code. Figure 3.2 shows an example of this subcube in table format.

Figure 3.2 Table representation of an example ADR subcube.

Using this subcube as input data, the preprocessing module transforms it into a training set for building our deep learning network model. In the following subsections, we explain the main procedures, including class labeling and disambiguation, class binarization, and class imbalance handling, as well as the roles of MedDRA and SIDER in the preprocessing.

3.1.1 Class labeling

In Section 2.1 we mentioned that the causal relationship between drugs and adverse reactions in the FAERS reports has not been verified. This uncertainty hinders the construction of a training set, which requires correct class labels. To solve this problem, we utilize the well-known Side Effect Resource (SIDER), which was proposed by Kuhn et al. in 2008 [13] and contains information about the side effects (adverse reactions) of marketed drugs. Its data are collected from the FDA, national registries, and charity organizations. It records 1,430 drugs, 5,868 adverse reactions, and 139,756 drug-SE pairs. In the database, adverse reactions are represented by the PT layer and the LLT layer in MedDRA, and drug names are available both encoded by the STITCH system [12] and encoded by the ATC system. More specifically, we inspected all data in the ADR cubes and moved every case whose drug-ADR pair was found in SIDER into the data to be processed to form the training set.
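The selection step amounts to a membership test of each instance's drug-ADR pair against the set of pairs documented in SIDER. The sketch below uses placeholder drug and symptom identifiers and a hypothetical dictionary layout for cube instances; neither is taken from the actual cubes or from SIDER.

```python
def select_labeled_cases(cube_instances, sider_pairs):
    """Keep only the instances whose (drug, symptom) pair is documented in SIDER."""
    return [inst for inst in cube_instances
            if (inst["drug"], inst["symptom"]) in sider_pairs]


# Placeholder data, for illustration only.
sider_pairs = {("D1", "S1"), ("D2", "S3")}
cube_instances = [
    {"drug": "D1", "symptom": "S1", "a": 12},
    {"drug": "D1", "symptom": "S2", "a": 4},   # pair not in SIDER: left out
    {"drug": "D2", "symptom": "S3", "a": 7},
    {"drug": "D3", "symptom": "S1", "a": 2},   # pair not in SIDER: left out
]
training_candidates = select_labeled_cases(cube_instances, sider_pairs)
print(training_candidates)
```

Storing the pairs in a set keeps each lookup at constant time, which matters when every cube instance has to be checked.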

3.1.2 Class disambiguation

In the medical and healthcare community, professionals may use different words to record the observed symptoms. We need to standardize these words to ensure that we can find all the cases of a specific symptom. We therefore used the MedDRA database to solve this problem. MedDRA was developed by the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) and ICH partners, including WHO. It was designed to standardize the terminology required by regulators and industry, and is updated frequently to meet users' requirements. MedDRA describes symptoms at five levels, as shown in Table 3.1.

Table 3.1 Levels in MedDRA

Level | Content
Lowest Level Terms (LLT) | How observations are reported
Preferred Terms (PT) | A distinct descriptor (single medical concept) for a symptom, sign, etc.
High Level Terms (HLT) | Groups in which PTs are summarized according to anatomy, pathology, physiology, or their function
High Level Group Terms (HLGT) | Interrelated HLTs
System Organ Classes (SOC) | Grouping of etiology, manifestation site, and purpose

LLTs contain many synonyms, so we first described the symptoms with PTs. However, when we targeted a specific symptom, only a few cases were found, which would have a negative impact on the learning of the model. We therefore describe the symptoms at a higher level. To keep the groups of symptoms close to meaningful synonyms, we chose the layer directly above PT, i.e., HLT.
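The effect of moving from PT to HLT can be illustrated by rolling up case counts with a PT-to-HLT lookup. The mapping below is a hand-written illustrative excerpt, not data read from a MedDRA release, and the case list is made up.

```python
from collections import Counter

# Illustrative PT -> HLT excerpt (an assumption, not loaded from MedDRA files).
pt_to_hlt = {
    "Myocardial infarction": "Ischaemic coronary artery disorders",
    "Acute myocardial infarction": "Ischaemic coronary artery disorders",
    "Angina pectoris": "Ischaemic coronary artery disorders",
    "Headache": "Headaches NEC",
}


def count_by_hlt(case_pts, mapping):
    """Count cases per HLT; PTs missing from the mapping are skipped."""
    return Counter(mapping[pt] for pt in case_pts if pt in mapping)


# One PT per made-up case.
case_pts = [
    "Myocardial infarction",
    "Acute myocardial infarction",
    "Angina pectoris",
    "Headache",
]
hlt_counts = count_by_hlt(case_pts, pt_to_hlt)
print(hlt_counts)
```

Each PT here has a single case, while the HLT that groups the three related PTs has three, which is the increase in positive cases that motivates the choice of HLT.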

3.1.3 Class binarization

By nature, ADR prediction is a multi-class classification task, since a drug may cause many different adverse reactions, but for the following reason we transform it into a binary classification problem. Recall that the adverse reactions recorded in the SRS system are uncertain, which makes it more difficult to determine the right combination of adverse reactions when constructing the training dataset. We thus adopted the one-to-the-others strategy. That is, for each specific adverse reaction, we build a binary classification model to predict whether a patient who takes a particular drug will suffer that adverse reaction. For example, if we want to predict whether myocardial infarction will occur after a patient takes a specific drug, we use all cube instances that record myocardial infarction as positive cases and all cube instances that record other PTs as negative cases, as illustrated in Figure 3.3.

However, this makes the set of negative cases much larger than the set of positive cases, because many kinds of adverse reactions may follow a drug and the one we focus on is only one of them. This problem is handled in the next subsection. Another phenomenon needs special attention. Since a drug may cause more than one adverse reaction, it is very likely that after class binarization the drug belongs to both the positive and the negative class. Since our focus is on the positive cases, i.e., identifying the drugs that cause the adverse reaction of concern, the presence of the same drug in the negative class makes it harder for the binary model to reach a positive decision. We call such negative cases perplexing negatives. A simple solution is to eliminate the perplexing negatives from the training set. For example, in Figure 3.3, instances 2 to 5 are eliminated. The effect of pruning perplexing negatives is shown in Section 4.2.

Figure 3.3 An example of one-to-the-others strategy
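The labeling and pruning steps can be sketched as follows. For brevity, each cube instance is reduced to a hypothetical (drug, symptom) pair, leaving out Age, Gender, and the counts; the data and the function name `binarize` are made up for illustration.

```python
def binarize(instances, target):
    """Label instances 1 if they record `target`, else 0, dropping perplexing negatives.

    A perplexing negative is a negative instance whose drug also appears in
    at least one positive instance.
    """
    positive_drugs = {drug for drug, symptom in instances if symptom == target}
    labeled = []
    for drug, symptom in instances:
        if symptom == target:
            labeled.append((drug, symptom, 1))
        elif drug not in positive_drugs:
            labeled.append((drug, symptom, 0))
        # Otherwise the instance is a perplexing negative and is left out.
    return labeled


# Made-up instances, for illustration only.
instances = [
    ("D1", "myocardial infarction"),
    ("D1", "nausea"),   # perplexing negative: D1 is also a positive drug
    ("D1", "rash"),     # perplexing negative
    ("D2", "nausea"),
    ("D3", "headache"),
]
training_set = binarize(instances, "myocardial infarction")
print(training_set)
```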

3.1.4 Class imbalance handling

After class binarization, all instances in the ADR cube are divided into two categories: those that record the specific adverse reaction and those that do not. Note that the second category is significantly larger than the first. For example, in the data of 2004-2006, only 3,521 cube instances record myocardial infarction, whereas 672,297 instances record other adverse reactions. This is the well-known class imbalance problem, which causes the learned model to exhibit very low accuracy for the rare class.

To solve this problem, we used the SMOTE [5] approach to oversample the rare category that corresponds to the specific adverse reaction of concern. SMOTE generates synthetic samples of the rare class to balance the data, as shown in Figure 3.4.

More specifically, SMOTE randomly selects a sample S1 of the rare class, finds the K nearest samples of S1 in the same class, randomly selects one sample S2 from these neighbors, and generates a new sample S3 that lies between S1 and S2. An example is given in Figure 3.5. SMOTE repeats this process until the classes are balanced.

Figure 3.4 Description of the concept of SMOTE in oversampling process.

Figure 3.5 A depiction of the SMOTE approach.
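The S1, S2, S3 procedure can be sketched for purely numeric samples as below. This is a simplified illustration that uses squared Euclidean distance and made-up two-dimensional points; the thesis instead compares cube instances with its own attribute-wise similarity, so the distance function here is an assumption.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate `n_new` synthetic samples by interpolating between minority neighbours.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        s1 = rng.choice(minority)
        # The k nearest samples of s1 within the minority class.
        neighbours = sorted(
            (s for s in minority if s is not s1),
            key=lambda s: sum((x - y) ** 2 for x, y in zip(s1, s)),
        )[:k]
        s2 = rng.choice(neighbours)
        gap = rng.random()  # position of S3 on the segment from S1 to S2
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(s1, s2)))
    return synthetic


# Made-up data: 4 minority samples against 10 majority samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
n_majority = 10
synthetic = smote(minority, n_new=n_majority - len(minority))
print(len(minority) + len(synthetic), "minority samples after oversampling")
```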

In the SMOTE approach, we need to calculate the similarity between any two cube instances. We use different calculation methods for different types of attributes. For a binary attribute like Gender, we set the similarity to 1 if the two instances have the same value and to 0 otherwise. Because Age is an ordinal attribute with 10 levels, we normalize the similarity between the two instances by dividing the value by 10. Drug requires special attention because it is represented by an ATC code, which itself contains five components. We thus split the code into its five components and calculate the similarity by counting the individual similarity of each component. An example is shown in Figure 3.6.

Figure 3.6. An illustration of instance similarity calculation.
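One possible reading of these rules is sketched below. Several details are assumptions that the text does not fix: Age similarity is taken as 1 minus the level difference divided by 10, Drug similarity as the fraction of matching ATC components, and the overall similarity as the plain average of the three parts. The function names and sample instances are illustrative.

```python
def split_atc(code):
    """Split a full 7-character ATC code into its five components."""
    return (code[0], code[1:3], code[3], code[4], code[5:7])


def gender_similarity(g1, g2):
    return 1.0 if g1 == g2 else 0.0


def age_similarity(level1, level2, n_levels=10):
    # Assumption: similarity decreases linearly with the level difference.
    return 1.0 - abs(level1 - level2) / n_levels


def drug_similarity(atc1, atc2):
    # Assumption: each matching component contributes 1/5.
    matches = sum(p == q for p, q in zip(split_atc(atc1), split_atc(atc2)))
    return matches / 5


def instance_similarity(x, y):
    # Assumption: the three attribute similarities are averaged with equal weight.
    parts = [
        gender_similarity(x["gender"], y["gender"]),
        age_similarity(x["age"], y["age"]),
        drug_similarity(x["drug"], y["drug"]),
    ]
    return sum(parts) / len(parts)


# Made-up instances, for illustration only.
x = {"gender": "F", "age": 3, "drug": "A10BA02"}
y = {"gender": "F", "age": 5, "drug": "A10BB01"}
print(instance_similarity(x, y))
```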

3.2 Model Building

In this section, we show the structure of the deep learning model and how we
