
Chapter 3 The Proposed Deep Learning ADR Prediction

3.1 Preprocessing

Rather than using the original FAERS data, we use the ADR contingency cubes built for our iADRs analysis system [17][18]. From the ADR cubes we chose the most detailed subcube, composed of Age, Gender, Drug name, Symptom, and four numerical values a, b, c, and d that record the cell values of the corresponding 2×2 contingency table for ADR detection. Note that Age is discretized into 10 levels and the drug is represented by its ATC code. Figure 3.2 shows an example of this subcube in table format.

Figure 3.2 Table representation of an example ADR subcube.
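As a minimal sketch of how one row of this subcube might be held in memory, the record below mirrors the attributes listed above. The field names, the 0 to 9 encoding of the Age levels, the Gender codes, and all example values are assumptions made for illustration; they are not taken from the iADRs system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CubeInstance:
    """One row of the most detailed ADR subcube (field names are illustrative)."""
    age_level: int   # Age discretized into 10 levels; 0..9 is an assumed encoding
    gender: str      # assumed encoding, e.g. "F" or "M"
    atc_code: str    # drug identified by its ATC code
    symptom: str     # reported adverse reaction term
    a: int           # a, b, c, d: the four cells of the 2x2 contingency table
    b: int
    c: int
    d: int

    def total(self) -> int:
        """Sum of the four contingency cells."""
        return self.a + self.b + self.c + self.d


# Invented example values, for illustration only.
row = CubeInstance(age_level=6, gender="F", atc_code="A10BA02",
                   symptom="Myocardial infarction",
                   a=12, b=340, c=3509, d=671957)
print(row.total())
```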

Using this subcube as input, the preprocessing module transforms it into a training set for building our deep learning network model. The following subsections explain the main procedures, including class labeling and disambiguation, class binarization, and class imbalance handling, along with the roles of MedDRA and SIDER in the preprocessing.

3.1.1 Class labeling

In Section 2.1 we mentioned that the causal relationship between drugs and adverse reactions in the FAERS reports has not been verified. This uncertainty hinders the construction of a training set, which requires correct class labels. To solve this problem, we utilize the well-known Side Effect Resource (SIDER), proposed by Kuhn et al. in 2008 [13], which contains side effect (adverse reaction) information about marketed drugs. Its data is collected from the FDA, national registries, and charity organizations. It records 1,430 drugs, 5,868 adverse reactions, and 139,756 drug-SE pairs. In the database, adverse reactions are represented by the PT and LLT layers of MedDRA, and drug names are available both in versions encoded by the STITCH system [12] and in versions encoded by the ATC system. More specifically, we inspected all data in the ADR cubes and moved every case whose drug-ADR pair was found in SIDER into the data to be processed to form the training set.
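The labeling step amounts to a membership test against the set of drug-ADR pairs known to SIDER. The following is a minimal sketch under stated assumptions: the function name `keep_sider_confirmed`, the dictionary keys, and the toy pairs are hypothetical and are not read from the SIDER files.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (ATC code, adverse reaction term)


def keep_sider_confirmed(cases: Iterable[dict], sider_pairs: Set[Pair]) -> List[dict]:
    """Keep only the cases whose (drug, symptom) pair appears in the SIDER pair set."""
    return [c for c in cases if (c["atc_code"], c["symptom"]) in sider_pairs]


# Toy data invented for illustration; not extracted from SIDER or FAERS.
sider_pairs = {("A10BA02", "Nausea"), ("C10AA05", "Myalgia")}
cases = [
    {"atc_code": "A10BA02", "symptom": "Nausea"},
    {"atc_code": "A10BA02", "symptom": "Alopecia"},
    {"atc_code": "C10AA05", "symptom": "Myalgia"},
]
confirmed = keep_sider_confirmed(cases, sider_pairs)
print(len(confirmed))
```

Storing the pairs in a set keeps each lookup constant-time, which matters when every cube instance has to be checked.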

3.1.2 Class disambiguation

In the medical and healthcare community, professionals may use different words to record the symptoms they observe. We need to standardize these words to ensure that we can find all cases of a specific symptom. We therefore used the MedDRA database to solve this problem. MedDRA was developed by the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) and ICH partners, including WHO. It was designed to standardize the terminology required by regulators and industry, and it is updated frequently to meet users' requirements. MedDRA describes symptoms at five levels, as shown in Table 3.1.

Table 3.1 Levels in MedDRA

| Level | Content |
|---|---|
| Lowest Level Terms (LLT) | How observations are reported |
| Preferred Terms (PT) | A distinct descriptor (single medical concept) for a symptom, sign, etc. |
| High Level Terms (HLT) | Groups in which PTs are summarized according to anatomy, pathology, physiology, or their function |
| High Level Group Terms (HLGT) | Interrelated HLTs |
| System Organ Classes (SOC) | Grouping of etiology, manifestation site, and purpose |

LLTs contain many synonyms, so we first described the symptoms with PTs. However, when we targeted a specific symptom, only a few cases were found to meet our needs, which would have a negative impact on the learning of the model. We therefore describe the symptoms at a higher level. To keep the groups of symptoms close to meaningful synonyms, we chose the layer directly above PT, i.e., HLT.
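The effect of moving from PT to HLT can be sketched as a simple relabeling that pools the case counts of several PTs under their shared HLT. The mapping below uses placeholder names ("PT_A", "HLT_X", and so on); in practice it would be built from the MedDRA hierarchy files, whose layout is not assumed here.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_hlt(pts: Iterable[str], pt_to_hlt: Dict[str, str]) -> Counter:
    """Count cases per HLT; a PT missing from the mapping keeps its own name."""
    return Counter(pt_to_hlt.get(pt, pt) for pt in pts)


# Placeholder mapping and case list, invented for illustration.
pt_to_hlt = {"PT_A": "HLT_X", "PT_B": "HLT_X", "PT_C": "HLT_Y"}
reported_pts = ["PT_A", "PT_B", "PT_A", "PT_C"]

per_pt = Counter(reported_pts)
per_hlt = count_by_hlt(reported_pts, pt_to_hlt)
print(per_pt)
print(per_hlt)
```

In this toy example, the two PTs with two and one cases respectively are pooled into one HLT with three cases, which illustrates how the higher level yields more positive cases per target.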

3.1.3 Class binarization

By nature, ADR prediction is a multi-class classification task, since a drug may cause many different adverse reactions; for the following reason, however, we transform it into a binary classification problem. Recall that the adverse reactions recorded in the SRS system are uncertain, which makes it difficult to determine the right combination of adverse reactions when constructing the training dataset. We thus adopted the one-to-the-others strategy. That is, for each specific adverse reaction, we build a binary classification model that predicts whether a patient who takes a particular drug will suffer that adverse reaction. For example, to predict whether myocardial infarction will occur after a patient takes a specific drug, we use all cube instances that record myocardial infarction as positive cases and all cube instances that record other PTs as negative cases, as illustrated in Figure 3.3.

However, this makes the set of negative cases much larger than the set of positive cases, because a drug may be followed by many kinds of adverse reactions, and the one we focus on is only one of them. This problem is handled in the next subsection. Another phenomenon needs special attention. Since a drug may cause more than one adverse reaction, it is very likely that after class binarization the same drug will appear in both the positive and the negative class. Because our focus is on the positive cases, that is, on identifying the drugs that cause the adverse reaction of concern, the presence of such a drug in the negative class makes it harder for the binary model to reach a positive decision. We call such negative cases perplexing negatives. A simple solution is to eliminate the perplexing negatives from the training set. For example, in Figure 3.3, instances 2 to 5 will be eliminated. The effect of pruning perplexing negatives is shown in Section 4.2.

Figure 3.3 An example of one-to-the-others strategy
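The one-to-the-others split and the pruning of perplexing negatives can be sketched as follows. This is an illustration under assumptions rather than the exact procedure used in our system: a negative case is treated as perplexing when its drug also occurs in some positive case (matching on the ATC code only), and the function name, dictionary keys, and toy data are hypothetical.

```python
from typing import List, Tuple


def binarize(cases: List[dict], target: str,
             prune_perplexing: bool = True) -> Tuple[List[dict], List[dict]]:
    """Split cases into positives (target symptom) and negatives (all others).

    If prune_perplexing is True, negatives whose drug also appears among the
    positives are dropped (assumption: drugs are matched by ATC code only).
    """
    positives = [c for c in cases if c["symptom"] == target]
    negatives = [c for c in cases if c["symptom"] != target]
    if prune_perplexing:
        positive_drugs = {c["atc_code"] for c in positives}
        negatives = [c for c in negatives if c["atc_code"] not in positive_drugs]
    return positives, negatives


# Toy data invented for illustration; "D1" and "D2" are placeholder drug codes.
cases = [
    {"atc_code": "D1", "symptom": "Myocardial infarction"},
    {"atc_code": "D1", "symptom": "Nausea"},    # perplexing negative
    {"atc_code": "D2", "symptom": "Nausea"},
]
pos, neg = binarize(cases, "Myocardial infarction")
pos_raw, neg_raw = binarize(cases, "Myocardial infarction", prune_perplexing=False)
print(len(pos), len(neg), len(neg_raw))
```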

3.1.4 Class imbalance handling

After class binarization, all instances in the ADR cube are divided into two categories: those that record the specific adverse reaction and those that do not. Note that the second category is significantly larger than the first. For example, in the data for 2004-2006, only 3,521 cube instances record myocardial infarction, whereas 672,297 instances record other adverse reactions. This is the well-known class imbalance problem, which causes the learned model to exhibit very low accuracy on the rare class.
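To make the scale of the imbalance concrete, the short calculation below uses the two counts quoted above. The "always predict negative" baseline is added here purely as an illustration of why overall accuracy is misleading in this setting; it is not one of the models evaluated in this work.

```python
positives = 3_521      # instances recording myocardial infarction (2004-2006)
negatives = 672_297    # instances recording other adverse reactions

imbalance_ratio = negatives / positives            # negatives per positive
positive_share = positives / (positives + negatives)
# A trivial model that always predicts "negative" is right on every negative
# instance and wrong on every positive one.
baseline_accuracy = negatives / (positives + negatives)
baseline_recall_on_rare_class = 0 / positives

print(f"ratio ~ {imbalance_ratio:.0f}:1")
print(f"positive share ~ {positive_share:.2%}")
print(f"baseline accuracy ~ {baseline_accuracy:.2%}")
```

The ratio is roughly 191 negatives per positive, so the positives make up about half a percent of the data, and a model can score above 99% overall accuracy while never detecting the rare class.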

To solve this problem, we used the SMOTE [5] approach to oversample the rare category, which corresponds to the specific adverse reaction of concern. SMOTE generates synthetic samples of the rare class to balance the data, as shown in Figure 3.4.

More specifically, SMOTE randomly selects a rare-class sample S1, finds the K nearest neighbors of S1 within the same class, randomly selects one sample S2 from these neighbors, and generates a new sample S3 that lies between S1 and S2. An example is given in Figure 3.5. SMOTE repeats this process until the classes are balanced.

Figure 3.4 Description of the concept of SMOTE in oversampling process.

Figure 3.5 A depiction of the SMOTE approach.
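The sampling loop described above can be sketched in a few lines. This is a simplified illustration, not the implementation used in our experiments: it assumes purely numeric feature vectors and Euclidean distance, whereas our data requires an attribute-specific similarity, and the function name `smote` and its parameters are chosen for this example only.

```python
import math
import random
from typing import List, Sequence


def smote(minority: List[Sequence[float]], n_new: int,
          k: int = 5, seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic minority samples by interpolation.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        s1 = rng.choice(minority)
        # The k nearest same-class neighbours of S1, excluding S1 itself.
        others = [x for x in minority if x is not s1]
        neighbours = sorted(others, key=lambda x: math.dist(s1, x))[:k]
        s2 = rng.choice(neighbours)
        # S3 lies on the segment between S1 and S2.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(s1, s2)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=6, k=2)
print(len(new_samples))
```

To balance the classes, `n_new` would be set to the difference between the sizes of the majority and minority classes.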

In the SMOTE approach, we need to calculate the similarity between any two cube instances. We use different calculation methods for different types of attributes. For a binary attribute such as Gender, we set the similarity to 1 if the two instances have the same value and to 0 otherwise. Because Age is an ordinal attribute with 10 levels, we normalize the difference between the two instances' Age levels by dividing it by 10.

Special attention is needed for Drug, because it is represented by an ATC code, which itself contains five components. We therefore split the code into its five components and calculate the similarity by combining the individual similarities of the components.

An example is shown in Figure 3.6.

Figure 3.6 An illustration of instance similarity calculation.
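The per-attribute similarities can be sketched as below. Several details are assumptions made for this example: Age similarity is taken as one minus the normalized level difference, the ATC code is split by its standard 1-2-1-1-2 character layout (for example "A10BA02"), and Drug similarity is the fraction of components that match. How the three values are combined into a single instance similarity is not fixed here.

```python
from typing import Tuple


def gender_similarity(g1: str, g2: str) -> float:
    """1 if the two Gender values are equal, otherwise 0."""
    return 1.0 if g1 == g2 else 0.0


def age_similarity(level1: int, level2: int, n_levels: int = 10) -> float:
    """Assumed form: 1 minus the level difference divided by the number of levels."""
    return 1.0 - abs(level1 - level2) / n_levels


def split_atc(code: str) -> Tuple[str, str, str, str, str]:
    """Split a 7-character ATC code into its five components."""
    return (code[0:1], code[1:3], code[3:4], code[4:5], code[5:7])


def drug_similarity(code1: str, code2: str) -> float:
    """Assumed form: fraction of the five ATC components that match."""
    pairs = zip(split_atc(code1), split_atc(code2))
    # Empty components (from shorter codes) never count as a match.
    return sum(1 for x, y in pairs if x and x == y) / 5


print(split_atc("A10BA02"))
print(drug_similarity("A10BA02", "A10BB01"))
print(age_similarity(3, 5))
```

In the example call, "A10BA02" and "A10BB01" agree on the first three components and differ on the last two, so three of the five components match.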

From the document: Using Deep Learning for Adverse Drug Reaction Signal Detection (pages 24-31)