Chapter 2 Background
2.2 Measures for ADR Detection
The task of adverse drug reaction (ADR) detection relies heavily on a judicious choice of measures. ADR detection measures can be divided into two categories: measures of disproportionality and Bayesian methods.
Two primary measures of disproportionality are the Proportional Reporting Ratio (PRR) [13] and the Reporting Odds Ratio (ROR) [37]. PRR is the proportion of ADR reports for a given drug that involve a specific adverse reaction, divided by the corresponding proportion for all other drugs in the database. ROR is the ratio of the number of reports of a specific adverse reaction for the suspected drug to that for all other drugs, divided by the corresponding ratio for all other adverse reactions. Both can be computed from a 2×2 contingency table, shown in Table 2.2. The PRR and ROR measures are defined as follows:
PRR ≡ [a / (a + b)] / [c / (c + d)]
ROR ≡ (a / c) / (b / d)
Table 2.2 The 2×2 contingency table for ADR measurement.
                  Suspected ADR   Without the Suspected ADR   Total
Suspected Drugs   a               b                           a + b
Other Drugs       c               d                           c + d
Total             a + c           b + d                       N = a + b + c + d
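As a minimal sketch, the two disproportionality measures can be computed directly from the four cells of Table 2.2. The function names and the counts below are hypothetical and chosen only for illustration; the sketch also assumes that no cell is zero, since a zero in c, b, or d would cause a division by zero.

```python
def prr(a, b, c, d):
    """Proportional Reporting Ratio: [a / (a + b)] / [c / (c + d)]."""
    return (a / (a + b)) / (c / (c + d))


def ror(a, b, c, d):
    """Reporting Odds Ratio: (a / c) / (b / d)."""
    return (a / c) / (b / d)


# Hypothetical counts for one drug-ADR pair (not taken from any real dataset).
a, b, c, d = 20, 80, 100, 9800
print("PRR =", prr(a, b, c, d))
print("ROR =", ror(a, b, c, d))
```

Both values are well above 1 for these counts, which is the situation in which a disproportionality measure would flag the pair for further review.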
The best-known Bayesian method is the Bayesian Confidence Propagation Neural Network (BCPNN) [27], which implements Bayesian statistics in a neural network architecture and calculates a measure called the information component (IC), which quantifies the degree of association between two variables. Suppose x is a drug and y is an ADR. The IC measure is defined as follows:
IC = log2 [p(x, y) / (p(x) p(y))] ≡ log2 [a(a + b + c + d) / ((a + b)(a + c))]
where p(x) is the probability of drug x, p(y) is the probability of ADR y, and p(x, y) is the probability that drug x and ADR y appear together in the ADR reports. A drug is considered highly associated with an ADR if the IC value of the drug-ADR pair exceeds a threshold.
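The IC formula can be evaluated from the same contingency-table counts. The following sketch computes only the point value given by the formula above; it does not model any prior or confidence interval, and the function name and counts are hypothetical.

```python
import math


def information_component(a, b, c, d):
    """IC = log2 [a * N / ((a + b) * (a + c))], with N = a + b + c + d."""
    n = a + b + c + d
    return math.log2(a * n / ((a + b) * (a + c)))


# Hypothetical counts: the drug and the ADR co-occur far more often than
# independence would predict, so the IC is clearly positive.
ic_associated = information_component(20, 80, 100, 9800)

# Hypothetical counts where p(x, y) = p(x) p(y), so the IC is zero.
ic_independent = information_component(10, 90, 90, 810)

print(ic_associated, ic_independent)
```

An IC of zero corresponds to statistical independence between the drug and the ADR, which is why thresholds on the IC are placed above zero.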
Table 2.3 summarizes contemporary ADR measures and the thresholds used in the pharmacovigilance community for detecting ADRs.
Table 2.3 A summary of contemporary ADR measures.
Measure Formula Threshold
Proportional
Chapter 3
Related Work
3.1 Duplicate Detection
The problem of duplicate detection, also known as record linkage, has long been studied in the statistics community. For example, a study conducted by the U.S. Census Bureau in 1985 considered the integration of different census units [22] and developed duplicate detection techniques to determine whether two different records refer to the same person.
Earlier work on duplicate detection mainly focused on single-field matching, that is, determining whether two fields (attributes) refer to the same value. Different types of items require different detection methods [11]. Character-based similarity metrics are designed to handle typographical errors in fields such as name and address. Examples include edit distance [28], affine gap distance [48], Smith-Waterman distance [39], the Jaro distance metric [49], and Q-gram distance [45]. Methods for measuring the similarity of numeric attributes, e.g., height and weight, are rather primitive. Usually, numerical data are treated as strings and compared using the metrics described above.
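To make the character-based metrics concrete, the sketch below implements the edit (Levenshtein) distance by dynamic programming. The normalization into a similarity in [0, 1] is an assumption made here for illustration and is not prescribed by the cited work.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(s, t):
    """Assumed normalization: 1 minus the distance divided by the longer length."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


print(edit_distance("kitten", "sitting"))
print(edit_similarity("Jonh Smith", "John Smith"))
```

Because numeric attributes are usually compared as strings, the same function can be applied to values such as "172" and "127", which illustrates why such treatment is primitive: the two heights are far apart numerically but close in edit distance.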
In the real world, records consist of multiple types of fields, so many studies have proposed methods for matching records with multiple fields. These approaches can be broadly divided into two categories [11]: (1) probabilistic approaches and supervised machine learning techniques, and (2) approaches based on domain knowledge or distance metrics. The former require training data, while the latter do not.
The main idea of probabilistic approaches is to divide training record pairs into two classes, M (matching) and U (unmatching), estimate the corresponding probabilities, and perform Bayesian inference to determine the classes of unknown record pairs. Supervised machine learning techniques usually transform the training data into record pairs labeled as matching or unmatching, and then apply any classification method developed in the machine learning community, such as decision trees, SVM, neural networks, or KNN, to solve the problem.
Unlike the first category, the domain knowledge or distance metrics based approaches require no training data. A commonly used distance-based approach is to measure the similarity between individual attributes, using the appropriate metrics described previously, and then combine these similarities to measure the similarity of two records. A threshold is set to determine the matching of the two records.
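The distance-based approach described above can be sketched as follows. The field similarity uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the metrics discussed earlier; the equal weights, the threshold of 0.85, and the sample records are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def field_similarity(x, y):
    """Similarity of two field values in [0, 1], compared as strings."""
    return SequenceMatcher(None, str(x), str(y)).ratio()


def record_similarity(r1, r2, weights=None):
    """Weighted combination of per-field similarities (equal weights by default)."""
    n = len(r1)
    weights = weights or [1.0 / n] * n
    return sum(w * field_similarity(x, y) for w, x, y in zip(weights, r1, r2))


def is_match(r1, r2, threshold=0.85):
    """Declare a match when the combined similarity reaches the threshold."""
    return record_similarity(r1, r2) >= threshold


# Hypothetical records with two fields each (name, address).
r1 = ("John Smith", "12 Main St")
r2 = ("Alice Wong", "99 Oak Ave")
print(is_match(r1, r1), is_match(r1, r2))
```

No training data is involved; the only tunable parts are the field metric, the weights, and the threshold, which is what distinguishes this category from the first.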
Although much work has been done on duplicate record detection, very little of it has been devoted to datasets on adverse drug reactions. To the best of our knowledge, the only work on duplicate detection methods tailored to the domain of adverse drug reactions is the study by Noren et al. [34]. They proposed a modified hit-miss model for automated duplicate detection in the WHO drug safety database. Their method, however, focuses only on identifying highly similar record pairs and does not take into account the existence of follow-up reports; it therefore cannot discriminate real duplicates from follow-up linkages.
In summary, very little of the literature has been devoted to duplicate detection in adverse drug reaction reporting systems, and none of it has considered the existence of follow-up reports.
3.2 Classification
Classification is the task of assigning objects with unknown class labels to one of a set of predefined classes. In general, this task is achieved by learning a model from a set of data with known class labels, called the training set. The learned classification model can serve two different purposes: analysis and prediction. Analysis refers to exploring the factors that influence data classification; for example, from the established model we can generate classification rules that present the factors affecting data classification. Prediction refers to using the model to predict the class label of unseen data.
A classification technique is a systematic approach to building classification models from an input data set. A typical classification process is shown in Figure 3.1. First, the model is built from a training set. Second, the performance of the model is evaluated using a test set. Once the performance of the classification model meets the requirements, we can start using the model to predict the labels of new data.
Many classification algorithms have been proposed in the literature, for example, decision trees, rule-based methods, nearest-neighbor, Bayesian classifiers, Artificial Neural Networks (ANN), and Support Vector Machines (SVM). Most of them are clearly described in textbooks [25][43]. In the following subsections, we focus on two more recently developed branches of classification methods, ensemble learning and co-training.
Figure 3.1 A general process for building a classification model.
3.2.1 Ensemble learning
The idea of ensemble learning is to build multiple classifiers from the training data and to predict the label of new data by aggregating (voting on) the predictions made by these classifiers. As shown in Figure 3.2, ensemble learning first creates multiple subsets, s1, s2, ..., sk, from the original training data D, then builds from each subset si, 1 ≤ i ≤ k, a classifier ci with weight wi, and finally produces the overall classifier C(X) = w1c1(X) + ... + wkck(X), where X denotes an example. Many experimental results and research reports have shown that ensemble learning usually yields more accurate results than any single classifier. For example, Freund and Schapire in 1996 tested 22 benchmark problems. Their results showed that with the ensemble method one of the problems exhibited little improvement, four of the problems performed relatively poorly, and the other 18 problems improved significantly [17].
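The combination rule C(X) can be read as a weighted vote: each classifier adds its weight to the label it predicts, and the label with the largest total wins. The sketch below assumes classifiers are plain Python callables that return a label; the three threshold classifiers and their weights are hypothetical.

```python
from collections import defaultdict


def weighted_vote(classifiers, weights, x):
    """Return the label with the largest total weight over all classifiers."""
    scores = defaultdict(float)
    for clf, w in zip(classifiers, weights):
        scores[clf(x)] += w
    return max(scores, key=scores.get)


# Three hypothetical base classifiers c1, c2, c3 on a single numeric feature.
classifiers = [
    lambda x: "pos" if x > 0 else "neg",
    lambda x: "pos" if x > 5 else "neg",
    lambda x: "pos" if x > 10 else "neg",
]
weights = [0.6, 0.25, 0.15]  # hypothetical weights w1, w2, w3

print(weighted_vote(classifiers, weights, 3))    # c1 outweighs c2 and c3 together
print(weighted_vote(classifiers, [1, 1, 1], 3))  # plain majority vote
```

With unequal weights a single accurate classifier can override the majority, whereas equal weights reduce the rule to simple majority voting.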
Figure 3.2 The basic paradigm of ensemble learning.
Conceptually, there are two types of ensemble strategy, homogeneous ensemble and heterogeneous ensemble, as shown in Figure 3.3. The homogeneous ensemble applies a base learning algorithm, for example, a decision tree method (J48 in Figure 3.3(a)), to different training subsets to construct multiple classifiers, and then combines them into a single classifier by assigning higher weights to more accurate classifiers and lower weights to less accurate ones. The performance of this type of ensemble learning relies heavily on the way the multiple training subsets are constructed, i.e., step 1 in Figure 3.1. The two most widely used approaches are bagging [44][5] and boosting [16][38]. The bagging approach employs bootstrap sampling to obtain the training subsets. That is, each subset is constructed by sampling with replacement, with its size equal to that of the original training data. Random Forests [6] is the best-known ensemble learning method adopting this approach. The boosting approach employs an iterative procedure to adaptively change the weight of each training example, i.e., the probability that a training example is selected for inclusion in each subset si. Initially, all training examples receive equal weights; the first subset is constructed, and so is the first classifier. In each subsequent iteration, the weights of examples wrongly classified in the previous iteration are increased, while the weights of those correctly classified are decreased.
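The two subset-construction strategies can be sketched in a few lines. The bootstrap sample follows the description of bagging directly. For boosting, the text does not fix a particular update factor, so the sketch assumes the AdaBoost update (multiply by exp(α) or exp(−α), then renormalize) and assumes that the input weights sum to 1; the function names and data are hypothetical.

```python
import math
import random


def bootstrap_sample(data, rng):
    """Bagging: draw len(data) examples with replacement."""
    return rng.choices(data, k=len(data))


def reweight(weights, misclassified):
    """Boosting: raise the weights of misclassified examples, lower the rest.

    Assumes the AdaBoost factor and weights that sum to 1.
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < error < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - error) / error)
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated]  # renormalize to a distribution


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = ["r1", "r2", "r3", "r4"]
subset = bootstrap_sample(data, rng)

# Four equally weighted examples; only the first was misclassified.
new_weights = reweight([0.25] * 4, [True, False, False, False])
print(subset, new_weights)
```

After the update, the misclassified example carries half of the total weight, so it is much more likely to be selected for the next subset.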
The heterogeneous ensemble builds each classifier with a different learning algorithm (similar to a multi-expert system), executed on either the same training set or different subsets. By integrating the separate hypotheses and diverse characteristics embedded in each learning algorithm, the resulting classifier can usually yield higher predictive accuracy than any single classifier built by a specific learning algorithm.
Figure 3.3(b) shows a heterogeneous ensemble composed of four different classifiers built from four different representative learning algorithms, including Bayes (Naïve Bayes), IBk (Instance based learning), JRip (Learning rules by induction), and J48 (Decision tree learning).
Figure 3.3 (a) The concept of homogeneous ensemble and (b) the concept of heterogeneous ensemble.
3.2.2 Co-training learning
Co-training was introduced by Blum and Mitchell in 1998 [4] to build learning models from data sets with very few labeled examples and large amounts of unlabeled data. The original model was developed on the assumption that there exist two different views (characterizations or feature sets of the data) with which to classify the data and that these two views are conditionally independent. For example, a web page can be classified according to the words occurring on that page (one view) or the words occurring in hyperlinks pointing to that page (another view).
Co-training builds models in the following way. Initially, each view (feature set) is used to build a classifier from the labeled data, resulting in two separate classifiers. In subsequent iterations, the labeled data is augmented with unlabeled examples whose selection and labels are determined cooperatively from the results of the two classifiers. The two classifiers are then retrained on the expanded training data. The entire process continues until the performance of the classifiers converges or no new unlabeled examples can be selected. In short, co-training uses the unlabeled data to bootstrap the classifiers, helping them achieve better classification results.
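The iterative procedure can be illustrated with a deliberately small sketch. It assumes that each view is a single numeric feature, uses a nearest-centroid rule as a stand-in base classifier, and lets each view move its single most confident unlabeled example into the labeled pool per round. All names, the confidence measure, and the data are hypothetical simplifications rather than the original algorithm of [4].

```python
def train_centroids(examples):
    """examples: list of (value, label). Returns {label: mean value}."""
    sums, counts = {}, {}
    for v, y in examples:
        sums[y] = sums.get(y, 0.0) + v
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, v):
    """Return (label, confidence); confidence is the margin to the runner-up."""
    ranked = sorted(centroids, key=lambda y: abs(v - centroids[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, 0.0
    return best, abs(v - centroids[ranked[1]]) - abs(v - centroids[best])


def co_train(labeled, unlabeled, per_round=1, max_rounds=10):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        if not unlabeled:
            break
        chosen = {}  # index into unlabeled -> assigned label
        for view in (0, 1):
            model = train_centroids([(x[view], y) for x, y in labeled])
            scored = sorted(((predict(model, x[view]), i)
                             for i, x in enumerate(unlabeled)),
                            key=lambda item: item[0][1], reverse=True)
            for (label, _), i in scored[:per_round]:
                chosen.setdefault(i, label)  # on conflict, the first view wins
        for i in sorted(chosen, reverse=True):  # pop from the back first
            labeled.append((unlabeled.pop(i), chosen[i]))
    models = [train_centroids([(x[v], y) for x, y in labeled]) for v in (0, 1)]
    return models, labeled


seed = [((0.0, 0.0), "A"), ((10.0, 10.0), "B")]
pool = [(1.0, 2.0), (9.0, 8.0), (4.0, 3.0), (6.0, 7.0)]
models, grown = co_train(seed, pool)
print(dict(grown))
```

The examples nearest to the seed points are labeled first, and the retrained centroids then make the borderline examples easier to assign, which is the bootstrapping effect described above.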
The rationale behind co-training is intuitive. A small amount of training data is not necessarily representative of the entire data set. Judiciously adding appropriate unlabeled data to the training data hence makes it more representative of the entire population.
Since the work of Blum and Mitchell, co-training has been extended in several directions, including the number of classifiers [10][30][53], the use of multiple learning algorithms [42][52], and feature selection [47][50]. All extensions retain the core idea of co-training: exploiting the characteristics of large amounts of unlabeled data to improve the accuracy of models.
Chapter 4
Duplicate Detection in FAERS Dataset
4.1 Characteristics of FAERS Dataset
FAERS is a database designed to help the FDA monitor the safety of drugs after they are marketed; it contains reports of adverse events and medication errors. In the United States, healthcare professionals (such as doctors, pharmacists, and nurses) and consumers (such as patients, their families, and lawyers) report adverse events and medication errors voluntarily. These reports may also be submitted to the manufacturers. The manufacturers, however, are obligated by FDA regulation to regularly report to the FDA any adverse events associated with their medical products.
As described in Section 2.1, the FAERS database is composed of seven relational tables, of which the DEMO table is the master file that records information directly related to patients and the report itself. There are 23 attributes in total in the DEMO file: ISR, CASE, I_F_COD, FOLL_SEQ, IMAGE, EVENT_DT, MFR_DT, FDA_DT, REPT_COD, MFR_NUM, MFR_SNDR, AGE, AGE_COD, GNDR_COD, E_SUB, WT, WT_COD, REPT_DT, OCCP_COD, DEATH_DT, TO_MFR, CONFID, and REPORTER_COUNTRY. The detailed meaning of each attribute is shown in Table 4.1. We particularly highlight some attributes that are important to the duplication problem.
ISR (the primary key) is the unique number identifying an FAERS report, while CASE is the number identifying an FAERS case. In other words, an FAERS case (event) may have several different reports, i.e., different ISRs. The main reason behind this phenomenon is that an initial report may be followed by follow-up reports referring to the same FAERS case. This information is recorded in the I_F_COD field, with “I” denoting initial and “F” denoting follow-up.
Table 4.1 Detailed description of the DEMO table.
Field name         Null probability (%, 07Q2)   Explanation
ISR                0                            Unique number for identifying an FAERS report.
CASE               0                            Number for identifying an FAERS case.
I_F_COD            0                            Initial or follow-up status of report.
FOLL_SEQ           93.1                         The sequence number of a follow-up report.
IMAGE              0                            Identifier for an FAERS report image.
EVENT_DT           31.3                         Date adverse event occurred or began.
MFR_DT             6.9                          Date manufacturer received information.
FDA_DT             0                            Date FDA received report.
REPT_COD           0                            Type of report submitted.
MFR_NUM            6.9                          Manufacturer's unique report identifier.
MFR_SNDR           6.9                          Verbatim name of manufacturer sending report.
AGE                38.8                         Age.
AGE_COD            38.8                         Unit.
GNDR_COD           5.8                          Gender.
E_SUB              0                            This report was submitted under the electronic submissions procedure.
WT                 65.9                         Weight.
WT_COD             65.9                         Unit.
REPT_DT            ≈0                           Date report was sent.
OCCP_COD           16.3                         Reporter's type of occupation.
DEATH_DT           92.7                         This field remains but is no longer populated with data from 2010.
TO_MFR             93.1                         Voluntary reporter also notified manufacturer.
CONFID             93.1                         Voluntary reporter's identity should not be disclosed to the product manufacturer.
REPORTER_COUNTRY   0                            Reporter's country.
There are many missing values in the FAERS database. As an illustration, we analyzed the 2007Q2 dataset and computed the probability of each attribute being null. The statistics are also listed in Table 4.1. Four attributes have null probabilities over 90%: FOLL_SEQ (the sequence number of a follow-up report), DEATH_DT (the death date of the patient), TO_MFR (whether or not the reporter also notified the manufacturer), and CONFID (whether or not the voluntary reporter's identity should be withheld from the manufacturer). Since null values cause many data analysis problems, we describe our approach to handling missing values in Section 5.1.
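A null-probability statistic of this kind can be computed as sketched below. The sketch assumes that records are held as dictionaries and that a missing value appears as `None` or an empty string; the function name and the four sample records are hypothetical and do not reproduce the figures in Table 4.1.

```python
def null_percentages(records, fields):
    """Percentage of records in which each field is missing (None or empty)."""
    total = len(records)
    return {
        f: 100.0 * sum(1 for r in records if r.get(f) in (None, "")) / total
        for f in fields
    }


# Hypothetical DEMO-like records.
demo = [
    {"AGE": "34", "WT": ""},
    {"AGE": None, "WT": ""},
    {"AGE": "60", "WT": "70"},
    {"AGE": "51", "WT": None},
]
stats = null_percentages(demo, ["AGE", "WT"])
print(stats)
```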
4.2 Duplicate Reporting Problem in FAERS Dataset
As described in Section 4.1, healthcare professionals and consumers can voluntarily report adverse events and medication errors directly to the FDA, and these reports can also be submitted to the manufacturers. In addition, according to the guidance released by the FDA [15], any drug manufacturer that becomes aware of an adverse event involving one of its products is obligated to report that event. Hence, if an event involves several drugs (as is usually the case), the FDA may receive multiple reports referring to the same case, but from different manufacturers.
All of the situations described above can cause duplicate records in the FAERS database, which may or may not be identified by the FDA. For example, Table 4.2 is a sample dataset extracted from FAERS. Records #12 and #13 have the same CASE no. but different ISR no., and both are recorded as initial reports. In other words, these two reports are duplicates submitted by different reporters. The last two records, however, show another scenario. They have different CASE no. but the same ISR no., and both are identified as initial reports. Since the two records exhibit the same values in all other attributes, we conclude that they are duplicates. Unfortunately, this case was not correctly identified by the FAERS system.
The problem of duplicate detection in the FAERS database is complicated by the existence of follow-up reports. A follow-up is a supplement to the initial report and contains updated information, such as corrections, medication changes, and changes in adverse reactions. As a consequence, a follow-up closely resembles its initial report or other linked follow-ups. A poorly designed detection method that overlooks this phenomenon would yield incorrect results, wrongly identifying (initial, follow-up) or (follow-up, follow-up) record pairs as duplicate reports, i.e., (initial, initial). The action taken for duplicate reports differs from that for initial / follow-up or follow-up / follow-up cases. If two records are identified as duplicates, only one record should be retained, whereas a follow-up should be merged with its initial report or preceding follow-up to form a more accurate report. Confusing these two situations biases the case counts used in ADR signal detection and results in incorrect signals. To the best of our knowledge, no previous work on ADR duplicate detection differentiates follow-up linkage from real duplicates.
Table 4.2 FAERS data sample.
In accordance with the previous discussion, we view the problem of duplicate detection in the FAERS database as a three-class classification problem. That is, given any two ADR records, we would like to classify the record pair into one of the following three scenarios (classes):
(1) Real duplicate (label is II): The two records are true duplicates of each other. In the FAERS data format, this case corresponds to records with identical CASE no., different ISR no., and both coded “I” in the I_F_COD field. For example, see the 12th and 13th records in Table 4.2. We use the label “II” to denote record pairs belonging to this scenario.
(2) Follow-up linkage (label is IF&FF): The relationship between the two records is initial / follow-up or follow-up / follow-up. According to the FAERS coding format, the former corresponds to records with the same CASE no. but different ISR no. and different I_F_COD values, one “I” and the other “F”. For example, the 13th and 14th records in Table 4.2 represent this case. The latter corresponds to records with identical CASE no. but different ISR no. and the same I_F_COD of “F”. Records 10 and 11 in Table 4.2 are an example of this case. We use the label “IF&FF” to denote this category.
(3) Others (label is OTHER): All cases other than real duplicate and follow-up linkage. We use the label “OTHER” to denote this category. Intuitively, real duplicates and follow-up linkages are rare; most record pairs belong to the “OTHER” category.
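The three labeling rules can be written as a small function over the CASE, ISR, and I_F_COD fields. This is a sketch that follows only the coding rules listed above; it assumes records are dictionaries keyed by field name, and the sample values are hypothetical rather than taken from Table 4.2.

```python
def pair_label(r1, r2):
    """Label a record pair as 'II', 'IF&FF', or 'OTHER' from the FAERS coding."""
    # Both duplicate and follow-up pairs share a CASE no. but differ in ISR no.
    if r1["CASE"] != r2["CASE"] or r1["ISR"] == r2["ISR"]:
        return "OTHER"
    codes = {r1["I_F_COD"], r2["I_F_COD"]}
    if codes == {"I"}:
        return "II"      # two initial reports of the same case
    if codes in ({"I", "F"}, {"F"}):
        return "IF&FF"   # initial / follow-up or follow-up / follow-up
    return "OTHER"


# Hypothetical records.
initial_1 = {"ISR": 1001, "CASE": 500, "I_F_COD": "I"}
initial_2 = {"ISR": 1002, "CASE": 500, "I_F_COD": "I"}
follow_up = {"ISR": 1003, "CASE": 500, "I_F_COD": "F"}
unrelated = {"ISR": 1004, "CASE": 501, "I_F_COD": "I"}

print(pair_label(initial_1, initial_2))
print(pair_label(initial_2, follow_up))
print(pair_label(initial_1, unrelated))
```

Rules of this form yield labels only for pairs that FAERS has already coded consistently, which is how they can supply training labels; duplicates that the coding itself misses are not captured by them.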
Chapter 5
Proposed Duplicate Detection Method
In this chapter, we introduce a duplicate detection algorithm based on ensemble and co-training learning. We first describe the data preprocessing mechanisms in Section 5.1, training set construction in Section 5.2, and then detail our proposed method in Section 5.3.
5.1 Data Preparation
The first step of data preparation is the choice of attributes. We chose fifteen fields (EVENT_DT, MFR_DT, FDA_DT, REPT_COD, MFR_NUM, MFR_SNDR, AGE, GNDR_COD, WT, REPT_DT, OCCP_COD, REPORTER_COUNTRY, DRUGNAME, PT, OUTCOME) from the FAERS database. An attribute was excluded if it contains many missing values or is irrelevant to the task of duplicate detection. The detailed reasons for choosing attributes are shown in Table 5.1.
Second, we deleted all tuples containing missing entries. If two records are both missing the same field, they may be treated as having the same value during similarity calculation, which would severely bias the analysis results. For example, the 07Q2 dataset originally contains 83977 records; 12789 records remain after deleting records with null values.
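The deletion step amounts to keeping only records that are complete on the selected fields. The sketch below assumes the same dictionary representation and the same notion of a missing value (`None` or an empty string) as an illustration; the function name and sample records are hypothetical.

```python
def drop_incomplete(records, fields):
    """Keep only records with a non-missing value in every selected field."""
    return [r for r in records
            if all(r.get(f) not in (None, "") for f in fields)]


# Hypothetical records restricted to three of the selected fields.
selected = ["AGE", "GNDR_COD", "WT"]
raw = [
    {"AGE": "34", "GNDR_COD": "F", "WT": "60"},
    {"AGE": "", "GNDR_COD": "M", "WT": "82"},
    {"AGE": "47", "GNDR_COD": "M", "WT": None},
]
complete = drop_incomplete(raw, selected)
print(len(raw), "->", len(complete))
```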
Table 5.1 Detail description for attribute selection.
Field name Null probability (07Q2)
populated with data from 2010.
TO_MFR 93.1 Y Too many null values would reduce the training data.
CONFID 93.1 Y Too many null values would reduce the training data.
REPORTER_COUNTRY 0
Finally, data preparation ends with transforming the data file into labeled record pairs, each of which is represented by a similarity vector. Let F be a selected data file consisting of g records and h attributes taken from the FAERS database, and let F’ denote a copy of F. For any two records ri and r’j, with ri from F and r’j from F’, we construct the following new record:
l  sim(fi1, f’j1)  sim(fi2, f’j2)  …  sim(fih, f’jh)
where l denotes the label of this record pair, fik and f’jk denote the values of attribute Ak in F and F’, respectively, for 1 ≤ i, j ≤ g and 1 ≤ k ≤ h, and sim(⋅) is the similarity function.
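The record-pair transformation defined above could be sketched as follows. Several choices here are assumptions made for illustration: the similarity function is `difflib.SequenceMatcher` from the standard library, each unordered pair is generated once rather than pairing every record of F with every record of F’, and the labeling function passed in is a hypothetical stand-in.

```python
from difflib import SequenceMatcher
from itertools import combinations


def sim(x, y):
    """Assumed similarity function: string similarity in [0, 1]."""
    return SequenceMatcher(None, str(x), str(y)).ratio()


def pair_vectors(records, attributes, label_fn):
    """Turn records into rows of [label, sim(A1), sim(A2), ..., sim(Ah)]."""
    rows = []
    for r1, r2 in combinations(records, 2):  # each unordered pair once
        rows.append([label_fn(r1, r2)] + [sim(r1[a], r2[a]) for a in attributes])
    return rows


# Hypothetical records and a stand-in labeling function.
records = [
    {"CASE": 500, "AGE": "34", "DRUGNAME": "ASPIRIN"},
    {"CASE": 500, "AGE": "34", "DRUGNAME": "ASPIRIN"},
    {"CASE": 501, "AGE": "71", "DRUGNAME": "WARFARIN"},
]
attributes = ["AGE", "DRUGNAME"]
rows = pair_vectors(records, attributes,
                    lambda r1, r2: "II" if r1["CASE"] == r2["CASE"] else "OTHER")
for row in rows:
    print(row)
```

Since g records produce g(g − 1)/2 unordered pairs, the number of rows grows quadratically with the size of the data file.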
Figure 5.1 illustrates the record pair transformation. After the transformation, every record is represented as a label attribute plus a vector of similarities. Figure 5.2 is a real