Chapter 1 Introduction
1.3 Thesis Organization
The other chapters of this thesis are organized as follows.
In Chapter 2, we first introduce some background about adverse drug reaction detection, including adverse drug reactions, ADR signal detection, and the adverse event reporting systems.
In Chapter 3, we discuss previous work on detecting duplicates [4][34], describe traditional classification methods [35], and introduce the two advanced classification methods adopted in our study: ensemble classification and co-training classification.
In Chapter 4, we detail the problem of record duplication in SRS databases, focusing specifically on the FAERS dataset. We describe the characteristics of the FAERS dataset, explain the situations that cause duplicates, and justify the importance of differentiating follow-up cases from initial reports.
Chapter 5 presents our proposed method for detecting duplicate reports in the FAERS database. We first detail the data pre-processing, including attribute selection and transformation, and the construction of training data. Then, we describe our method for identifying duplication and follow-up linkage, which is an ensemble classification method combined with co-training.
Chapter 6 describes the experiments and reports the results. We conducted two experiments. The first compares our method with four representative classification methods and an ensemble of these four methods. The second inspects how removing or retaining duplicate and follow-up reports affects ADR signal detection.
Finally, in Chapter 7, we present the conclusions and discuss future work.
Chapter 2 Background
2.1 Drug Development Process and Reporting Systems
With the advent of the diseases of modern civilization and of epidemics, many medical centers and manufacturers develop new drugs to fight new diseases. Every new drug must undergo a series of standard processes before listing (approval for market), beginning with pretrial experiments performed on animals, followed by several human clinical trials.
Consider the clinical trials in the USA as an example. There are four phases of clinical trials. As the drug development process listed in Table 2.1 shows, Phase I confirms the safety and dosage of new drugs; the subjects are healthy people. Phase II concerns effectiveness, and the subjects are a small group of patients, usually 100-300 volunteers. Phase III studies all the possible side effects, especially the effects of long-term usage (2-10 years), and compares the drug with similar products on the market. The number of subjects in this phase is significantly larger than in Phase II, usually 1000-3000. Phase III is the final step towards FDA approval: if the new drug survives this phase, a new drug application (NDA) is submitted to the FDA, which makes the final decision. After a new drug is listed, an additional Phase IV, also known as postmarketing surveillance, is carried out, during which the manufacturer must track long-term usage of the drug to obtain more comprehensive safety data.
For example, according to the FDA’s regulation [15], drug manufacturers have to report adverse drug events at quarterly intervals for the first three years after approval, and to report any serious adverse reactions whenever they occur. This is because even strictly designed premarketing clinical trials cannot uncover all possible adverse reactions. The risks of long-term usage of a new drug in the general population can only be evaluated through lifelong postmarketing surveillance. Therefore, many countries and drug-related agencies have established spontaneous reporting systems to collect adverse drug events submitted by drug manufacturers, doctors, pharmacists, lawyers, and other health-related personnel. Examples include the U.S. FDA Adverse Event Reporting System (FAERS, formerly AERS) [14] run by the FDA, the Canada Vigilance Program [18] run by Health Canada, the U.K. Yellow Card Scheme [44] run by the Medicines and Healthcare products Regulatory Agency (MHRA), and the National Reporting System of Adverse Drug Reaction in Taiwan [33].
Table 2.1 The drug development process [14].
Process                         Purpose                                                               Subjects
Drug discovery                  Search for new drug targets                                           Laboratory, cell lines, and animals
Pretrial                        Safety and biological activity experiments                            Laboratory and animals
IND (Investigational New Drug)  FDA examines data
Phase I clinical trial          Safety and dose confirmation                                          20-80 healthy volunteers
Phase II clinical trial         Effectiveness and adverse reactions                                   100-300 patient volunteers
Phase III clinical trial        Confirmation of effectiveness and monitoring of long-term reactions   1000-3000 patient volunteers
NDA (New Drug Application)      Application for listing and FDA review
Phase IV clinical trial         Long-term safety monitoring after listing
Since our work in this study was conducted using the FAERS database, we describe this database in more detail. The FAERS database is designed to support the FDA’s postmarketing safety surveillance program and contains all adverse event reports submitted to the FDA. The data in this database have been published quarterly [14] since 2004.
The schema of the FAERS database, shown in Figure 2.1, is composed of seven relational data tables connected through the key ISR, an identifier for each report.
Figure 2.1 The schema of FAERS database.
(1) DEMO: Contains patient demographic and administrative information, a single record for each event report.
(2) DRUG: Contains drug/biologic information for as many medications as were reported for the event (1 or more per event).
(3) REAC: Contains all "Medical Dictionary for Regulatory Activities" (MedDRA [31]) terms coded for the adverse event (1 or more).
(4) OUTC: Contains patient outcomes for the event (0 or more).
(5) RPSR: Contains report sources for event (0 or more).
(6) THER: Contains drug therapy start dates and end dates for the reported drugs (0 or more per drug per event).
(7) INDI: Contains all MedDRA terms coded for the indications for use (diagnoses) for the reported drugs (0 or more per drug per event).
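The way the tables are linked through ISR can be illustrated with a small in-memory database. This is only a sketch under stated assumptions: it models three of the seven tables, the column names DRUGNAME and PT are assumed for illustration, and every row is made up rather than taken from FAERS.

```python
import sqlite3

# Toy in-memory stand-in for three of the seven FAERS tables.
# Column names DRUGNAME and PT are assumptions; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DEMO (ISR INTEGER PRIMARY KEY, "CASE" INTEGER, I_F_COD TEXT);
    CREATE TABLE DRUG (ISR INTEGER, DRUGNAME TEXT);
    CREATE TABLE REAC (ISR INTEGER, PT TEXT);
    INSERT INTO DEMO VALUES (1001, 500, 'I'), (1002, 500, 'F');
    INSERT INTO DRUG VALUES (1001, 'DRUG_A'), (1001, 'DRUG_B'), (1002, 'DRUG_A');
    INSERT INTO REAC VALUES (1001, 'NAUSEA'), (1002, 'NAUSEA'), (1002, 'RASH');
""")


def drug_reaction_pairs(connection):
    """Return (ISR, drug, reaction) triples by joining the tables on ISR."""
    query = """
        SELECT d.ISR, g.DRUGNAME, r.PT
        FROM DEMO AS d
        JOIN DRUG AS g ON g.ISR = d.ISR
        JOIN REAC AS r ON r.ISR = d.ISR
        ORDER BY d.ISR, g.DRUGNAME, r.PT
    """
    return connection.execute(query).fetchall()


pairs = drug_reaction_pairs(conn)
for row in pairs:
    print(row)
```

Because a report may list several drugs and several reactions, the join yields one row per drug-reaction combination within each report, which is the unit counted by the ADR measures in Section 2.2.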
2.2 Measures for ADR Detection
The task of adverse drug reaction detection relies heavily on a judicious choice of measures. Adverse drug reaction detection measures can be divided into two categories: measures of disproportionality and Bayesian methods.
Two primary measures of disproportionality are the Proportional Reporting Ratio (PRR) [13] and the Reporting Odds Ratio (ROR) [37]. PRR is the proportion of ADR reports for a given drug that involve a specific adverse reaction, divided by the corresponding proportion for all other drugs in the database. ROR is the ratio of reports of a specific adverse reaction for the suspected drug to those for all other drugs, divided by the corresponding ratio for all other adverse reactions. Both can be computed from a 2×2 contingency table, shown in Table 2.2. The PRR and ROR measures are defined as follows:
PRR ≡ [a / (a + b)] / [c / (c + d)]
ROR ≡ (a / c) / (b / d)
Table 2.2 The 2×2 contingency table for ADR measurement.
                  Suspected ADR   Without the Suspected ADR   Total
Suspected Drugs   a               b                           a + b
Other Drugs       c               d                           c + d
Total             a + c           b + d                       N = a + b + c + d
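As a minimal sketch, the PRR and ROR definitions can be evaluated directly from the four cell counts of Table 2.2. The counts used below are invented for illustration only, and the functions assume that no denominator is zero.

```python
def prr(a, b, c, d):
    """Proportional Reporting Ratio: [a / (a + b)] / [c / (c + d)]."""
    return (a / (a + b)) / (c / (c + d))


def ror(a, b, c, d):
    """Reporting Odds Ratio: (a / c) / (b / d)."""
    return (a / c) / (b / d)


# Hypothetical counts: 20 reports with the suspected drug and the suspected ADR,
# 80 with the drug but without the ADR, 10 and 890 for all other drugs.
a, b, c, d = 20, 80, 10, 890
print("PRR =", prr(a, b, c, d))
print("ROR =", ror(a, b, c, d))
```

Both measures compare how often the suspected ADR is reported with the suspected drug against how often it is reported with all other drugs, so values well above 1 indicate disproportionate reporting.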
The best-known Bayesian method is the Bayesian Confidence Propagation Neural Network (BCPNN) [27], which implements Bayesian statistics in a neural network architecture and calculates a measure, called the information component (IC), that quantifies the degree of association between two variables. Suppose x is a drug and y is an ADR. The IC measure is defined as follows:
IC = log2 [p(x, y) / (p(x) p(y))] ≡ log2 [a(a + b + c + d) / ((a + b)(a + c))]
where p(x) is the probability of drug x, p(y) is the probability of ADR y, and p(x, y) is the probability that drug x and ADR y appear together in the ADR reports. The drug is considered highly associated with the ADR if the IC value of the drug-ADR pair is higher than a threshold.
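The count-based form of the IC can be sketched as follows. This is only the point estimate given by the formula above, computed from the cells of Table 2.2; it does not model the Bayesian machinery of the full BCPNN, and the counts are invented for illustration.

```python
import math


def information_component(a, b, c, d):
    """IC = log2 [a * N / ((a + b) * (a + c))], with N = a + b + c + d."""
    n = a + b + c + d
    return math.log2(a * n / ((a + b) * (a + c)))


# Hypothetical counts with a strong drug-ADR association.
ic_associated = information_component(20, 80, 10, 890)

# Hypothetical counts where drug and ADR are independent:
# p(x, y) equals p(x) * p(y), so the IC is zero.
ic_independent = information_component(10, 90, 90, 810)

print(ic_associated, ic_independent)
```

An IC of zero means that the drug and the ADR co-occur exactly as often as expected under independence, while positive values indicate that the pair is reported together more often than expected.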
Table 2.3 summarizes contemporary ADR measures and the thresholds used in the pharmacovigilance community for detecting ADRs.
Table 2.3 A summary of contemporary ADR measures.
Measure Formula Threshold
Proportional
Chapter 3 Related Work
3.1 Duplicate Detection
The problem of duplicate detection, also known as record linkage, has long been studied in the statistics community. For example, a study conducted by the U.S. Census Bureau in 1985 considered the integration of different census units [22] and developed a duplicate detection technique to determine whether two records from different units refer to the same person.
Earlier work on duplicate detection mainly focused on single field matching, that is, determining whether two fields (attributes) refer to the same value. Different types of items require different detection methods [11]. Character-based similarity metrics are designed to handle typographical errors in fields such as name and address. Examples include edit distance [28], affine gap distance [48], Smith-Waterman distance [39], the Jaro distance metric [49], and Q-gram distance [45]. Methods for measuring the similarity of numeric attributes, e.g., height and weight, are rather primitive. Usually, the numerical data are treated as strings and compared using the metrics described above.
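As an illustration of a character-based metric, the following is a minimal sketch of the edit (Levenshtein) distance computed by dynamic programming. The normalization into a similarity score in [0, 1] is one common convention, chosen here as an assumption rather than taken from the cited work.

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string s into string t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(s, t):
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


print(edit_distance("kitten", "sitting"))
print(edit_similarity("Jon Smith", "John Smith"))
```

A single typographical error changes the distance by at most one, which is why such metrics tolerate misspelled names and addresses.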
In the real world, records consist of multiple types of fields, so many studies have proposed methods for matching records with multiple fields.
These approaches can be broadly divided into two categories [11]: (1) probabilistic approaches and supervised machine learning techniques, and (2) approaches based on domain knowledge or distance metrics. The former require training data, while the latter do not.
The main idea of probabilistic approaches is to use training record pairs, divided into two classes, M (Matching) and U (Unmatching), to derive the class probabilities, and then to perform Bayesian inference to determine the classes of unknown record pairs. Supervised machine learning techniques usually transform the training data into record pairs labeled as matching or unmatching, and then apply any of the classification methods developed in the machine learning community, such as decision trees, SVM, neural networks, or KNN, to solve the problem.
Unlike the first category, the domain knowledge or distance metrics based approaches require no training data. A commonly used distance-based approach is to measure the similarity between individual attributes, using the appropriate metrics described previously, and then combine these similarities to measure the similarity of two records. A threshold is set to determine the matching of the two records.
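The distance-based approach described above can be sketched as a weighted combination of per-attribute similarities followed by a threshold test. The field names, weights, and threshold below are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for any of the string metrics mentioned earlier.

```python
from difflib import SequenceMatcher


def field_similarity(x, y):
    """Similarity in [0, 1] between two field values; missing values score 0."""
    if x is None or y is None:
        return 0.0
    return SequenceMatcher(None, str(x), str(y)).ratio()


def record_similarity(r1, r2, weights):
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1.get(f), r2.get(f))
               for f, w in weights.items()) / total


def is_match(r1, r2, weights, threshold=0.85):
    """Declare two records a match when their similarity reaches the threshold."""
    return record_similarity(r1, r2, weights) >= threshold


# Hypothetical records and weights, for illustration only.
weights = {"name": 2.0, "city": 1.0}
r1 = {"name": "John Smith", "city": "Boston"}
r2 = {"name": "Jon Smith", "city": "Boston"}
r3 = {"name": "Maria Garcia", "city": "Denver"}

print(is_match(r1, r2, weights), is_match(r1, r3, weights))
```

In practice the choice of weights and threshold encodes domain knowledge, which is exactly what replaces the training data required by the first category of approaches.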
Although much work has been conducted on duplicate record detection, very little of it has been devoted to datasets of adverse drug reactions. To the best of our knowledge, the only work on developing duplicate detection methods tailored to the domain of adverse drug reactions is the study by Noren et al. [34]. They proposed a modified hit-miss model for automated duplicate detection in the WHO drug safety database. Their method, however, focuses only on identifying highly similar record pairs and does not take into account the existence of follow-up reports; it is thus unable to discriminate real duplicates from follow-up linkage.
In summary, very little of the literature so far has been devoted to duplicate detection in adverse drug reaction reporting systems, and none of it has considered the existence of follow-up reports.
3.2 Classification
Classification is the task of assigning objects with unknown class labels to one of a set of predefined classes. In general, this task is achieved by learning a model from a set of data with known class labels, called the training set. The learned classification model can serve two different purposes: analysis and prediction. Analysis refers to exploring the factors that influence data classification; for example, from the established model, we can generate classification rules that present the factors affecting data classification. Prediction refers to using the model to predict the class label of unknown data.
A classification technique is a systematic approach to building classification models from an input data set. A typical process of classification is shown in Figure 3.1. First, the model is built from a training set. Second, the performance of the model is evaluated using a test set. Once the performance of the classification model meets the requirements, we can start using the model to predict the labels of new data.
Many classification algorithms have been proposed in the literature, for example, decision trees, rule-based methods, Nearest-Neighbor, Bayesian methods, Artificial Neural Networks (ANN), and Support Vector Machines (SVM). Most of them are clearly described in textbooks [25][43]. In the following subsections, we focus on two more recently developed branches of classification methods, ensemble learning and co-training.
Figure 3.1 A general process for building a classification model.
3.2.1 Ensemble learning
The idea of ensemble learning is to build multiple classifiers from the training data and to predict the label of new data by aggregating (voting on) the predictions made by these classifiers. As shown in Figure 3.2, ensemble learning first creates multiple subsets s1, s2, ..., sk from the original training data D, then builds from each subset si, 1 ≤ i ≤ k, a classifier ci with weight wi, and finally produces the overall classifier C(X) = w1c1(X) + ... + wkck(X), where X denotes an example. Many experimental results and research reports have shown that ensemble learning usually yields more accurate results than any single classifier. For example, Freund and Schapire in 1996 tested 22 benchmark problems. Their results showed that with the ensemble method one of the problems exhibited little improvement, four of the problems were relatively poor, and the other 18 problems received significant improvement [17].
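The weighted combination C(X) = w1c1(X) + ... + wkck(X) can be sketched for a two-class problem in which each classifier outputs +1 or -1 and the sign of the weighted sum gives the ensemble label. The three toy classifiers and their weights below are hypothetical and serve only to show the voting step.

```python
def first_feature_positive(x):
    return 1 if x[0] > 0 else -1


def second_feature_positive(x):
    return 1 if x[1] > 0 else -1


def sum_above_one(x):
    return 1 if x[0] + x[1] > 1 else -1


def ensemble_predict(classifiers, weights, x):
    """Sign of the weighted sum of the individual +1/-1 predictions."""
    score = sum(w * c(x) for c, w in zip(classifiers, weights))
    return 1 if score >= 0 else -1


# Hypothetical classifiers and weights, for illustration only.
classifiers = [first_feature_positive, second_feature_positive, sum_above_one]
weights = [0.5, 0.3, 0.4]

# For (1, -1) the two lower-weighted classifiers jointly outvote the first one.
print(ensemble_predict(classifiers, weights, (1, -1)))
print(ensemble_predict(classifiers, weights, (2, 0.5)))
```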
Figure 3.2 The basic paradigm of ensemble learning.
Conceptually, there are two types of ensemble strategy, homogeneous ensemble and heterogeneous ensemble, as shown in Figure 3.3. The homogeneous ensemble applies one base learning algorithm, for example, the decision tree method (J48 in Figure 3.3(a)), to different training subsets to construct multiple classifiers, and then combines them into a single classifier by assigning higher weights to more accurate classifiers and lower weights to less accurate ones. The performance of this type of ensemble learning relies heavily on the way the multiple training subsets are constructed, i.e., step 1 in Figure 3.1. The two most widely used approaches are bagging [44][5] and boosting [16][38]. The bagging approach employs bootstrap sampling to obtain the training subsets. That is, each subset is constructed by sampling with replacement, with a size equal to that of the original training data. Random Forests [6] is the best-known ensemble learning method adopting this approach. The boosting approach employs an iterative procedure to adaptively change the weight of each training example, i.e., the probability that a training example is selected for inclusion in each subset si. Initially, all training examples receive equal weights; the first subset is constructed, and so is the first classifier. In each subsequent iteration, the weights of the examples wrongly classified in the previous iteration are increased, while the weights of those correctly classified are decreased.
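The two ways of constructing training subsets can be sketched as follows. The bagging part draws bootstrap samples with replacement; the boosting part shows a single reweighting step with a fixed multiplicative factor, which is a deliberate simplification chosen for illustration and not the exact update rule of any particular boosting algorithm.

```python
import random


def bootstrap_subsets(data, k, seed=0):
    """Bagging: k subsets, each sampled with replacement and as large as data."""
    rng = random.Random(seed)
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(k)]


def reweight(weights, misclassified, factor=2.0):
    """Boosting-style step: raise the weights of misclassified examples,
    lower the others, then renormalize so that the weights sum to 1.
    The fixed factor is a hypothetical choice."""
    updated = [w * factor if wrong else w / factor
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated]


data = list(range(10))
subsets = bootstrap_subsets(data, k=3)

# Four examples start with equal weights; only the first was misclassified.
new_weights = reweight([0.25, 0.25, 0.25, 0.25], [True, False, False, False])
print(subsets[0])
print(new_weights)
```

After the reweighting step, the misclassified example is more likely to be drawn into the next subset, so the next classifier concentrates on the examples that the previous one got wrong.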
The heterogeneous ensemble builds each classifier with a different learning algorithm (similar to a multi-expert system), executed on either the same training set or different subsets. By integrating the separate hypotheses and diverse characteristics embedded in each learning algorithm, the resulting classifier can usually yield higher predictive accuracy than any single classifier built by a specific learning algorithm.
Figure 3.3(b) shows a heterogeneous ensemble composed of four different classifiers built from four different representative learning algorithms, including Bayes (Naïve Bayes), IBk (Instance based learning), JRip (Learning rules by induction), and J48 (Decision tree learning).
Figure 3.3 (a) The concept of homogeneous ensemble and (b) the concept of heterogeneous ensemble.
3.2.2 Co-training learning
Co-training was introduced by Blum and Mitchell in 1998 [4] to build learning models from the data set with very small amounts of labeled examples and large amounts of unlabeled data. The original model was developed based on the assumption that there exist two different views (characterization or feature sets of data) to classify the data and these two views are conditionally independent. For example, a web page can be classified according to the words occurring at that page (one view) or the words occurring in hyperlinks pointing to that page (another view).
Co-training builds models in the following way. Initially, each view (set of features) is used to build a classifier from the labeled data, resulting in two separate classifiers. In subsequent iterations, the labeled data is augmented with unlabeled data whose selection and labels are determined cooperatively according to the results of the two classifiers. Then the two classifiers are retrained using the expanded training data.
The entire process continues until the performance of the classifiers converges or no new unlabeled data can be selected. In short, co-training uses the unlabeled data to bootstrap the classifiers, helping them to achieve better classification results.
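The iterative procedure can be sketched with two very simple classifiers, one per view. Everything in this sketch is an assumption made for illustration: each view is a single number, each classifier is a nearest-centroid rule, and in every round each classifier labels only the one unlabeled example it is most confident about. It is a simplification of the general scheme, not the original algorithm, and the data are made up.

```python
def fit_centroids(values, labels):
    """Per-class mean of a single numeric view."""
    sums, counts = {}, {}
    for v, y in zip(values, labels):
        sums[y] = sums.get(y, 0.0) + v
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, v):
    """Return (label, margin): the nearest centroid and the gap to the runner-up."""
    ranked = sorted(centroids, key=lambda y: abs(v - centroids[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, 0.0
    return best, abs(v - centroids[ranked[1]]) - abs(v - centroids[best])


def co_train(labeled, unlabeled, rounds=10):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            model = fit_centroids([x[view] for x, _ in labeled],
                                  [y for _, y in labeled])
            # Let this view's classifier label the pool example it trusts most.
            scored = [(predict_with_confidence(model, x[view]), x) for x in pool]
            (label, _), chosen = max(scored, key=lambda item: item[0][1])
            labeled.append((chosen, label))
            pool.remove(chosen)
    models = [fit_centroids([x[v] for x, _ in labeled], [y for _, y in labeled])
              for v in (0, 1)]
    return models, labeled


# Made-up data: two labeled examples and four unlabeled ones.
labeled = [((0.0, 0.1), "neg"), ((1.0, 0.9), "pos")]
unlabeled = [(0.1, 0.0), (0.9, 1.0), (0.2, 0.3), (0.8, 0.7)]
models, final = co_train(labeled, unlabeled)
print(final)
```

Each newly labeled example enters the shared training set, so a confident prediction made from one view also refines the classifier built on the other view in the next step.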
The rationale behind co-training is intuitive. A small amount of training data used to build the model is not necessarily representative of the entire data set. Judiciously adding appropriate unlabeled data to the training data therefore makes the training data more representative of the entire population.
Since the work of Blum and Mitchell, the co-training technique has been extended in several directions, including the number of classifiers [10][30][53], the use of multiple learning algorithms [42][52], and feature selection [47][50].
All extensions retain the core idea of co-training: exploiting the characteristics of large amounts of unlabeled data to improve the accuracy of models.
Chapter 4 Duplicate Detection in FAERS Dataset
4.1 Characteristics of FAERS Dataset
FAERS is a database designed to help the FDA monitor the safety of drugs after listing; it contains reported information on adverse events and medication errors. In the United States, healthcare professionals (such as doctors, pharmacists, and nurses) and consumers (such as patients, their families, and lawyers) report adverse events and medication errors voluntarily. These reports may also be submitted to the manufacturers. The manufacturers, however, are obligated by FDA regulation to regularly report to the FDA any adverse events associated with their medical products.
As described in Section 2.1, the FAERS database is composed of seven relational tables, in which the DEMO table is the master file that records information directly related to patients and the report itself. There are in total 23 attributes in the DEMO file: ISR, CASE, I_F_COD, FOLL_SEQ, IMAGE, EVENT_DT, MFR_DT, FDA_DT, REPT_COD, MFR_NUM, MFR_SNDR, AGE, AGE_COD, GNDR_COD, E_SUB, WT, WT_COD, REPT_DT, OCCP_COD, DEATH_DT, TO_MFR, CONFID, and REPORTER_COUNTRY. The detailed meaning of each attribute is shown in Table 4.1. We particularly highlight some attributes that are important to the duplication problem.
ISR (primary key) is the unique number identifying an FAERS report, while CASE is the number identifying an FAERS case. In other words, an FAERS case (event) may have several different reports, i.e., different ISRs. The main reason behind this phenomenon is that an initial report may be followed by one or more follow-up reports referring to the same FAERS case. This information is recorded in the I_F_COD field, with “I” denoting initial and “F” denoting follow-up.
Table 4.1 Detailed description of table DEMO.
Field name         Null probability (%, 07Q2)   Explanation
ISR                0                            Unique number for identifying an FAERS report.
CASE               0                            Number for identifying an FAERS case.
I_F_COD            0                            Initial or follow-up status of report.
FOLL_SEQ           93.1                         The sequence number of a follow-up report.
IMAGE              0                            Identifier for an FAERS report image.
EVENT_DT           31.3                         Date adverse event occurred or began.
MFR_DT             6.9                          Date manufacturer received information.
FDA_DT             0                            Date FDA received report.
REPT_COD           0                            Type of report submitted.
MFR_NUM            6.9                          Manufacturer's unique report identifier.
MFR_SNDR           6.9                          Verbatim name of manufacturer sending report.
AGE                38.8                         Age.
AGE_COD            38.8                         Unit.
GNDR_COD           5.8                          Gender.
E_SUB              0                            This report was submitted under the electronic submissions procedure.
WT                 65.9                         Weight.
WT_COD             65.9                         Unit.
REPT_DT            ≈0                           Date report was sent.
OCCP_COD           16.3                         Reporter's type of occupation.
DEATH_DT           92.7                         This field remains but is no longer populated with data from 2010.
TO_MFR             93.1                         Voluntary reporter also notified manufacturer.
CONFID             93.1                         Voluntary’s identity should not be disclosed to the product manufacturer.
REPORTER_COUNTRY   0                            Reporters’ country.
There are many missing values in the FAERS database. As an illustration, we analyzed the dataset of 2007Q2, computing for each attribute the probability of a null value. The statistics are also listed in Table 4.1. There are four attributes whose null probabilities are over 90%: FOLL_SEQ (the sequence number of a follow-up report), DEATH_DT (the death date of the patient), TO_MFR (whether or not the reporter also notified the manufacturer), and CONFID (whether or not the voluntary reporter's identity should be withheld from the manufacturer). Since null values cause many data analysis problems, we describe in Section 5.1 our approaches for handling missing values.
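A null probability of the kind listed in Table 4.1 can be computed as the percentage of records in which a field is empty. The sketch below assumes that each record is a dictionary and that both None and the empty string count as null; the four rows are made up and do not come from the 2007Q2 data.

```python
def null_percentages(rows, fields):
    """Percentage of rows in which each field is missing (None or empty string)."""
    n = len(rows)
    return {f: 100.0 * sum(1 for r in rows if r.get(f) in (None, "")) / n
            for f in fields}


# Made-up DEMO-like rows, for illustration only.
rows = [
    {"ISR": 1, "AGE": "34", "WT": ""},
    {"ISR": 2, "AGE": None, "WT": ""},
    {"ISR": 3, "AGE": "61", "WT": "70"},
    {"ISR": 4, "AGE": "", "WT": None},
]
stats = null_percentages(rows, ["ISR", "AGE", "WT"])
print(stats)
```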
4.2 Duplicate Reporting Problem in FAERS Dataset
As described in Section 4.1, healthcare professionals and consumers can voluntarily report adverse events and medication errors directly to the FDA, and these reports can also be submitted to the manufacturers. In addition, according to the guidance released by the FDA [15], any drug manufacturer that becomes aware of an adverse event involving one of its products is obligated to report that event.
Hence, if an event involves several drugs (which is usually the case), the FDA may receive multiple reports referring to the same case, but from different manufacturers.
All of the situations described above would cause duplicate records in the FAERS database, which may or may not be identified by the FDA. For example, Table 4.2 is a sample dataset extracted from FAERS. Records #12 and #13 have the same CASE no.
but different ISR no., and both are recorded as initial reports. In other words, these two