Chapter 1 Introduction
1.3 Thesis Organization
The remaining chapters of this thesis are organized as follows. Chapter 2 describes the related background, including the concept of an ADR signal, the measures used for detecting ADRs, the concept of OLAP, and the data cube technology that facilitates OLAP operations. Chapter 3 reviews related work on ADR detection from two perspectives: statistical approaches and data mining approaches. The statistical approach is mainly based on Bayes' theorem, while the data mining approach is based on classification or association analysis.
Chapter 4 describes the details of our proposed data cube, called the MDC cube. We first elaborate the deficiency of using the contingency cube to detect adverse drug interactions, and then present two different representations of the MDC cube. We also discuss how to efficiently compute the content of MDC cubes from the raw input data and how to detect adverse drug interactions with the two proposed types of MDC cubes. Chapter 5 presents the experiments we conducted on the AERS dataset, which consist of two parts. The first part evaluates the performance of the proposed MDC cube based methods in detecting adverse drug interactions. The second part compares the execution time needed to compute, and the space required to store, the two proposed types of MDC cubes. Finally, conclusions and future work are described in Chapter 6.
Chapter 2 Background
In general, an ADR signal can be regarded as an association between drugs and symptoms. Formally, it can be represented as
Predicate 1, Predicate 2, …, Predicate N, Drug 1, Drug 2, …, Drug T → Symptom
For example, the ADR “taking ‘Accutane’ to treat a skin problem (acne) may lead to the birth of an abnormal baby” can be represented as
Gender = Female, Drug = Accutane → Symptom = Abnormal baby.
In the remainder of this chapter, we first describe contemporary measures for ADR signals, and then present OLAP and data cube technology.
2.1 Measures for ADR Detection
In the pharmaceutical industry, the task of adverse drug reaction detection is also called pharmacovigilance [27]. In recent years, many measures have been proposed for pharmacovigilance. Almost all of them require counting occurrences over large volumes of adverse event reports, and they can be divided into two categories: measures of disproportionality and Bayesian methods.
In the first category, the measures of disproportionality are usually computed from a 2×2 contingency table. Two statistical measures are in common use: the Proportional Reporting Ratio (PRR) and the Reporting Odds Ratio (ROR). The PRR measure is used by the U.K. Yellow Card database [5] and the U.K. Medicines and Healthcare products Regulatory Agency (MHRA) [5], while the ROR measure is used by the Netherlands Pharmacovigilance Foundation [4].
PRR refers to the proportion of ADR reports for a given drug that are related to a specific adverse reaction, divided by the corresponding proportion for all other drugs in the database.
ROR refers to the ratio of reports of a specific adverse reaction for the suspected drug to those for all other drugs, divided by the corresponding ratio for all other adverse reactions. Consider the 2×2 contingency table in Table 2-1. The PRR and ROR measures can be defined as follows:
$$ \mathrm{PRR} = \frac{a/(a+b)}{c/(c+d)}, \qquad \mathrm{ROR} = \frac{a/c}{b/d} = \frac{ad}{bc} $$
Table 2-1. The 2×2 contingency table for ADR measurement.
                   Suspected ADR    Without the suspected ADR    Total
Suspected drugs    a                b                            a + b
Other drugs        c                d                            c + d
Total              a + c            b + d                        N = a + b + c + d
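As a worked illustration of the two disproportionality measures, the following minimal Python sketch computes PRR and ROR from the four cells of Table 2-1, using the usual definitions PRR = [a/(a+b)] / [c/(c+d)] and ROR = (a/c) / (b/d). The function names and the sample counts are hypothetical, and the sketch does not guard against zero cells.

```python
def prr(a: int, b: int, c: int, d: int) -> float:
    """Proportional Reporting Ratio: [a / (a + b)] / [c / (c + d)]."""
    return (a / (a + b)) / (c / (c + d))


def ror(a: int, b: int, c: int, d: int) -> float:
    """Reporting Odds Ratio: (a / c) / (b / d), i.e. (a * d) / (b * c)."""
    return (a * d) / (b * c)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
a, b, c, d = 20, 80, 10, 890
prr_value = prr(a, b, c, d)  # (20/100) / (10/900), roughly 18
ror_value = ror(a, b, c, d)  # (20*890) / (80*10)
print(prr_value, ror_value)
```

Both measures exceed 1 when the suspected drug is reported with the reaction more often than the other drugs are; a real detector would compare them against the thresholds used by the respective agencies.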
Measures in the second category are mainly used by Bayesian methods. The most well-known is the Bayesian Confidence Propagation Neural Network (BCPNN) used by the WHO [1][14]. This approach implements Bayesian statistics in a neural network architecture and calculates a measure called the information component (IC), which measures the degree of association between two variables. Let x represent a drug and y represent an ADR. The IC measure is defined as follows:
$$ IC = \log_2 \frac{p(x, y)}{p(x)\,p(y)} $$
where p(x) is the probability of drug x, p(y) is the probability of ADR y, and p(x, y) is the probability that drug x and ADR y appear together in the ADR reports. If the IC value of a drug-ADR pair is higher than a threshold, the drug is regarded as having a significant association with the ADR.
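To make the IC definition concrete, the sketch below evaluates IC = log2( p(x, y) / (p(x) p(y)) ) from raw report counts. It is a simplification under stated assumptions: the probabilities are plain relative frequencies, whereas BCPNN itself uses Bayesian estimates, and the function name and counts are hypothetical.

```python
import math


def information_component(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """IC = log2( p(x, y) / (p(x) * p(y)) ), with probabilities estimated
    as simple relative frequencies (an assumption of this sketch)."""
    p_x = n_x / n_total
    p_y = n_y / n_total
    p_xy = n_xy / n_total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: 1000 reports, 100 mention drug x, 30 mention ADR y,
# and 20 mention both.
ic = information_component(n_xy=20, n_x=100, n_y=30, n_total=1000)
print(round(ic, 3))
```

An IC near zero means the drug and the ADR co-occur about as often as independence would predict; larger positive values indicate a stronger association.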
In summary, Table 2-2 lists the contemporary ADR measures and the thresholds used in the pharmacovigilance community for detecting ADRs.
Table 2-2. A summary of contemporary ADR measures.
Measure Formula Threshold
Proportional
(SE: Standard Error, SD: Standard Deviation)
2.2 OLAP and Data Cube
On-Line Analytical Processing (OLAP), a term coined by Codd et al. [20], has become the major analysis tool for data warehouse systems. An OLAP tool provides four basic analytical operations: roll-up, drill-down, pivoting, and slicing and dicing. Together they give users easier and faster interaction with the data warehouse, allowing them to examine the data from different aggregation viewpoints. To facilitate interactive analysis, the OLAP tool embodies a pre-computation technology that pre-computes the aggregation results of the warehouse data and stores them in a multidimensional view, called a data cube (or OLAP cube). A data cube can be viewed as a multidimensional array (or table) in which each dimension represents a viewpoint from which analysts perceive the meaning of the numerical values (facts) stored in the cube cells. For example, Figure 2-1 illustrates an ADR occurrence data cube composed of three dimensions: Drug, Symptom, and Patient. Each cell stores the number of reports in the reporting system that describe a specific adverse reaction observed in a specific group of patients taking a specific drug.
[Figure 2-1: cube with axes Symptom (s1 to s4), Drug (d1 to d3), and Patient (p1 to p4)]
Figure 2-1. An ADR occurrence data cube composed of three dimensions, Drug, Symptom and Patient.
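The pre-computation idea can be sketched in a few lines of Python. The example assumes each report carries a single atomic value per dimension and uses "*" as a marker for a rolled-up dimension; the function name and the four sample reports are hypothetical.

```python
from collections import Counter
from itertools import product


def build_count_cube(reports):
    """Pre-compute the report count for every cell, where each dimension is
    either a concrete value or "*" (rolled up over all values)."""
    cube = Counter()
    for report in reports:
        # Every report contributes to 2^k cells: each of its k dimension
        # values is either kept or replaced by "*".
        for cell in product(*[(value, "*") for value in report]):
            cube[cell] += 1
    return cube


# Hypothetical reports as (drug, symptom, patient group) triples.
reports = [
    ("d1", "s1", "p1"),
    ("d1", "s1", "p2"),
    ("d2", "s1", "p1"),
    ("d1", "s2", "p1"),
]
cube = build_count_cube(reports)
print(cube[("d1", "s1", "*")])  # reports with d1 and s1, any patient group: 2
print(cube[("*", "s1", "*")])   # roll-up over Drug and Patient: 3
```

Once the cube is built, a roll-up or slice is a dictionary lookup rather than a scan of the raw reports, which is what makes interactive analysis feasible.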
Chapter 3
Previous Work for ADR Detection
3.1 Statistical Approach
Huang et al. [8] proposed a statistical method, based on the chi-square test and conditional probability, to discover ADRs caused by drug-drug interactions. They first use the chi-square test to measure the dependency of all drug and symptom pairs. The chi-square test, however, only shows the relative strength of an association; it cannot distinguish whether the ADR is caused by a drug-drug interaction or by a single drug. They therefore use conditional probability to resolve this problem.
The chi-square test has many uses, such as testing independence, goodness of fit, significance of change, and homogeneity [3]. Huang et al. extend the chi-square method to look for sets of drugs that are associated with each symptom. The essence of the chi-square test is to compare the observed frequencies in the table with the frequencies expected under independence; the test statistic is as follows:
$$ X^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
where i and j index the records and attributes (rows and columns) of the data table, Oij is the observed value for the i-th record and j-th attribute, and Eij is the expected value derived from the null hypothesis. If the data set is large, X² approximately follows the χ² distribution.
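The statistic sums (Oij − Eij)² / Eij over all cells, and it can be computed directly from a table of observed counts. The sketch below assumes the usual independence model, in which each expected count is (row total × column total) / grand total; the function name and the sample table are hypothetical, and the code only produces the statistic, not a p-value.

```python
def chi_square_statistic(observed):
    """Sum of (O_ij - E_ij)^2 / E_ij over all cells, where E_ij is the count
    expected under independence: row_total * column_total / grand_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / total
            statistic += (o_ij - e_ij) ** 2 / e_ij
    return statistic


# Hypothetical 2x2 table: rows = (drug, other drugs), columns = (ADR, no ADR).
table = [[20, 80], [10, 890]]
x2 = chi_square_statistic(table)
print(round(x2, 2))
```

A table whose rows are exactly proportional, such as [[10, 20], [20, 40]], yields a statistic of zero, which is the independence case the test compares against.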
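Conditional probability is then applied to the drug and symptom associations confirmed by the chi-square test. The goal is to separate the probability that a symptom arises when the drugs are taken together from the probability that it arises when each drug is taken alone, so that associations due to a drug interaction can be told apart from those due to a single drug.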
3.2 Data Mining Approach
3.2.1 ABCM-MS method
ABCM-MS is a variant of the associative classification method [14]: it first discovers all frequent itemsets and then constructs qualified classification rules (or a model) from them.
ABCM-MS adopts a data structure called the CR-tree, analogous to the FP-tree used in the well-known frequent itemset mining algorithm FP-growth [9], to store possible classification rules. The primary advantages of ABCM-MS are that the mining process does not have to generate candidate itemsets, and that it avoids unnecessary computation spent on counting items of no interest to the users; that is, it treats the mining attributes and drugs as item constraints of the association rules [13] and prunes items that do not conform to the constraints.
ABCM-MS consists of two stages: an offline process and an online process. During the offline process, ABCM-MS first prunes infrequent items and classes from the transaction dataset and stores the results to disk. It then performs a transaction reduction by removing items that do not belong to the attributes within the user-specified risk patterns. In the online process, it first uses the reduced transactions to construct a CR-tree, which is then used to derive the ADR rules. Consider the example dataset in Table 3-1. The reduced transaction dataset and the CR-tree are shown in Table 3-2 and Figure 3-1, respectively.
Table 3-1. An example dataset of ADR reports consisting of three attributes, Age, Drug,
Table 3-2. The corresponding reduced transaction of Table 3-1. (threshold = 3)
TID Transaction Class label
1 Age > 60, d1 s1
Figure 3-1. The initial CR-tree generated from Table 3-1.
3.2.2 CBM-SS method
The core of the CBM-SS method is the concept of the contingency cube, a variant of the OLAP cube dedicated to storing the values used in the 2×2 contingency table. A contingency cube defined on the ADR reporting dataset is an n-dimensional cube C[A1, A2, …, An], where A1, A2, …, An−2 denote the predicate attributes, An−1 = Drug and An = PT. Except for An−1 and An, the value set associated with each dimension Ai is domain(Ai) ∪ {*}, where domain(Ai) denotes the set of distinct values of Ai and “*” denotes “any” or “don’t care”. Note that the value sets associated with Drug and PT do not include “*”, because a contingency table is always associated with a specific drug and symptom. Each cell C[a1, a2, …, an] in the cube conceptually stores the contingency table of Drug = an−1 and PT = an with predicates A1 = a1, A2 = a2, …, An−2 = an−2. For example, consider the following ADR rule:
R1: Gender = Female, Drug = Accutane → Symptom = abnormal baby.
The corresponding contingency table used for measuring the importance of this rule is shown in Table 3-3, and the corresponding 3-D contingency cube composed of Gender, Drug, and Symptom is shown in Figure 3-2.
Table 3-3. The 2×2 contingency table for identifying “Accutane - abnormal baby” constrained by Gender = Female.
Gender = Female    abnormal baby    All other ADRs
Accutane           a                b
All other drugs    c                d
[Figure 3-2: cube with axes Gender (g1, g2), Drug (d1 to d4), and PT (s1 to s4)]
Figure 3-2. An example 3-D contingency cube C(Gender, Drug, PT).
Intuitively, the values of each cell can be obtained by aggregating the occurrences in the data warehouse. In practice, however, the computation is more complicated. Computing the values b, c, and d in the contingency table involves the occurrences of negative items (the absence of the drug or of the symptom), which are not immediately accessible. To avoid counting negative items, CBM-SS stores only the occurrences associated with positive items and derives the other values with simple arithmetic. Given a contingency table composed of predicates P1 = v1, P2 = v2, …, Pk = vk, Drug = d, and PT = s, Table 3-4 details the formulas for computing the cell values a, b, c, and d, where for simplicity we use count(v1, v2, …, vk, d, s) to denote the number of occurrences of the itemset {P1 = v1, P2 = v2, …, Pk = vk, Drug = d, PT = s}. Each of the counts used in Table 3-4 corresponds to a cell in the OLAP data cube. Figure 3-3 shows the correspondence between the values of the contingency table and the OLAP cube cells for the rule R1.
Table 3-4. The formulas for computing each cell in 2×2 contingency table.
Cell    Formula
a       count(p1, p2, …, pn, d, s)
b       count(p1, p2, …, pn, d) – a
c       count(p1, p2, …, pn, s) – a
d       count(p1, p2, …, pn) – a – b – c
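The formulas in Table 3-4 translate directly into code. The sketch below takes the four positive-item counts as inputs and derives a, b, c, and d; the function and parameter names are hypothetical, as are the sample counts for a predicate such as Gender = g1.

```python
def contingency_cells(count_pds: int, count_pd: int, count_ps: int, count_p: int):
    """Derive (a, b, c, d) from positive-item counts only.

    count_pds = count(p1..pn, d, s), count_pd = count(p1..pn, d),
    count_ps  = count(p1..pn, s),    count_p  = count(p1..pn).
    """
    a = count_pds
    b = count_pd - a          # drug d reported without symptom s
    c = count_ps - a          # symptom s reported without drug d
    d = count_p - a - b - c   # neither drug d nor symptom s
    return a, b, c, d


# Hypothetical cube lookups: count(g1, d1, s1) = 20, count(g1, d1) = 100,
# count(g1, s1) = 30, count(g1) = 1000.
print(contingency_cells(20, 100, 30, 1000))  # (20, 80, 10, 890)
```

No report is ever tested for the absence of an item: every input is a count of reports that contain the listed items, which is exactly what the OLAP cube cells store.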
Figure 3-3. An illustration of correspondence between contingency table and OLAP cube.
Gender = g1    PT = s1              other PTs
drug = d1      count(g1, d1, s1)    count(g1, d1) – a
other drugs    count(g1, s1) – a    count(g1) – a – b – c
[Figure 3-3: OLAP cube with axes Drug (d1 to d3, *), Symptom (s1 to s3, *), and Gender (g1, g2, *)]
Chapter 4
Adverse Drug Interactions Detection with MDC Cube
4.1 Problems with Contingency Cube Based Approach
As stated in Section 3.2, most previously proposed methods are not suitable for online analysis and detection of ADR signals; the exception is the contingency cube based approach, CBM-SS. Nevertheless, CBM-SS can only detect ADRs caused by a single drug, and it may generate incorrect results when dealing with ADRs caused by drug interactions. The rationale is detailed below.
The deficiency is best explained with an example. Consider the data in Table 4-1, which consists of five reports involving three attributes: Gender, Drug, and PT (Symptom). The corresponding contingency cube, represented in the form of an OLAP cube, is shown in Figure 4-1. Now suppose we want to measure the following rule:
R2: g1, d1, d2 → s1
To do so, we need the value count(g1, d1, d2, s1), which unfortunately is not immediately available from the OLAP cube in Figure 4-1. One possible way is to take the minimum of the values stored in cells [g1, d1, s1] and [g1, d2, s1], i.e., min{count(g1, d1, s1), count(g1, d2, s1)} = 2.
This gives count(g1, d1, d2, s1) = 2, which is incorrect; the correct value is 1, because only the first report contains g1, d1, d2, and s1 together. The main reason for this incorrect result is that the contingency cube provides no mechanism for retaining the aggregation contributed by non-atomic dimensions. In other words, the atomic dimensional structure of the contingency cube cannot distinguish whether, for a specific dimension, an accumulation is contributed by a single dimensional value or by multiple values. In this example, Drug and PT are non-atomic dimensions. The cube therefore cannot tell that, among the reports accumulated in cells [g1, d1, s1] and [g1, d2, s1], only the first report contains d1 and d2 simultaneously; the other reports counted in those cells (reports 4 and 3, respectively) account for the extra increment in the estimate of count(g1, d1, d2, s1). In light of this deficiency, we propose a new type of contingency cube that embodies non-atomic dimensions in the cube structure. Its definition and implementation details are elaborated in the following sections.
Table 4-1. An example dataset of ADR reports.
Tid    Gender    Drug          PT
1      g1        d1, d2, d3    s1
2      g2        d2, d3        s2
3      g1        d2, d3        s1, s2
4      g1        d1, d3        s1, s3
5      g2        d2            s1
[Figure 4-1: OLAP cube with axes Drug (d1 to d3, *), Gender (g1, g2, *), and Symptom (s1 to s3, *)]
Figure 4-1. The corresponding 3-D contingency cube in OLAP cube format for the example in Table 4-1.
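The over-count described in Section 4.1 can be reproduced on the five reports of Table 4-1. The following sketch compares the minimum of the two single-drug counts with the true co-occurrence count; the dictionary layout and the helper name are assumptions made for illustration.

```python
# Table 4-1, with Drug and PT kept as sets because they are multivalued.
reports = [
    {"gender": "g1", "drugs": {"d1", "d2", "d3"}, "pts": {"s1"}},
    {"gender": "g2", "drugs": {"d2", "d3"}, "pts": {"s2"}},
    {"gender": "g1", "drugs": {"d2", "d3"}, "pts": {"s1", "s2"}},
    {"gender": "g1", "drugs": {"d1", "d3"}, "pts": {"s1", "s3"}},
    {"gender": "g2", "drugs": {"d2"}, "pts": {"s1"}},
]


def count(gender, drugs, pt):
    """Number of reports with this gender that mention ALL given drugs and the PT."""
    return sum(
        1
        for r in reports
        if r["gender"] == gender and set(drugs) <= r["drugs"] and pt in r["pts"]
    )


count_d1 = count("g1", {"d1"}, "s1")     # reports 1 and 4
count_d2 = count("g1", {"d2"}, "s1")     # reports 1 and 3
estimate = min(count_d1, count_d2)       # what the contingency cube can offer
actual = count("g1", {"d1", "d2"}, "s1")  # only report 1 has both drugs
print(estimate, actual)  # 2 1
```

The minimum is only an upper bound on the joint count; recovering the exact value requires information about which drugs appear together in the same report.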
4.2 Definition of MDC Cube
In light of the deficiency of the contingency cube, we propose a new type of contingency cube, the MDC cube, which embodies the concept of a multivalued dimension, i.e., a member of a dimension can be a set of items (multiple values). Formally, an MDC cube defined on the ADR reporting dataset is an n-dimensional cube MC[A1, A2, …, An], where A1, A2, …, An−2 denote the predicate attributes, An−1 = Drug and An = PT. Except for An−1 and An, the value set associated with each dimension Ai is domain(Ai) ∪ {*}, where domain(Ai) denotes the set of distinct values of Ai and “*” denotes “any” or “don’t care”. The domain of PT is the same as that in the contingency cube, whereas Drug is a multivalued dimension. Specifically, let D be the set of all drugs. Then the domain of the Drug dimension is the power set of D excluding the empty set, i.e., domain(Drug) = 2^D − {∅}. As before, each cell of an MDC cube conceptually stores the contingency table of Drug = an−1 and PT = an with predicates A1 = a1, A2 = a2, …, An−2 = an−2, but note that an−1 here may be a set of drugs, {d1, d2, …}. Let us consider the example in Table 4-1 again. The corresponding 3-D MDC cube is depicted in Figure 4-2, where the cell [g1, {d1, d2}, s4] stores the contingency table essential for evaluating the following ADR rule:
R3: Gender = g1, Drug = {d1, d2} → Symptom = s4.
Figure 4-2. The corresponding MDC Cube of the data set in Table 4-1.
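A multivalued Drug dimension can be sketched by keying each cell on a set of drugs. The example below counts, for the reports of Table 4-1, every non-empty drug subset that a report contains; it stores only the co-occurrence count of each cell (the value a), not the full contingency table, and the cap on subset size is an assumption added to keep the enumeration small.

```python
from collections import Counter
from itertools import combinations

# Table 4-1 as (gender, drugs, PTs).
REPORTS = [
    ("g1", ("d1", "d2", "d3"), ("s1",)),
    ("g2", ("d2", "d3"), ("s2",)),
    ("g1", ("d2", "d3"), ("s1", "s2")),
    ("g1", ("d1", "d3"), ("s1", "s3")),
    ("g2", ("d2",), ("s1",)),
]


def build_mdc_counts(reports, max_drugs=3):
    """Count reports per cell [gender or "*", drug set, PT]."""
    cube = Counter()
    for gender, drugs, pts in reports:
        for size in range(1, min(max_drugs, len(drugs)) + 1):
            # Each non-empty subset of the report's drugs is one Drug member.
            for combo in combinations(sorted(drugs), size):
                for pt in pts:
                    for g in (gender, "*"):
                        cube[(g, frozenset(combo), pt)] += 1
    return cube


mdc = build_mdc_counts(REPORTS)
print(mdc[("g1", frozenset({"d1", "d2"}), "s1")])  # 1
print(mdc[("*", frozenset({"d2", "d3"}), "s2")])   # 2
```

Because the drug set is part of the cell key, the count for {d1, d2} is stored exactly instead of being estimated from the single-drug cells.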
4.3 Structure Implementation of MDC Cube
In this section, we present how to implement the concept of the MDC cube using contemporary database technology. The primary concerns are the multivalued dimension and the cube cell. We propose two types of representation, which differ in the strategies used to implement the multivalued dimension and the cube cell. To differentiate them, we call them the type-1 structure and the type-2 structure.
4.3.1 Type-1 Structure
The first implementation, the type-1 structure, follows the OLAP cube implementation of the contingency cube, extended with a multivalued-type attribute to represent the multivalued dimension. Most database systems provide a string data type, which suffices for this purpose. Some object-relational DBMSs, such as Oracle and DB2, even support collection (set) data types, which provide a more convenient indexing scheme. Since the type-1 structure preserves two features of the OLAP cube, namely positive-valued dimensions and atomic cell values, the values a, b, c, and d in the contingency table are computed in the same way as described in Table 3-4. Let {p1, p2, …, pn, d, s} be the set of attributes involved in ADR detection, where p1, p2, …, pn denote predicate attributes, d denotes the drug, and s denotes the symptom. We then need to construct the following four groups of MDC cubes, where ** denotes any subset of {p1, p2, …, pn}.
• Group 1: MC[**, s, d]
• Group 2: MC[**, s]
• Group 3: MC[**, d]
• Group 4: MC[**]
For example, assume the set of predicates is {Age, Weight, Gender, Country}. In the type-1 structure, we will need to produce the following MDC cubes, where ** denotes any subset of {Age, Weight, Gender, Country}.
1. For calculating the value a in a 2×2 contingency table:
MC[**, Drug, Symptom]
Number of MDC cubes: 2^4 = 16
2. For calculating the value b in a 2×2 contingency table:
MC[**, Drug]
Number of MDC cubes: 2^4 = 16
3. For calculating the value c in a 2×2 contingency table:
MC[**, Symptom]
Number of MDC cubes: 2^4 = 16
4. For calculating the value d in a 2×2 contingency table:
MC[**]
Number of MDC cubes: 2^4 = 16
The total number of MDC cubes is 16 + 16 + 16 + 16 = 64. Figure 4-3 shows an example type-1 MDC cube composed of Gender, Drug, and Symptom.
Figure 4-3. A type-1 MDC cube, where the cell [{d1, d2}, g1, s1] saves the count of the rule R2.
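The count of 64 type-1 cubes follows from enumerating every subset of the four predicates for each of the four cube groups. A short sketch of that enumeration is given below; the function names and the tuple representation of a cube's dimension list are hypothetical.

```python
from itertools import combinations


def predicate_subsets(predicates):
    """All subsets of the predicate attributes, including the empty one."""
    return [
        subset
        for size in range(len(predicates) + 1)
        for subset in combinations(predicates, size)
    ]


def type1_cube_schemas(predicates):
    """Dimension lists of the cubes needed for each contingency value."""
    extra_dims = {
        "a": ("Drug", "Symptom"),  # MC[**, Drug, Symptom]
        "b": ("Drug",),            # MC[**, Drug]
        "c": ("Symptom",),         # MC[**, Symptom]
        "d": (),                   # MC[**]
    }
    subsets = predicate_subsets(predicates)
    return {
        cell: [subset + extra for subset in subsets]
        for cell, extra in extra_dims.items()
    }


schemas = type1_cube_schemas(("Age", "Weight", "Gender", "Country"))
total = sum(len(cubes) for cubes in schemas.values())
print(len(schemas["a"]), total)  # 16 64
```

The number of cubes doubles with every additional predicate attribute, which is the storage cost that motivates a more compact representation.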
4.3.2 Type-2 Structure
As described previously, the type-1 structure requires many MDC cubes, and each computation of a contingency table needs to access four different MDC cubes. To reduce the storage space and shorten the access time, we develop the type-2 structure. The main characteristic that distinguishes the type-2 structure from the type-1 structure is the composite cell value: each cube cell stores the four values a, b, c, and d of the contingency table. In this way, we only need to construct one set of MDC cubes, MC[**, s, d], where ** denotes any subset of {p1, p2, …, pn}.
Figure 4-4. A type-2 structure MDC cube composed of three dimensions, Gender, Drug and Symptom, where the cube cell [{d1, d2}, g1, s1] stores the four values in the corresponding contingency table for computing the measure of the rule R2.
For example, consider the previous example in Section 4.3.1. We only need to construct 2^4 = 16 MDC cubes. Figure 4-4 illustrates the corresponding type-2 MDC cube used for computing the measure of rule R2.
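A composite cell can be sketched as a small record holding a, b, c, and d together. The example below fills the cell [g1, {d1, d2}, s1] for rule R2 by applying the formulas of Table 3-4 to the reports of Table 4-1; the values are derived here and are not read from Figure 4-4, and the class and helper names are hypothetical.

```python
from typing import NamedTuple


class ContingencyCell(NamedTuple):
    """Composite cell value of a type-2 MDC cube."""
    a: int
    b: int
    c: int
    d: int


# Table 4-1 as (gender, drugs, PTs).
REPORTS = [
    ("g1", {"d1", "d2", "d3"}, {"s1"}),
    ("g2", {"d2", "d3"}, {"s2"}),
    ("g1", {"d2", "d3"}, {"s1", "s2"}),
    ("g1", {"d1", "d3"}, {"s1", "s3"}),
    ("g2", {"d2"}, {"s1"}),
]


def count(gender, drugs=frozenset(), pt=None):
    """Reports with this gender that contain all given drugs and, if set, the PT."""
    return sum(
        1
        for g, ds, pts in REPORTS
        if g == gender and set(drugs) <= ds and (pt is None or pt in pts)
    )


def type2_cell(gender, drugs, pt):
    a = count(gender, drugs, pt)
    b = count(gender, drugs) - a
    c = count(gender, pt=pt) - a
    d = count(gender) - a - b - c
    return ContingencyCell(a, b, c, d)


cell = type2_cell("g1", {"d1", "d2"}, "s1")
print(cell)  # ContingencyCell(a=1, b=0, c=2, d=0)
```

With the four values stored side by side, evaluating a rule needs a single cell lookup instead of one lookup in each of four cubes.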
4.4 Cube Computation
In this section, we describe how to compute the two types of MDC cubes proposed in Section 4.3. To make our approach easy to implement in most DBMSs, we adopt the string data type to represent the multivalued dimension Drug, and we store each MDC cube as a table rather than a multidimensional array to cope with sparsity.
We first present a naive method that implements all cube computation in SQL, and then describe our proposed, more efficient algorithm, an Amortize-scan-based method. For illustration and to simplify the discussion, we assume the input data has been processed into a base table conforming to the AERS data format, as shown in Table 4-2, where ISR denotes the report id and PT (Preferred Term) is the reaction expressed in terms of the Medical Dictionary for Regulatory Activities (MedDRA) [21]. The rows with the same ISR together represent one ADR report. Note that all attributes are atomic.
4.4.1 A Naive SQL Implementation
In this study, we limit the cardinality of drug combinations to at most three, because larger combinations rarely occur and are hard to interpret. The basic task flow for computing all type-1 and type-2 MDC cubes is shown in Figure 4-5. There are four steps in this flow: (1) the base table (BT) undergoes a self-join to generate an intermediate table, IT2-1, that stores all reports involving drug combinations of length 2; (2) the base table BT and the intermediate table IT2-1 are joined to generate another intermediate table, IT3-1, that stores reports involving drug combinations of length 3; (3) the base table BT and the intermediate tables IT2-1 and IT3-1 are used to generate all type-1 MDC cubes; and (4) the type-1 MDC cubes are processed to generate all type-2 MDC cubes. The structure of the intermediate tables and the concept of steps 1 and 2 are depicted in Figure 4-6. In what follows, we describe the details of each step.
Table 4-2. An example base table conforming to AERS data format.
ISR Age Gender Weight Country Drug PT
1 a1 g1 w1 c1 d1 s1
1 a1 g1 w1 c1 d2 s2
1 a1 g1 w1 c1 d3 s1
1 a1 g1 w1 c1 d4 s2
1 a1 g1 w1 c1 d1 s2
1 a1 g1 w1 c1 d2 s1
1 a1 g1 w1 c1 d3 s2
1 a1 g1 w1 c1 d4 s1
2 a2 g1 w2 c2 d3 s1
2 a2 g1 w2 c2 d4 s3
2 a2 g1 w2 c2 d5 s1
2 a2 g1 w2 c2 d3 s3
2 a2 g1 w2 c2 d4 s1
2 a2 g1 w2 c2 d5 s3
3 a1 g2 w3 c3 d2 s1
3 a1 g2 w3 c3 d3 s2
3 a1 g2 w3 c3 d2 s2
3 a1 g2 w3 c3 d3 s1
Figure 4-5. The basic task flow for computing MDC cubes.
Figure 4-6. The structure of the intermediate tables.
Step 1. Create intermediate table IT2-1
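In this step, the base table undergoes a self-join operation to generate IT2-1, as described in the following SQL statement.
-- Pair each drug with every later-sorted drug in the same report.
-- DISTINCT removes the duplicates produced by the multiple PT rows of each drug.
-- The hyphenated table name is bracket-quoted (SQL Server style syntax assumed).
SELECT DISTINCT T1.*, T2.Drug AS Drug2 INTO [IT2-1]
FROM BT AS T1 JOIN BT AS T2
ON T1.ISR = T2.ISR AND T1.Drug < T2.Drug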
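Step 1 can be tried out on a small sample with Python's built-in sqlite3 module. This is a sketch under stated assumptions rather than the procedure used in the experiments: SQLite has no SELECT ... INTO, so CREATE TABLE ... AS SELECT is used instead; the table is named IT2_1 to avoid quoting the hyphen; the base table keeps only the ISR, Gender, Drug, and PT columns; and the rows are reports 2 and 3 of Table 4-2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BT (ISR INTEGER, Gender TEXT, Drug TEXT, PT TEXT)")
conn.executemany(
    "INSERT INTO BT VALUES (?, ?, ?, ?)",
    [
        (2, "g1", "d3", "s1"), (2, "g1", "d4", "s3"), (2, "g1", "d5", "s1"),
        (2, "g1", "d3", "s3"), (2, "g1", "d4", "s1"), (2, "g1", "d5", "s3"),
        (3, "g2", "d2", "s1"), (3, "g2", "d3", "s2"),
        (3, "g2", "d2", "s2"), (3, "g2", "d3", "s1"),
    ],
)

# Self-join on the report id; Drug < Drug2 keeps each unordered pair once.
conn.execute(
    """
    CREATE TABLE IT2_1 AS
    SELECT DISTINCT T1.*, T2.Drug AS Drug2
    FROM BT AS T1
    JOIN BT AS T2
      ON T1.ISR = T2.ISR AND T1.Drug < T2.Drug
    """
)

pairs = conn.execute(
    "SELECT DISTINCT ISR, Drug, Drug2 FROM IT2_1 ORDER BY ISR, Drug, Drug2"
).fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM IT2_1").fetchone()[0]
print(pairs)      # three drug pairs for report 2, one for report 3
print(row_count)  # 8: each pair appears once per PT of the report
```

The inequality in the join condition is what prevents a pair such as (d2, d3) from also appearing as (d3, d2), so each two-drug combination of a report is stored once per PT.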
Step 2. Create intermediate table IT3-1
We execute the following SQL statement to join tables BT and IT2-1 to generate IT3-1.
SELECT T1.*, T2.Drug AS Drug3 INTO [IT3-1]
FROM [IT2-1] AS T1 JOIN BT AS T2
ON T1.ISR = T2.ISR AND T2.Drug > T1.Drug2