CBM-SS method - Data Mining Approach - Previous Work for ADR Detection

Chapter 3 Previous Work for ADR Detection

3.2 Data Mining Approach

3.2.2 CBM-SS method

The core to the CBM-SS method is the concept of contingency cube, which is a variant

of OLAP cube dedicated to store the values to be used in the 2×2 contingency table. A contingency cube defined on the ADR reporting dataset is an n-dimensional cube C[A1, A2, …, An], where A1, A2, …, An-2 denote the predicate attributes, An-1 = Drug and An = PT. Except A_n−1 and An, the value set associated with each dimension Ai is domain(Ai) ∪ {*}, where domain(Ai) denotes the set of distinct values of Ai and “*” denotes “any” or “don’t care”.

Note the value sets associated with drug and PT do not include “*” because the contingency table is always associated with some specific drug and symptom. Each cell C[a1, a2, …, an] in the cube store conceptually the corresponding contingency table of Drug = an−1 and PT = an

with predicates A1 = a1, A2 = a2, …, An−2 = an−2. For example, consider the following ADR rule

R1: Gender = Female, Drug = Accutance → Symptom = abnormal baby.

The corresponding contingency table used for measuring the importance of this rule is shown in Table 3-3. And the corresponding 3-D contingency cube composed of Gender, Drug and Symptom is show in Figure 3-2.

Table 3-3. The 2×2 contingency table for identifying “Accutance - abnormal baby”

constrained by Gender = Female.

Gender=Female abnormal baby All other ADRs

Accutance a b

All other drugs c d

s4 others d1

others

d1 d2 d3 d4

g1g2 Gender

Drug

s1 s2 s3 s4 PT

Figure 3-2. An example 3-D contingency cube C(Gender, Drug, PT).

Intuitionally, the values of each cell can be obtained by aggregating the occurrences in the data warehouse. However, the reality is more complicated than we expect. The computation of values b, c, d in the contingency table involves the occurrence of negative items, which are not immediately accessible. To avoid counting through negative item, CBM-SS only store occurrences associated with positive items, and compute other values with a simple calculation. Given a contingency table composed of predicates P1 = v1, P2 = v2, …, Pk = vk, Drug = d, and PT = s, Table 3-4 details the formulas for computing all value cells a, b, c, d, where for simplicity we use count(v1, v2, …, vk, d, s) to denote the occurrences of itemset {P1 = v1, P2 = v2, …, Pk = vk, Drug = d, PT = s}. Each of the counts used in Table 3-4 indeed corresponds to a cell in the OLAP data cube. Figure 3-3 shows the correspondence between the values of the contingency table and the OLAP cube cells for the rule R1.

Table 3-4. The formulas for computing each cell in 2×2 contingency table.

Cell The formula

a count(p1, p2, … pn, d, s) b count(p1, p2, … pn, d) – a c count(p1, p2, … pn, s) – a d count(p1, p2, … pn) – a – b – c

Figure 3-3. An illustration of correspondence between contingency table and OLAP cube.

Gender =g1 PT = s1 other PTs drug = d1 count(g1, d1, s1) count(g1, d1) – a other drugs count(g1, s1) – a count(g1) – a – b – c

drug

g1 g2 *

s1 s2 s3 * d1

d2 d3

Symptom Gender

Chapter 4 Adverse Drug Interactions Detection with MDC Cub e

4.1 Problems with Contingency Cube Based Approach

As we have stated in Section 3.2, most previously proposed methods are not suitable for online analysis and detection of ADR signals, except the contingency cube based approach, CBM-SS. Nevertheless, CBM-SS can only detect ADRs caused by single drug; it may generate incorrect results when dealing with ADRs caused by drug interactions. The rationale is detailed in the following.

It is better to explain the deficiency with an example. Let us consider the data in Table 4-1, which consists of five reports, involving three attributes, Gender, Drug, and PT (Symptom). The corresponding contingency cube represented in the form of OLAP cube is shown in Figure 4-1. Now suppose we want to measure the following rule:

R2: g1, d1, d2 → s1

To do so we need to know the value count(g1, d1, d2, s1), which unfortunately is not available immediately from the OLAP cube in Figure 4-1. One possible way is to calculate

the minimum of the values stored in cells [g1, d1, s1] and [g1, d2, s1], i.e., min {count(g1, d1, s1), count(g1, d2, s1)} = 2.

So we obtain count(g1, d1, d2, s1) = 2, which is incorrect. Indeed, the correct value is 1. The main reason lead to this incorrect result is that the contingency cube provides no mechanism for retaining the aggregation contributed by non-atomic dimensions. In other words, the atomic dimensional structure of contingency cube cannot distinguish whether an accumulation is contributed, for a specific dimension, by a single or multiple dimensional values. In this example, note that Drug and PT are non-atomic dimensions. So, there is no way to know the first report containing d1 and d2 simultaneously, which contributes to the accumulation stored in cells [g1, d1, s1] and [g1, d2, s1], accounting for the extra increment in count(g1, d1, d2, s1). In light of this deficiency, we propose a new type of contingency cube, which embodies non-atomic dimensions into the cube structure. The definition and implementation detail will be elaborated in the following sections.

Table 4-1. An example dataset of ADR reports.

Tid Gender Drug PT 1 g1 d1,d2,d3 s1

2 g2 d2,d3 s2

3 g1 d2, d3 s1, s2

4 g1 d1, d3 s1, s3

5 g2 d2 s1

d1 d2 d3

* g1 g2 Gender

s1 s2 s3 * Symptom

Figure 4-1. The corresponding 3-D contingency cube in OLAP cube format for the example in Table 4-1.

4.2 Definition of MDC Cube

In light of the deficiency of contingency cube, we propose the new type of contingency cube, MDC cube, which embodies the concept of multivalued dimension, i.e., the member of a dimension can be a set of items (multivalues). Formally, a MDC cube defined on the ADR reporting dataset is an n-dimensional cube MC[A1, A2, …, An], where A1, A2, …, An-2 denote the predicate attributes, An−1 = Drug and An = PT. Except An−1 and An, the value set associated with each dimension Ai is domain(Ai) ∪ {*}, where domain(Ai) denotes the set of distinct values of Ai and “*” denotes “any” or “don’t care”. The domain for PT is the same as that in the contingency cube, Drug is a multivalued dimension. Specifically, let D be the set of all drugs. Then the domain of Drug dimension is equal to the power set of D except the empty set, i.e., domain(Drug) = 2^D – ∅. MDC cube again store conceptually the corresponding contingency table of Drug = an−1 and PT = an with predicates A1 = a1, A2 = a2, An−2 = an−2. But note that a_n−1 here may be a set of drugs, {d1, d2, …}. Let us consider the example in Table 4-1 again. The corresponding 3-D MDC cube is depicted in Figure 4-2, where the cell [g1, {d1, d2}, s4] stores the contingency table essential for computing the following ADR rule:

R3: Gender = g1, Drug = {d1, d2} → Symptom = s4.

Figure 4-2. The corresponding MDC Cube of the data set in Table 4-1.

4.3 Structure Implementation of MDC Cube

In this section, we will present how to implement the concept of MDC cube using the contemporary database technology. The primary concern is multivalued dimension and the cube cell. We propose two different types of representations, based on the strategies for implementing the multivalued dimension and the cube cell. For differentiation, we call them type-1 structure and type-2 structure.

4.3.1 Type-1 Structure

The first implementation, type-1 structure, follows the OLAP cube implementation for contingency cube, with the extension of using multivalued-type attribute to represent multivalued dimension. Most database systems provide string data type, which suffices for

the purpose. Some object-relational DBMSs, such as Oracle and DB2, even support collection (set) data type, which provides a more convenient index scheme. Since type-1 structure preserves the two features of OLAP cube, positive value dimension and atomic cell value, the values a, b, c, d in the contingency table are computed in the same way as described in Table 3-4. Let {p1, p2, …, pn, d, s} be the set of attributes involved in ADR detection, where p1, p2, …, pn denote predicate attributes, d for drug, and s for symptom.

Then we need to construct the following four sets of MDC cubes, where ** denote any subset of {p1, p2, …, pn}.

z Group 1: MC[**, s, d]

z Group 2: MC[**, s]

z Group 3: MC[**, d]

z Group 4: MC[**]

For example, assume the set of predicates is {Age, Weight, Gender, Country}. In the type-1 structure, we will need to produce the following MDC cubes, where ** denotes any subset of {Age, Weight, Gender, Country}.

1. For calculating the value a in a 2×2 contingency table:

MC[**, Drug , Symptom]

Number of the MDC cubes: 2⁴= 16

2. For calculating the value b in a 2×2 contingency table:

MC[**, Drug]

Number of the MDC cubes: 2⁴= 16

3. For calculating the value c in a 2×2 contingency table:

MC[**, Symptom]

Number of the MDC cubes: 2⁴= 16

4. For calculating the value d in a 2×2 contingency table:

MC[**]

Number of the MDC cubes: 2⁴= 16

The total numbers of MDC cubes are 16+16+16+16=64. Figure 4.3 shows an example type-1 MDC cube composed of Gender, Drug, Sympton.

Drug

Figure 4-3. A type-1 MDC cube, where the cell [{d1, d2}, g1, s1] saves the count of the rule R2.

4.3.2 Type-2 Structure

As described previously, type-1 structure requires many MDC cubes and each computation of the contingency table needs to access four different MDC cubes. In order to reduce the space of the storage and shorten the access time, we develop type-2 structure. The main characteristic that makes type-2 structure different from type-1 structure is composite cell value. That is, each cube cell stores the four values, a, b, c, d of the contingency table. In this way, we only need to construct the following set of MDC cubes, MC[**, s, d], where **

denotes any subset of {p1, p2, …, pn}.

Figure 4-4. A type-2 structure MDC cube composed of three dimensions, Gender, Drug and Symptom, where the cube cell [{d1, d2}, g1, s1] stores the four values in the

corresponding contingency table for computing the measure of the rule R2.

For example, consider the previous example in Section 4.3.1. We only need to construct 2⁴ = 16 MDC cubes. Figure 4-4 illustrates the corresponding type-2 MDC cube used for computing the measure of rule R2.

4.4 Cube Computation

In this section, we will describe how to compute the two different types of MDC cubes we propose in Section 4.3. To make our approaches easy to be implemented in most DBMS systems, we adopt string data type to represent the multivalued dimension Drug, and store each MDC cube as a table rather than multidimensional array to resolve the sparse problem.

We first present a naïve method, implementing all cube computation with SQL language, then describe our proposed more efficient algorithm, an Amortize-scan-based method. For illustration and simplifying the discussion, we assume the input data has been processed into a base table, conforming to the AERS data format, as shown in Table 4-2, where ISR denotes the report id and PT (Preferred Term) is the reaction in terms of the Medical Dictionary for Regulatory Activities (MedDRA) [21]. In this way, transactions with same ISR as a whole represents an ADR report. Note all attributes are atomic.

4.4.1 An Naive SQL Implementation

In this study, we limit the cardinality of drug combinations to at most three, because larger combinations are rarely occurred and hard to perceive. The basic task flow for computing all type-1 and type-2 MDC cubes is shown in Figure 4-5. There are four steps in this flow: (1) The base table (BT) is undergone a self-join to generate an intermediate table namely IT2-1 to store all reports involving drug combinations of length 2; (2) The base table BT and intermediate table IT2-1 are joined to generate another intermediate table, namely IT3-1, to store reports involving drug combinations of length 3; (3) The base table BT and the intermediate tables IT2-1 and IT3-1 are used to generate all type-1 MDC cubes; and (4) The

type-1 MDC cubes are processed to generate all type-2 MDC cubes. The structure of the intermediate tables and the concept of steps 1 and 2 are depicted in Figure 4-6. In what follows, we describe the details of each step.

Table 4-2. An example base table conforming to AERS data format.

ISR Age Gender Weight Country Drug PT

1 a1 g1 w1 c1 d1 s1

1 a1 g1 w1 c1 d2 s2

1 a1 g1 w1 c1 d3 s1

1 a1 g1 w1 c1 d4 s2

1 a1 g1 w1 c1 d1 s2

1 a1 g1 w1 c1 d2 s1

1 a1 g1 w1 c1 d3 s2

1 a1 g1 w1 c1 d4 s1

2 a2 g1 w2 c2 d3 s1

2 a2 g1 w2 c2 d4 s3

2 a2 g1 w2 c2 d5 s1

2 a2 g1 w2 c2 d3 s3

2 a2 g1 w2 c2 d4 s1

2 a2 g1 w2 c2 d5 s3

3 a1 g2 w3 c3 d2 s1

3 a1 g2 w3 c3 d3 s2

3 a1 g2 w3 c3 d2 s2

3 a1 g2 w3 c3 d3 s1

Figure 4-5. The basic task flow for computing MDC cubes.

Figure 4-6. The structure of the intermediate tables.

Step 1. Create intermediate table IT2-1

In this step, the base table is undergone a self-join operation to generate IT2-1, as described in the following SQL procedure.

SELECT T1.*, T2.Drug as Drug2 INTO IT2-1

FROM BT as T1 JOIN BT as T2

ON d1.ISR = T2.ISR AND T1.Drug < cast T2.Drug GROUP BY T1.*, T2.Drug

Step 2. Create intermediate table IT3-1

We executed the following SQL procedure to join tables BT and IT2-1 to generate IT3-1.

SELECT T1.*, T2.Drug as Drug3 INTO IT3-1

FROM IT2-1 as T1 JOIN BT as T2

ON T1.ISR = T2.ISR AND T2.Drug > T1.Drug2 GROUP BY T1.*, T2.Drug

Step 3. Create type-1 MDC Cubes

In this step, we perform the following SQL procedures to generate the four different groups of type-1 MDC cubes, where ** denotes any subset of attributes in {Age, Weight, Gender, Country}.

// Create MC[**, Drug, PT] MDC cubes SELECT {**}, Drug, PT, count (*) AS count INTO **_ Drug_ PT

FROM BT

GROUP BY **, Drug, PT INSERT INTO **_ Drug_ PT

SELECT {**}, (Drug+ Drug2) AS Drug, PT, count (*) AS count FROM IT2-1

GROUP BY **, Drug, Drug2, PT

INSERT INTO **_ Drug_ PT

SELECT {**}, (Drug+ Drug2 +Drug3) AS Drug, PT, count (*) AS count INTO **_ Drug_ PT

FROM IT3-1

GROUP BY **, Drug, Drug2, Drug3, PT

// Create MC[**, Drug] MDC cubes SELECT {**}, Drug, count (*) AS count INTO **_ Drug

FROM BT

GROUP BY **, Drug

INSERT INTO **_ Drug

SELECT {**}, (Drug+ Drug2) AS Drug, count (*) AS count FROM IT2-1

GROUP BY **, Drug, Drug2

INSERT INTO **_ Drug

SELECT {**}, (Drug+ Drug2 +Drug3) AS Drug, count (*) AS count INTO **_ Drug

FROM IT3-1

GROUP BY **, Drug, Drug2, Drug3

// Create MC[**, PT] MDC cubes

INTO **_PT FROM BT

GROUP BY **, PT

// Create MC[**] MDC cubes SELECT {**}, count (*) AS count INTO **

FROM BT GROUP BY **

Step4. Creating type-2 MDC Cubes

Recall that the type-2 MDC cubes saves the counts a, b, c and d in each cell, which indeed can be computed by aggregating other type-1 MDC cubes. To this end, we define and execute the following SQL procedures to create b, c and d counts. Note that we do not need to count

“a” because it is stored in type-1 MDC cubes with attributes of {**, Drug, PT}.

// Create count a

SELECT **, Drug, count AS a INTO **_ Drug_ PT_type2 FROM **_ Drug_ PT

// Create count b

INSERT INTO **_ Drug_ PT_type2 SELECT (T2.count – T1.count) AS b

FROM **_Drug_PT AS T1, **_Drug AS T2

WHERE T1.** = T2.** AND T1.Drug = T2.Drug

// Create count c

INSERT INTO **_ Drug_ PT_type2 SELECT (T2.count – T1.count) AS c FROM **_Drug_PT AS T1, **_PT AS T2 WHERE T1. ** = T2. ** AND T1.PT = T2.PT

// Create count d

INSERT INTO **_ Drug_ PT_type2 SELECT (T1.count –T2.count- b-c) AS d FROM ** AS T1, **_Drug_PT AS T2 WHERE T1. ** = T2. **

Example 4-1:

In this example, we will use the base table BT in Table 4-3, which is based on Table 4-1, to create the corresponding type-1 and type-2 MDC cubes.

First, the result after Step 1: Creating intermediate table IT2-1 is shown in Table 4-4.

Note that there is no transaction with ISR = 5 because of single PT. The result after Step 2:

Creating intermediate table IT3-1 is shown in Table 4-5, which consists of only one transaction with ISR = 1.

Next, Step 3: Creating type-1 MDC Cubes is executed, generating a “count” field and using the “string” data type to realize the multivalued Drug dimension. For simplicity and space limitation, we only show the results with Gender = g1. All of the four groups of type-1

MDC cubes generated from the BT, IT2-1 and IT3-1, are shown in Tables 4-6, 4-7, 4-8, and 4-9.

Table 4-3. The relational BT derived from Table 4-1.

ISR Gender Drug PT 1 g1 d1 s1

Table 4-4. The resulting IT2-1 generated from Table 4-3.

ISR Gender Drug Drug2 PT 1 g1 d1 d2 s1

Table 4-5. The resulting IT3-1 generated from Table 4-3.

ISR Gender Drug Drug2 Drug3 PT 1 g1 d1 d2 d3 s1

Table 4-6. Part of the corresponding type-1 MDC cube MC[Gender, Drug, PT]

generated from Table 4-3, with Gender = g1.

Gender Drug PT count

Table 4-7. Part of the type-1 MDC cube MC[Gender, Drug] generated from Table 4-3 with Gender = g1.

Table 4-8. Part of the type-1 MDC cube MC[Gender, PT] generated from Table 4-3 with Gender = g1.

Gender PT count

g1 s1 3 g1 s2 1 g1 s3 1

Table 4-9. Part of the type-1 MDC cube MC[Gender] generated from Table 4-3 with Gender = g1.

Gender count

g1 3

Finally, the type-2 MDC cubes are generated by executing Step 4, primarily focusing on computing the count values of b, c and d. For simplicity, let us consider the rule R2. Its corresponding value for count a is immediately available from Table 4-6, and we have a = 1.

i.e., the cell [g1, (d1, d2), s1, count = 1]. Next, by reading the corresponding cell [g1, (d1, d2)]

in Table 4-7 we obtain the value for count b, that is count[g1, (d1, d2)] − a = 1 − 1 = 0.

Similarly, from Table 4-8, we obtain the value for count c, that is, count[g1, s1] − a = 3 − 1 = 2, and from Table 4-9, we obtain the value for count d, that is, count[g1] – a − b – c = 3 – 1 – 0 – 2 = 0. The final resulting count values for measuring rule R3 are a = 1, b = 0, c = 2 and d = 0, which is saved in the type-2 MDC cube MC[Gender, Drug, PT], as shown in Table 4-10.

Table 4-10. Part of the type-2 MDC cube MC[Gender, Drug, PT] generated from Table 4-3 with Gender = g1.

Gender Drug PT a b c d

g1 d1, d2 s1 1 0 2 0

4.4.2 A Faster Implementation

The SQL procedures we described in Section 4.4.1 for creating the MDC cubes are easy to implement but consume lots of time. Firstly, the processes for generating IT2-1 and IT3-1 (steps 1 and 2) relies on the SQL join operation, which as we will demonstrate later is very inefficient. Secondly, the procedures executed in step 3 for computing all type-1 MDC cubes induce many times of I/O accesses. In fact, for an ADR report data consisting of n predicate attributes used for adverse drug interaction detection, the number of times to access table BT, IT2-1 and IT3-1 are 4×2ⁿ, 2ⁿ⁺¹ and 2ⁿ⁺¹, respectively, a very large amount as n increases.

In this section, we thus present a more efficient approach to accomplish the time-consuming steps 1 to 3. Specifically, we propose a faster version of join procedure to speed up the execution of steps 1 and 2, and a very efficient algorithm based on the concept of Amortize scan to reduce unnecessary times of accessing the some tables. Below we detail the design of each approach.

(1) Faster approach for intermediate tables generation

The primary problem for using SQL join operation is that it spends unnecessary time to repeatedly check the matching of two records, i.e., with the same ISR. Recall that the table format adopted by AERS systems, in order to satisfy the first normal form required by relational DBMS, has to divide an ADR report into several records with the same ISR. For example, see Table 4-1. The first report ISR=1 has been divided into three records in BT because the patient took three different drugs. Although the records in BT have been sorted, with records of the same ISR grouping together, the standard join operation still has to perform 3×3=9 matching inspections (because there are six different drug-drug combinations

in table IT2-1) to accomplish the generation of table IT2-1, and 3×6=18 matching inspections for generating table IT3-1.

For this reason, we develop a new algorithm for accomplishing steps 1 and 2. The basic idea is when we have read all the records of the same ISR, we create two lists to temporarily store all drugs and PTs of this report. Then for each PT we just enumerate all cardinality-2 and cardinality-3 drug combinations, and store the results into tables IT2-1 and IT3-1, respectively.

This approach not only eliminates a large amount of matching inspections, but also reduces the access of table BT into once. A description of this algorithm is presented in Figure 4-7.

(2) An efficient approach for implementing step 3

As we mentioned earlier in this section, the main problem of SQL implementation is the repeated access of tables BT, IT2-1, and IT3-1. To solve the problem, we adopt the concept of Amortize-scans [26], computing as many groupbys as possible at the same time to amortize disk reads. That is, we read only one time of BT into memory and use it to compute all type-1 MDC cubes. The same approach is applied to the case for computing type-1 MDC cubes from IT2-1 and IT3-1.

4.5 MDC Cube-Based Detection Methods

In this section, we describe the methods for detecting adverse drug interactions from the stored type-1 or type-2 MDC cubes. To simplify the discussion, we assume the analyzer has issued an ADR detection query in the following SQL-like format through some system that embodies our approaches.

Input: Base table BT, which has been sorted by ISR Output: Intermediate tables IT 2-1 and IT3-1 Steps:

1. while (not end of BT) 2. read the next record R;

3. if R.ISR = prevISR then

4. store R.Drug into drug-list and R.PT into PT-list;

5. else

6. for each cardinality-2 or cardinality-3 drug combinations drugset from drug-list do

7. for each PT in PT-list do

8. output prevISR, other attribute values, drugset, PT into IT2-1 or IT3-1;

9. endfor 10. prevISR = R.ISR;

11. endif 12. endwhile

Figure 4-7. Algorithm description of the proposed method for creating intermediate

tables IT2-1 and IT3-1.

在文檔中多維度值資料方體於藥物交互作用不良反應之偵測 (頁 19-0)