
Chapter 3 A Knowledge Discovery Platform for Analyzing and Detecting Adverse Drug

3.6 Rule Mining Engine

Besides drug and symptom, users are often interested in several other attributes of a report: the patient's age, gender, and weight, the year, and the country.

We let users select the mining attributes of interest from this set and then generate the ADR signals associated with the selected attributes by using the proposed algorithms.

Detecting signals of ADRs from spontaneous reporting databases is not an easy task, for two reasons. First, the data volume is huge. A spontaneous reporting database such as AERS grows very rapidly; more than 300,000 new reports are generated each year. Second, the frequency threshold used by most data mining techniques carries little weight here: a signal that occurs only three or more times is already regarded as noticeable.

In this context, the occurrence threshold has far less pruning power than the minimum support threshold used in classical association rule mining. Furthermore, the confidence measure used to evaluate the importance of an association rule is less meaningful; measures of disproportionality are used instead. In our work, we adopt four measures commonly used in pharmacovigilance: PRR, ROR, IC, and Leverage. These measures of disproportionality are based on the 2×2 contingency table, as described in section 2.4. Notice that the mining attributes corresponding to the antecedent of a rule include not only the drug attribute but also other attributes. Therefore, we extend the 2×2 contingency table, as shown in Table 3.5, and refine the four measures of disproportionality accordingly, as shown in Table 3.6. In summary, if an ADR rule that associates mining attributes, drugs, and symptoms occurs more than two times and its disproportionality exceeds the threshold, the rule represents a signal for a suspected ADR. In our preliminary experiments, applying existing techniques to this problem performed very poorly: each mining run took a long time to execute, and a large number of signals were extracted.

In light of the above observations, although some contemporary methods are good at identifying ADR signals, they are not suited to an interactive environment for ADR detection and analysis. In this thesis, we tackle the problem from a systematic and unified view. Aiming at building an ADR detection and analysis system, we consider four different forms of ADR rules and propose four methods for extracting them, namely CBM-SS, CBM-SM, CBM-MM, and ABCM-MS.

The details of these methods are described in Chapter 4.

Table 3.5. Extension of 2×2 contingency table.


Table 3.6. The four measures of disproportionality used for identifying ADRs.

*SE: standard error, SD: standard deviation
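As a rough illustration of how such measures follow from the four cells of a 2×2 contingency table, the sketch below uses the commonly cited point estimates of PRR, ROR, Leverage, and IC. It assumes the usual cell layout (a: antecedent and symptom, b: antecedent without the symptom, c: symptom without the antecedent, d: neither). The function names and the handling of zero denominators are assumptions made for this example; the refined definitions and thresholds of Table 3.6 take precedence wherever they differ.

```python
import math

def _ratio(numerator, denominator):
    """Divide, returning inf (or nan for 0/0) instead of raising on zero."""
    if denominator == 0:
        return math.inf if numerator > 0 else math.nan
    return numerator / denominator

def prr(a, b, c, d):
    """Proportional reporting ratio: [a / (a + b)] / [c / (c + d)]."""
    return _ratio(a * (c + d), c * (a + b))

def ror(a, b, c, d):
    """Reporting odds ratio: (a / b) / (c / d) = ad / bc."""
    return _ratio(a * d, b * c)

def leverage(a, b, c, d):
    """Leverage: P(x, y) - P(x) * P(y), estimated from the counts."""
    n = a + b + c + d
    return a / n - ((a + b) / n) * ((a + c) / n)

def ic(a, b, c, d):
    """Information component (point estimate): log2 of P(x, y) / (P(x) P(y))."""
    n = a + b + c + d
    if a == 0:
        return -math.inf
    return math.log2(a * n / ((a + b) * (a + c)))

# Hypothetical cell values, chosen only to exercise the functions.
cells = (3, 0, 1, 2)
print(prr(*cells), ror(*cells), leverage(*cells), ic(*cells))
```

Returning infinity when a denominator is zero is only one possible convention; a production system might instead apply a continuity correction or skip such rules.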

Chapter 4 Methods for Signal Detection of Adverse Drug Reactions

4.1 Rule Form

An association rule can be expressed as Antecedent → Consequence, where Antecedent and Consequence are sets of items. In the context of ADRs, the Antecedent is a set of items consisting of drugs and other selected mining attributes, and the Consequence is composed of one or more symptoms. Besides the case in which a single drug causes a single symptom, we must in some situations also consider drug-drug interactions and syndromes. In this section, we therefore consider four relationships between drugs and symptoms: (1) single drug and single symptom, (2) multiple drugs and single symptom, (3) single drug and multiple symptoms, and (4) multiple drugs and multiple symptoms.

(1) Single drug and single symptom: This form represents an ADR in which a single symptom is caused by a single drug. The rule can be expressed as follows:

Attribute1, Attribute2, ..., Attributen, D1 → S1

where Attribute1, Attribute2, ..., Attributen are a set of mining attributes, D1 denotes a single drug, and S1 is a symptom. For instance, the following is a single drug and single symptom association rule:

Age = ">60", Gender = "Female", Drug = "VIOXX" → Symptom = "Myocardial infarction"

Indeed, most previous work focused on a simplified form of this type of rule that expresses the drug-symptom association without considering other factors. We propose an algorithm, namely CBM-SS, for identifying this type of ADR rule efficiently. The detailed description of CBM-SS is given in section 4.2.2.

(2) Multiple drugs and single symptom: This form represents an ADR in which a single symptom is caused by a drug-drug interaction. The rule can be expressed as follows:

Attribute1, Attribute2, ..., Attributen, D1, D2, ..., Dn → S1

where Attribute1, Attribute2, ..., Attributen are a set of mining attributes, D1, D2, ..., Dn denote drugs used together, and S1 is a single symptom. Put simply, a drug-drug interaction occurs when the effects of one drug are changed by the presence of another drug. A drug may be safe when used alone, yet interact with other drugs and cause side effects when they are used together. For instance, a multiple drugs and single symptom association rule can be represented as:

Age = ">60", Drug1 = "NSAIDs", Drug2 = "Corticosteroids" → Symptom = "Peptic ulcer"

For this type of rule we propose an algorithm, ABCM-MS, which is a modified version of the CMAR algorithm [15]. The detailed description of ABCM-MS is presented in section 4.3.

(3) Single drug and multiple symptoms: The objective of this type of rule is to analyze the clustering of symptoms in order to determine whether there is a signal for a syndrome. It can be expressed as follows:

Attribute1, Attribute2, ..., Attributen, D1 → S1, S2, ..., Sn

where Attribute1, Attribute2, ..., Attributen are a set of mining attributes, D1 denotes a single drug, and S1, S2, ..., Sn represent a cluster of symptoms, such as a syndrome. For instance, a single drug and multiple symptoms association rule can be represented as:

Drug1 = "Terbinafine" → Symptom1 = "Arthralgia", Symptom2 = "Fever", Symptom3 = "Urticaria"

We propose an algorithm for identifying this type of signal, namely CBM-SM, which is based on CBM-SS and is described in section 4.2.3.

(4) Multiple drugs and multiple symptoms: This form combines drug-drug interactions and syndromes. It can be expressed as follows:

Attribute1, Attribute2, ..., Attributen, D1, D2, ..., Dn → S1, S2, ..., Sn

where Attribute1, Attribute2, ..., Attributen are a set of mining attributes, D1, D2, ..., Dn denote drugs used together, and S1, S2, ..., Sn represent a cluster of symptoms, such as a syndrome. For instance, a multiple drugs and multiple symptoms association rule can be represented as:

Drug1 = "Valproic acid", Drug2 = "Phenobarbital" → Symptom1 = "Liver disorder", Symptom2 = "Renal failure acute"

We propose an algorithm, namely CBM-MM, which is similar to CBM-SM and is described in section 4.2.4.
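The four rule forms can be summarized in a small data structure. The sketch below is only an illustration: the class name `ADRRule`, its fields, and the lookup table are assumptions made for this example, while the pairing of rule forms with algorithms follows the descriptions above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ADRRule:
    """Antecedent (mining attributes and drugs) -> Consequence (symptoms)."""
    attributes: tuple  # e.g. (("Age", ">60"), ("Gender", "Female"))
    drugs: tuple       # one or more drug names
    symptoms: tuple    # one or more symptom names

    def form(self):
        """Classify the rule into one of the four forms of this section."""
        if not self.drugs or not self.symptoms:
            raise ValueError("a rule needs at least one drug and one symptom")
        drug_part = "single drug" if len(self.drugs) == 1 else "multiple drugs"
        symptom_part = ("single symptom" if len(self.symptoms) == 1
                        else "multiple symptoms")
        return f"{drug_part} and {symptom_part}"

# Algorithm proposed for each rule form, as stated in the text.
ALGORITHM_FOR_FORM = {
    "single drug and single symptom": "CBM-SS",
    "multiple drugs and single symptom": "ABCM-MS",
    "single drug and multiple symptoms": "CBM-SM",
    "multiple drugs and multiple symptoms": "CBM-MM",
}

rule = ADRRule(
    attributes=(("Age", ">60"), ("Gender", "Female")),
    drugs=("VIOXX",),
    symptoms=("Myocardial infarction",),
)
print(rule.form(), "->", ALGORITHM_FOR_FORM[rule.form()])
```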

4.2 Cube Based Methods

Since a spontaneous reporting database usually contains a very large amount of data, facilitating interactive detection and analysis of ADRs is a challenging task. OLAP operations can respond to user queries immediately, and techniques for data cube processing are readily supported by contemporary data warehousing systems. Inspired by this, we adopt the OLAP cube technique in our ADR mining methods. The crucial issue is how to select suitable data cubes to store in advance so that the calculation of measures is sped up.

The definitions of the measures of disproportionality provide the answer: all of them are derived from the contingency table. In light of this, we propose the concept of contingency cubes to refer to the set of pre-stored data cubes for ADR detection. Although such a framework costs an extra amount of storage space, users can get fast and accurate results. In the following subsections, we describe the concept of contingency cubes and the proposed cube-based algorithms.

4.2.1 Contingency Cube

Before we introduce the concept of the contingency cube, we first have to tackle an important issue related to the many-to-many relationships existing in the ADR star. A report sent to the spontaneous reporting database contains information about one or more suspected drugs and symptoms. The relationships between the suspected drugs and symptoms are therefore not distinguished; e.g., a symptom may be caused by a certain drug or by a drug-drug interaction. To fit the data cube structure, we have to disintegrate each report into single drug and single symptom pairs. For example, a report that records two suspected drugs and three suspected symptoms is partitioned into six records. Besides, the associations between other risk patterns and symptoms must also be considered. Therefore, we generate a base cube that stores all combinations of mining attributes, drug, and symptom from the reports by executing the following SQL statement:

SELECT Demo.key, Year, Age, Gender, Weight, Country, Drug, PT
FROM DW
GROUP BY Demo.key, Year, Age, Gender, Weight, Country, Drug, PT

where DW refers to the ADR data warehouse and PT (Preferred Term) is the attribute that records the symptoms. Note that this base cube has no measure attribute, because each record is unique and so its count is always one, which can be omitted.
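The disintegration step can be sketched in a few lines. The example below assumes a hypothetical in-memory representation of a report (a dictionary whose field names are invented for this illustration) and reproduces the case mentioned above, in which a report with two suspected drugs and three suspected symptoms yields six records.

```python
from itertools import product

def disintegrate(report):
    """Split one report into one record per (drug, symptom) pair."""
    shared = (report["key"], report["year"], report["age"],
              report["gender"], report["weight"], report["country"])
    # Use a set so that a pair listed twice in a report yields one record,
    # mirroring the GROUP BY in the SQL statement.
    pairs = sorted(set(product(report["drugs"], report["symptoms"])))
    return [shared + pair for pair in pairs]

# Hypothetical report; the values are placeholders, not real data.
report = {
    "key": 1, "year": "y1", "age": "a1", "gender": "g1",
    "weight": "w1", "country": "c1",
    "drugs": ["dA", "dB"],
    "symptoms": ["sX", "sY", "sZ"],
}

records = disintegrate(report)
for record in records:
    print(record)
```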

Example 4.1. Consider an ADR warehouse with six reports, as shown in Figure 4.1. The resulting base cube BC generated from the six reports is shown in Table 4.1, where Year, Age, Gender, Weight, and Country are mining attributes while Drug and PT represent drug and symptom.

Figure 4.1. A simple ADR data warehouse with six reports.

For ADR signaling rules, the values of the disproportionality measures must be calculated from the 2×2 contingency table. Intuitively, the value of each cell can be obtained by aggregating the counts in the base cube. However, since the relationship between the Drug and PT attributes is many-to-many, counts aggregated directly from the base cube would include the same report repeatedly and would not coincide with the facts. Thus, we must first distinguish the uniqueness of drugs and symptoms and then calculate the counts. The detailed formulas are shown in Table 4.2, where count(mining attributes, drug, symptom) refers to the number of occurrences of the itemset forming a rule: mining attributes, drug → symptom. For example, for the rule Age = a2, Weight = w2, Drug = d1 → PT = s1, the itemset forming this rule occurs two times in the base cube, i.e., count(a2, w2, d1, s1) = 2. In short, a contingency cube is a data cube with dimensions constructed from the attributes corresponding to mining attributes, drug, and symptom, in which each cell stores the number of occurrences of the itemset forming a possible rule. These aggregated cubes are defined as contingency cubes.

Table 4.1. A base cube BC.

Demo_key  Year  Age  Gender  Weight  Country  Drug  PT
1         y3    a2   g1      w2      c3       d1    s1
1         y3    a2   g1      w2      c3       d2    s1
1         y3    a2   g1      w2      c3       d3    s1
2         y1    a1   g2      w2      c1       d2    s2
2         y1    a1   g2      w2      c1       d3    s2
3         y2    a2   g2      w2      c2       d1    s1
3         y2    a2   g2      w2      c2       d3    s1
3         y2    a2   g2      w2      c2       d1    s2
3         y2    a2   g2      w2      c2       d3    s2
4         y1    a2   g1      w1      c3       d1    s1
4         y1    a2   g1      w1      c3       d3    s1
5         y1    a1   g2      w1      c2       d2    s1
6         y3    a2   g1      w2      c1       d2    s3

Table 4.2. The formula of each cell in 2×2 contingency table.

Cell of contingency table   Formula
a                           count(mining attributes, drug, symptom)
b                           count(mining attributes, drug) – a
c                           count(symptom) – a
d                           N – a – b – c

*N: the number of reports.
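The formulas of Table 4.2 can be checked against the base cube of Table 4.1 with a short sketch. The helper names `count_reports` and `contingency_cells` are assumptions made for this illustration; the key point is that occurrences are counted per distinct report (Demo_key), not per row.

```python
# Rows of the base cube BC, copied from Table 4.1.
BC_TEXT = """
1 y3 a2 g1 w2 c3 d1 s1
1 y3 a2 g1 w2 c3 d2 s1
1 y3 a2 g1 w2 c3 d3 s1
2 y1 a1 g2 w2 c1 d2 s2
2 y1 a1 g2 w2 c1 d3 s2
3 y2 a2 g2 w2 c2 d1 s1
3 y2 a2 g2 w2 c2 d3 s1
3 y2 a2 g2 w2 c2 d1 s2
3 y2 a2 g2 w2 c2 d3 s2
4 y1 a2 g1 w1 c3 d1 s1
4 y1 a2 g1 w1 c3 d3 s1
5 y1 a1 g2 w1 c2 d2 s1
6 y3 a2 g1 w2 c1 d2 s3
"""
COLUMNS = ("Demo_key", "Year", "Age", "Gender", "Weight", "Country", "Drug", "PT")
BC = [dict(zip(COLUMNS, line.split())) for line in BC_TEXT.strip().splitlines()]

def count_reports(bc, **items):
    """Number of distinct reports with a row matching every given value."""
    return len({row["Demo_key"] for row in bc
                if all(row[name] == value for name, value in items.items())})

def contingency_cells(bc, drug, symptom, **attrs):
    """Cells a, b, c, d of Table 4.2 for the rule attrs, drug -> symptom."""
    n = len({row["Demo_key"] for row in bc})  # N: the number of reports
    a = count_reports(bc, Drug=drug, PT=symptom, **attrs)
    b = count_reports(bc, Drug=drug, **attrs) - a
    c = count_reports(bc, PT=symptom) - a
    d = n - a - b - c
    return a, b, c, d

print(count_reports(BC, Age="a2", Weight="w2", Drug="d1", PT="s1"))
print(contingency_cells(BC, "d1", "s1", Age="a2"))
```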

Note that Drug and PT are mandatory attributes that must be included in the dimensions of a contingency cube used for calculating the value a in a 2×2 contingency table. Since the five demographic mining attributes in BC are optional dimensions, we can construct 2⁵ = 32 different contingency cubes in total. The relationships among these 32 cubes can be expressed as a lattice, shown in Figure 4.2. In addition, since a rule must occur at least as many times as the frequency threshold, only the cells of a contingency cube whose count reaches the specified threshold have to be stored. In this context, the resulting contingency cube can be regarded as an iceberg cube [3]. The SQL statement for generating such a contingency cube from the base cube is as follows:

SELECT *, Drug, PT, count(*) AS count
FROM BC
GROUP BY *, Drug, PT
HAVING count(*) ≥ frequency threshold

where * can be replaced by any subset of five demographic mining attributes and BC represents the base cube.
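To make the template concrete, the sketch below enumerates the lattice of attribute subsets and runs the statement for one subset on the base cube of Table 4.1. It uses Python's built-in sqlite3 module purely for illustration; the helper names, the column alias `cnt`, and the choice of SQLite are assumptions and not part of the proposed system.

```python
import sqlite3
from itertools import combinations

# Rows of the base cube BC, copied from Table 4.1.
BC_TEXT = """
1 y3 a2 g1 w2 c3 d1 s1
1 y3 a2 g1 w2 c3 d2 s1
1 y3 a2 g1 w2 c3 d3 s1
2 y1 a1 g2 w2 c1 d2 s2
2 y1 a1 g2 w2 c1 d3 s2
3 y2 a2 g2 w2 c2 d1 s1
3 y2 a2 g2 w2 c2 d3 s1
3 y2 a2 g2 w2 c2 d1 s2
3 y2 a2 g2 w2 c2 d3 s2
4 y1 a2 g1 w1 c3 d1 s1
4 y1 a2 g1 w1 c3 d3 s1
5 y1 a1 g2 w1 c2 d2 s1
6 y3 a2 g1 w2 c1 d2 s3
"""
MINING_ATTRIBUTES = ("Year", "Age", "Gender", "Weight", "Country")

def lattice(attributes=MINING_ATTRIBUTES):
    """All subsets of the optional mining attributes (one cube per subset)."""
    return [subset for size in range(len(attributes) + 1)
            for subset in combinations(attributes, size)]

def iceberg_cube_sql(subset):
    """Fill the '*' placeholder of the template with a concrete subset."""
    dims = ", ".join(tuple(subset) + ("Drug", "PT"))
    return (f"SELECT {dims}, count(*) AS cnt FROM BC "
            f"GROUP BY {dims} HAVING count(*) >= ? ORDER BY Drug, PT")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BC "
             "(Demo_key, Year, Age, Gender, Weight, Country, Drug, PT)")
conn.executemany("INSERT INTO BC VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                 [line.split() for line in BC_TEXT.strip().splitlines()])

print(len(lattice()))  # number of cubes in the lattice
cube = conn.execute(iceberg_cube_sql(("Age",)), (3,)).fetchall()
print(cube)            # cells of <Age, Drug, PT> with count >= 3
```

Because each row of BC is a unique combination of report, mining attributes, drug, and symptom, counting rows per group here counts distinct reports.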

Figure 4.2. A lattice of contingency cubes.

The value b is obtained by subtracting the number of occurrences of the mining attributes, drug, and symptom together (i.e., the value a) from the number of occurrences of the mining attributes and drug together. For this calculation only the Drug attribute is mandatory, so we again have 2⁵ = 32 different contingency cubes, one for each subset of the mining attributes. The occurrences of mining attributes and drug together can be looked up in these contingency cubes. However, if these cubes were generated from the base cube in the same way as those for the value a, occurrences would be counted more than once, because each report has been disintegrated into drug and symptom pairs. Therefore, we must ensure that a drug is counted only once per report. The SQL statement for generating the contingency cubes that store the occurrences of mining attributes and drug together is as follows:

SELECT *, Drug, count(*) AS count
FROM ( SELECT DISTINCT Demo_key, *, Drug FROM BC )
GROUP BY *, Drug
HAVING count(*) ≥ frequency threshold

The value c is obtained by subtracting the number of occurrences of the mining attributes, drug, and symptom together from the number of occurrences of the suspected symptom. Only one contingency cube has to be constructed to store the occurrences of all symptoms. The SQL statement is as follows:

SELECT PT, count(*) AS count
FROM ( SELECT DISTINCT Demo_key, PT FROM BC )
GROUP BY PT
HAVING count(*) ≥ frequency threshold

The value d is simply calculated by subtracting a, b, and c from the total number of reports. Finally, we can use these cell values to calculate the measure value for signal detection. All possible contingency cubes, 65 in total (32 for the value a, 32 for the value b, and one for the value c), are generated and stored in advance in the off-line process. The expensive computation of generating suitable contingency cubes can thus be omitted from the on-line ADR detection process.
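The duplicate-counting problem for the values b and c can be seen directly on the base cube of Table 4.1. The sketch below, again using sqlite3 only as a convenient stand-in for the data warehouse, compares a naive row count with a count over distinct reports; the query text and variable names are assumptions made for this illustration.

```python
import sqlite3

# Rows of the base cube BC, copied from Table 4.1.
BC_TEXT = """
1 y3 a2 g1 w2 c3 d1 s1
1 y3 a2 g1 w2 c3 d2 s1
1 y3 a2 g1 w2 c3 d3 s1
2 y1 a1 g2 w2 c1 d2 s2
2 y1 a1 g2 w2 c1 d3 s2
3 y2 a2 g2 w2 c2 d1 s1
3 y2 a2 g2 w2 c2 d3 s1
3 y2 a2 g2 w2 c2 d1 s2
3 y2 a2 g2 w2 c2 d3 s2
4 y1 a2 g1 w1 c3 d1 s1
4 y1 a2 g1 w1 c3 d3 s1
5 y1 a1 g2 w1 c2 d2 s1
6 y3 a2 g1 w2 c1 d2 s3
"""
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BC "
             "(Demo_key, Year, Age, Gender, Weight, Country, Drug, PT)")
conn.executemany("INSERT INTO BC VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                 [line.split() for line in BC_TEXT.strip().splitlines()])

# Naive aggregation: a report with several symptoms is counted several times.
naive_drug = {(age, drug): n for age, drug, n in conn.execute(
    "SELECT Age, Drug, count(*) FROM BC GROUP BY Age, Drug")}

# Distinct aggregation: each (report, Age, Drug) combination counts once.
distinct_drug = {(age, drug): n for age, drug, n in conn.execute(
    "SELECT Age, Drug, count(*) "
    "FROM (SELECT DISTINCT Demo_key, Age, Drug FROM BC) "
    "GROUP BY Age, Drug")}

# The same contrast for the single symptom cube.
naive_pt = dict(conn.execute("SELECT PT, count(*) FROM BC GROUP BY PT"))
distinct_pt = dict(conn.execute(
    "SELECT PT, count(*) FROM (SELECT DISTINCT Demo_key, PT FROM BC) "
    "GROUP BY PT"))

print(naive_drug[("a2", "d1")], distinct_drug[("a2", "d1")])
print(naive_pt["s1"], distinct_pt["s1"])
```

In this data, report 3 lists drug d1 with two symptoms, so the naive count for (a2, d1) exceeds the number of reports that actually mention the drug.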

4.2.2 Single Drug and Single Symptom Detection

The first proposed algorithm, namely CBM-SS (Cube Based Mining for Signal Detection of Single Drug and Single Symptom), employs contingency cubes to generate signals of ADRs in which a single symptom is caused by a single drug, with or without other demographic mining attributes.

The CBM-SS algorithm consists of four phases: (1) Contingency cubes selection phase; (2) Rule generation phase; (3) Measure calculation phase; and (4) Signal ranking and top-k outputting phase. In the following, we describe each of the four phases. An example is also given to illustrate the CBM-SS algorithm.

(1) Contingency cubes selection phase

According to the mining attributes that users selected, the proposed approach selects the suitable contingency cubes from the stored contingency cubes and loads them into memory.

(2) Rule generation phase

The main task of this phase is to generate all rules from the selected contingency cubes. The proposed approach exhaustively lists every combination of the selected mining attributes, drug, and symptom. That is, each row of the contingency cube Ca, whose dimensions are Drug, PT, and the user-specified demographic mining attributes, generates a corresponding rule. The algorithm of this phase is described in Figure 4.3.

Input: Contingency cube Ca;
Output: A set of rules R;
Steps:
R = ∅;
for each row r = {α₁, α₂, ..., αₙ, d₁, s₁} of Ca do
    Let x = α₁ ∧ α₂ ∧ ... ∧ αₙ ∧ d₁;
    Let y = s₁;
    Form a rule a = x → y;
    R = R ∪ {a};
end for

Figure 4.3. Algorithmic description of rule generation phase.

(3) Measure calculation phase

In this phase, the proposed approach identifies the rules whose measure value passes the measure threshold. For each generated rule, the four values of the 2×2 contingency table, a, b, c, and d, are calculated. Then the measure value of each rule is computed according to the selected measure and checked against the threshold.

(4) Signal ranking and top-k outputting phase

The proposed approach sorts all signals by their measure values and outputs top-k signals for users in this phase. The parameter k is specified by users.

Example 4.2. Continuing with Example 4.1, assume that a query Q specifies Age as the selected mining attribute and PRR as the selected measure. In other words, the antecedent of each generated rule must contain age and drug information, and the consequent is a symptom. Let the frequency threshold be 3. The execution of the proposed CBM-SS algorithm is illustrated step by step. In the contingency cubes selection phase, the algorithm selects the suitable contingency cubes for query Q, i.e., Ca = 〈Age, Drug, PT〉, Cb = 〈Age, Drug〉, and Cc = 〈PT〉. These contingency cubes are shown in Figures 4.4, 4.5, and 4.6.

Figure 4.4. The contingency cube Ca.

Figure 4.5. The contingency cube Cb.

Figure 4.6. The contingency cube Cc.

In the rule generation phase, each row of the contingency cube Ca is issued as a rule. For example, the rule corresponding to the first row of Ca is:

Age = a2, Drug = d1 → Symptom = s1

Next, in the measure calculation phase, the measure of each generated rule is calculated and checked against the threshold. The results are shown in Table 4.3.

Table 4.3. The results after signals generation phase.

Rule                                  a  b  c  d  PRR  PRR - 1.96SE > 1
Age = a2, Drug = d1 → Symptom = s1    3  0  1  2  3    Yes
Age = a2, Drug = d3 → Symptom = s1    3  0  1  2  3    Yes

Finally, it sorts all signals according to their measure values and outputs top-k signals for users in the last phase. The results are shown in Table 4.4.

Table 4.4. The result of signals.

No.  Signal                                PRR
1    Age = a2, Drug = d1 → Symptom = s1    3
2    Age = a2, Drug = d3 → Symptom = s1    3
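The four phases of CBM-SS can be traced end to end on the data of Example 4.2 with the following sketch. It is a simplified illustration under stated assumptions, not the implementation evaluated in this thesis: the cubes are built in memory instead of being loaded from pre-stored tables, the function names are invented, and the signal test is a plain comparison of PRR with a threshold rather than the criterion in the last column of Table 4.3.

```python
import math
from collections import Counter

# Rows of the base cube BC, copied from Table 4.1.
BC_TEXT = """
1 y3 a2 g1 w2 c3 d1 s1
1 y3 a2 g1 w2 c3 d2 s1
1 y3 a2 g1 w2 c3 d3 s1
2 y1 a1 g2 w2 c1 d2 s2
2 y1 a1 g2 w2 c1 d3 s2
3 y2 a2 g2 w2 c2 d1 s1
3 y2 a2 g2 w2 c2 d3 s1
3 y2 a2 g2 w2 c2 d1 s2
3 y2 a2 g2 w2 c2 d3 s2
4 y1 a2 g1 w1 c3 d1 s1
4 y1 a2 g1 w1 c3 d3 s1
5 y1 a1 g2 w1 c2 d2 s1
6 y3 a2 g1 w2 c1 d2 s3
"""
COLUMNS = ("Demo_key", "Year", "Age", "Gender", "Weight", "Country", "Drug", "PT")
BC = [dict(zip(COLUMNS, line.split())) for line in BC_TEXT.strip().splitlines()]

def cube(bc, dims):
    """Count distinct reports for every combination of values of `dims`."""
    seen = {(row["Demo_key"],) + tuple(row[d] for d in dims) for row in bc}
    return Counter(key[1:] for key in seen)

def cbm_ss(bc, attrs, freq_threshold, measure_threshold, k):
    n = len({row["Demo_key"] for row in bc})
    # Phase 1: select the cubes Ca, Cb, and Cc for the chosen attributes.
    ca = cube(bc, attrs + ("Drug", "PT"))
    cb = cube(bc, attrs + ("Drug",))
    cc = cube(bc, ("PT",))
    signals = []
    # Phase 2: every sufficiently frequent row of Ca yields one rule.
    for key, a in ca.items():
        if a < freq_threshold:
            continue
        antecedent, symptom = key[:-1], key[-1]
        # Phase 3: fill the 2x2 table (Table 4.2) and compute PRR.
        b = cb[antecedent] - a
        c = cc[(symptom,)] - a
        d = n - a - b - c
        prr = math.inf if c == 0 else (a / (a + b)) / (c / (c + d))
        if prr > measure_threshold:
            signals.append((antecedent, symptom, prr))
    # Phase 4: rank by measure value (ties broken by rule text), keep top k.
    signals.sort(key=lambda s: (-s[2], s[0], s[1]))
    return signals[:k]

for antecedent, symptom, prr in cbm_ss(BC, ("Age",), 3, 1.0, 10):
    print(antecedent, "->", symptom, "PRR =", prr)
```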

4.2.3 Single Drug and Multiple Symptoms Detection

In this section, we describe another contingency cube based algorithm, namely CBM-SM (Cube Based Mining for Signal Detection of Single Drug and Multiple Symptoms). This algorithm is an extension of CBM-SS for detecting ADR signals involving a single drug and multiple symptoms. This kind of rule is similar to the hybrid association rule in multi-dimensional association rule mining [17]. In CBM-SM, the contingency cubes still need to be stored in advance. The main differences between CBM-SM and CBM-SS are:

(1) The Apriori-join approach [1] is employed to generate candidate itemsets and different calculating methods are used for support counting of itemsets.

(2) More contingency cubes have to be generated and pre-stored. Indeed, all attributes are optional, so a total of 127 (= 2⁷ − 1) contingency cubes need to be generated.

These contingency cubes can be generated directly from the base cube by roll-up operations. The corresponding SQL statement is as follows:

SELECT Demo_key, *, count(*)
FROM BC
GROUP BY Demo_key, *

where * can be replaced by any non-empty subset of the seven attributes Year, Age, Gender, Weight, Country, Drug, and PT.

The CBM-SM algorithm consists of five main phases: (1) Initial candidate generation phase; (2) Frequent itemset generation phase; (3) New candidate generation phase; (4) Rule generation phase; and (5) Signal generation phase. Phases 2 and 3 are repeated until no new itemsets are generated. In the following, we describe each of the five phases in detail. An example is also given to illustrate the CBM-SM algorithm.

(1) Initial candidate generation phase

In the first phase, CBM-SM exploits the necessary attributes for forming ADR rules, i.e., Drug and PT. Initial candidate 1-itemsets are then constructed from the selected attributes directly.

(2) Frequent itemset generation phase

The purpose of this phase is to count every candidate itemset and prune the infrequent ones. For each itemset in the set of candidate k-itemsets Ck, the algorithm first finds the corresponding contingency cube, i.e., the cube whose dimensions are the attributes of the items in the itemset. For instance, 〈Demo_key, Age, Drug〉 is the corresponding contingency cube for the candidate itemset {a1, d1}. The count of the itemset is then accumulated according to the following criterion: whenever the items of a candidate itemset occur together in a row of the corresponding contingency cube, the count of this candidate itemset is increased by one. The algorithm of this phase is described in Figure 4.7.

Input: Candidate k-itemsets Ck;
Set of contingency cubes 〈P(A1, A2, ..., An), Drug, PT〉, where P(A1, A2, ..., An) denotes the power set of the user-specified demographic attributes.
Output: Frequent k-itemsets Fk.
Steps:

Figure 4.7. Algorithm description of frequent itemset generation phase.
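The counting criterion of this phase can be sketched as follows, using the base cube of Table 4.1. This is only an illustration of the prose description: the function names and the representation of an item as an (attribute, value) pair are assumptions, and the sketch covers only itemsets with at most one item per attribute. Itemsets containing several PT items need a different counting method, which is not reproduced here.

```python
# Rows of the base cube BC, copied from Table 4.1.
BC_TEXT = """
1 y3 a2 g1 w2 c3 d1 s1
1 y3 a2 g1 w2 c3 d2 s1
1 y3 a2 g1 w2 c3 d3 s1
2 y1 a1 g2 w2 c1 d2 s2
2 y1 a1 g2 w2 c1 d3 s2
3 y2 a2 g2 w2 c2 d1 s1
3 y2 a2 g2 w2 c2 d3 s1
3 y2 a2 g2 w2 c2 d1 s2
3 y2 a2 g2 w2 c2 d3 s2
4 y1 a2 g1 w1 c3 d1 s1
4 y1 a2 g1 w1 c3 d3 s1
5 y1 a1 g2 w1 c2 d2 s1
6 y3 a2 g1 w2 c1 d2 s3
"""
COLUMNS = ("Demo_key", "Year", "Age", "Gender", "Weight", "Country", "Drug", "PT")
BC = [dict(zip(COLUMNS, line.split())) for line in BC_TEXT.strip().splitlines()]

def support(bc, itemset):
    """Count rows of the cube <Demo_key, attributes of itemset> that contain
    all items of the itemset (one item per attribute is assumed)."""
    attrs = sorted({attr for attr, _ in itemset})
    # Roll up the base cube to the attributes of the itemset.
    cube_rows = {(row["Demo_key"],) + tuple((a, row[a]) for a in attrs)
                 for row in bc}
    return sum(1 for row in cube_rows if set(itemset) <= set(row[1:]))

def frequent_itemsets(bc, candidates, threshold):
    """Keep the candidates whose count reaches the frequency threshold."""
    counts = {c: support(bc, c) for c in candidates}
    return {c: n for c, n in counts.items() if n >= threshold}

# Initial candidate 1-itemsets built from the Drug and PT attributes.
c1 = ([frozenset({("Drug", v)}) for v in ("d1", "d2", "d3")]
      + [frozenset({("PT", v)}) for v in ("s1", "s2", "s3")])
f1 = frequent_itemsets(BC, c1, 3)
for itemset, n in sorted(f1.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), n)
```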

(3) New candidate generation phase

In this phase, the new set of candidate (k+1)-itemsets Ck+1 will be generated from frequent itemsets Fk. The algorithm of this phase is shown in Figure 4.8.

In lines 1 to 5, any two itemsets I1 and I2 in the set of frequent k-itemsets Fk form a candidate (k+1)-itemset c if their first k−1 items are the same and their last items are different and, unless both are PT items, belong to different attributes. Then, in lines 6 to 11, a candidate (k+1)-itemset c is deleted if it satisfies either of two conditions: (1) some k-subset of c does not belong to the frequent k-itemsets Fk; (2) apart from the PT attribute, at least two items of c belong to the same attribute.
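The join and prune steps described above can be sketched as follows. The sketch is based on the prose description only, not on the listing of Figure 4.8; the function name, the representation of an itemset as a sorted tuple of (attribute, value) pairs, and the sample itemsets are assumptions made for this illustration.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Join frequent k-itemsets into candidate (k+1)-itemsets and prune them.

    Each itemset is an iterable of (attribute, value) pairs.
    """
    frequent = {tuple(sorted(itemset)) for itemset in frequent_k}
    candidates = set()
    # Join step: same first k-1 items, different last items, and the last
    # items may share an attribute only if that attribute is PT.
    for i1, i2 in combinations(sorted(frequent), 2):
        if i1[:-1] != i2[:-1]:
            continue
        attr1, attr2 = i1[-1][0], i2[-1][0]
        if attr1 == attr2 and attr1 != "PT":
            continue
        candidates.add(i1 + (i2[-1],))
    # Prune step: drop c if some k-subset is infrequent or if two non-PT
    # items of c belong to the same attribute.
    pruned = set()
    for c in candidates:
        k = len(c) - 1
        subsets_frequent = all(s in frequent for s in combinations(c, k))
        non_pt = [attr for attr, _ in c if attr != "PT"]
        if subsets_frequent and len(non_pt) == len(set(non_pt)):
            pruned.add(c)
    return pruned

# Hypothetical frequent 1-itemsets, chosen only to exercise the function.
f1 = [
    (("Age", "a2"),),
    (("Drug", "d1"),),
    (("Drug", "d3"),),
    (("PT", "s1"),),
    (("PT", "s2"),),
]
c2 = generate_candidates(f1)
for candidate in sorted(c2):
    print(candidate)
```

Two drugs are never joined, which keeps the candidates within the single drug form, whereas two PT items may be joined so that multiple symptoms can appear in the consequent.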
