Co-training learning - Related Work - Detecting Duplicate Records in Adverse Drug Reaction Reporting Systems Using Co-training and Ensemble Learning

Chapter 3 Related Work

3.2 Classification

3.2.2 Co-training learning

Co-training was introduced by Blum and Mitchell in 1998 [4] to build learning models from the data set with very small amounts of labeled examples and large amounts of unlabeled data. The original model was developed based on the assumption that there exist two different views (characterization or feature sets of data) to classify the data and these two views are conditionally independent. For example, a web page can be classified according to the words occurring at that page (one view) or the words occurring in hyperlinks pointing to that page (another view).

Co-training builds models in the following way. Initially, each view (feature set) is used to build a classifier from the labeled data, resulting in two separate classifiers. In subsequent iterations, the labeled data is augmented with unlabeled examples whose selection and labels are determined cooperatively according to the results of the two classifiers. The two classifiers are then retrained on the expanded training data.

The entire process continues until the performance of the classifiers converges or no new unlabeled examples can be selected. In short, co-training uses the unlabeled data to bootstrap the classifiers, helping them achieve better classification results.
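
The loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of [4]: the nearest-centroid classifier, the margin-based confidence, and the names `fit_centroids`, `predict`, `co_train`, `rounds`, and `per_view` are all assumptions made for the example. In each round, each view nominates the unlabeled examples it is most confident about, and both views are retrained on the grown labeled set.

```python
def fit_centroids(X, y):
    """Toy classifier (an assumption): one mean vector (centroid) per label."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {lab: [sum(col) / len(rows) for col in zip(*rows)]
            for lab, rows in groups.items()}


def predict(model, x):
    """Return (label, confidence); confidence is the distance margin
    between the closest and the second-closest centroid."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(x, c)), lab)
                   for lab, c in model.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def co_train(labeled, unlabeled, rounds=10, per_view=1):
    """labeled: [((view1, view2), label)]; unlabeled: [(view1, view2)]."""
    labeled, pool = list(labeled), list(unlabeled)

    def fit_views():
        y = [lab for _, lab in labeled]
        return [fit_centroids([x[v] for x, _ in labeled], y) for v in (0, 1)]

    for _ in range(rounds):
        if not pool:
            break  # no unlabeled example left to select
        picked = {}
        for v, model in enumerate(fit_views()):
            # Each view nominates the pool examples it is most confident about.
            ranked = sorted(range(len(pool)),
                            key=lambda i: -predict(model, pool[i][v])[1])
            for i in ranked[:per_view]:
                # If both views nominate the same example, the first view's label is kept.
                picked.setdefault(i, predict(model, pool[i][v])[0])
        labeled += [(pool[i], lab) for i, lab in picked.items()]
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return fit_views(), labeled
```

Stopping when the pool is empty mirrors the "no new unlabeled examples can be selected" condition; a convergence check on held-out performance is omitted for brevity.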

The rationale behind co-training is intuitive. A small amount of training data is not necessarily representative of the entire data set. Judiciously adding appropriate unlabeled data to the training data therefore makes the training set more representative of the entire population.

Since the work of Blum and Mitchell, co-training has been extended in several directions, including the number of classifiers [10][30][53], the use of multiple learning algorithms [42][52], and feature selection [47][50].

All extensions retain the core idea of co-training: exploiting large amounts of unlabeled data to improve the accuracy of the models.

Chapter 4 Duplicate Detection in FAERS Dataset

4.1 Characteristics of FAERS Dataset

FAERS is a database designed to help the FDA monitor the safety of drugs after they are marketed; it contains reports of adverse events and medication errors. In the United States, healthcare professionals (such as doctors, pharmacists, and nurses) and consumers (such as patients, their families, and lawyers) report adverse events and medication errors voluntarily. These reports may also be submitted to the manufacturers. The manufacturers, however, are obligated under FDA regulations to regularly report to the FDA any adverse events associated with their medical products.

As described in Section 2.1, the FAERS database is composed of seven relational tables, of which the DEMO table is the master file that records information directly related to patients and the report itself. There are 23 attributes in total in the DEMO file: ISR, CASE, I_F_COD, FOLL_SEQ, IMAGE, EVENT_DT, MFR_DT, FDA_DT, REPT_COD, MFR_NUM, MFR_SNDR, AGE, AGE_COD, GNDR_COD, E_SUB, WT, WT_COD, REPT_DT, OCCP_COD, DEATH_DT, TO_MFR, CONFID, and REPORTER_COUNTRY. The detailed meaning of each attribute is shown in Table 4.1. We highlight in particular some attributes that are important to the duplication problem.

ISR (the primary key) is the unique number identifying an FAERS report, while CASE is the number identifying an FAERS case. In other words, an FAERS case (event) may have several different reports, i.e., different ISRs. The main reason behind this phenomenon is that an initial report may be followed by follow-up reports referring to the same FAERS case. This information is recorded in the I_F_COD field, with “I” denoting an initial report and “F” denoting a follow-up report.

Table 4.1 Detailed description of the DEMO table.

| Field name | Null probability (07Q2) | Explanation |
| --- | --- | --- |
| ISR | 0 | Unique number for identifying an FAERS report. |
| CASE | 0 | Number for identifying an FAERS case. |
| I_F_COD | 0 | Initial or follow-up status of report. |
| FOLL_SEQ | 93.1 | The sequence number of a follow-up report. |
| IMAGE | 0 | Identifier for an FAERS report image. |
| EVENT_DT | 31.3 | Date adverse event occurred or began. |
| MFR_DT | 6.9 | Date manufacturer received information. |
| FDA_DT | 0 | Date FDA received report. |
| REPT_COD | 0 | Type of report submitted. |
| MFR_NUM | 6.9 | Manufacturer's unique report identifier. |
| MFR_SNDR | 6.9 | Verbatim name of manufacturer sending report. |
| AGE | 38.8 | Age. |
| AGE_COD | 38.8 | Unit. |
| GNDR_COD | 5.8 | Gender. |
| E_SUB | 0 | This report was submitted under the electronic submissions procedure. |
| WT | 65.9 | Weight. |
| WT_COD | 65.9 | Unit. |
| REPT_DT | ≈0 | Date report was sent. |
| OCCP_COD | 16.3 | Reporter's type of occupation. |
| DEATH_DT | 92.7 | This field remains but is no longer populated with data from 2010. |
| TO_MFR | 93.1 | Voluntary reporter also notified manufacturer. |
| CONFID | 93.1 | Voluntary’s identity should not be disclosed to the product manufacturer. |
| REPORTER_COUNTRY | 0 | Reporters’ country. |

The FAERS database contains many missing values. As an illustration, we analyzed the 2007Q2 dataset and computed, for each attribute, the probability of its value being null. The statistics are also listed in Table 4.1. Four attributes have null probabilities over 90%: FOLL_SEQ (the sequence number of a follow-up report), DEATH_DT (the death date of the patient), TO_MFR (whether or not the reporter also notified the manufacturer), and CONFID (whether or not the voluntary reporter's identity should be withheld from the manufacturer). Since null values cause many data analysis problems, we describe our approach to handling missing values in Section 5.1.
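
A per-attribute null percentage such as the one reported in Table 4.1 can be computed as sketched below. The function name `null_percentages`, the representation of records as dictionaries, and the choice to treat both `None` and the empty string as null are assumptions made for illustration; the source does not describe how null values are encoded in the raw files.

```python
def null_percentages(rows, fields):
    """Percentage of records in which each field is null.

    rows   -- list of dicts, one per DEMO record (assumed representation)
    fields -- attribute names to inspect
    A value counts as null if it is None or an empty string (assumption).
    """
    if not rows:
        return {}
    total = len(rows)
    return {
        f: 100.0 * sum(1 for r in rows if r.get(f) in (None, "")) / total
        for f in fields
    }
```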

4.2 Duplicate Reporting Problem in FAERS Dataset

As described in Section 4.1, healthcare professionals and consumers can voluntarily report adverse events and medication errors directly to the FDA, and these reports can also be submitted to the manufacturers. In addition, according to the guidance released by the FDA [15], any drug manufacturer that is aware of an adverse event involving one of its products is obligated to report that event. Hence, if an event involves several drugs (which is usually the case), the FDA may receive multiple reports referring to the same case, but from different manufacturers.

All of the situations described above cause duplicate records in the FAERS database, which may or may not be identified by the FDA. For example, Table 4.2 is a sample dataset extracted from FAERS. Records #12 and #13 have the same CASE no. but different ISR no., and both are recorded as initial reports. In other words, these two reports are duplicates submitted by different reporters. The last two records, however, show another scenario. They have different CASE no. but the same ISR no., and both are identified as initial reports. Since both records exhibit the same values in all other attributes, we conclude that they are duplicates. Unfortunately, this case was not correctly identified by the FAERS system.

The problem of duplicate detection in the FAERS database is complicated by the existence of follow-up reports. A follow-up is a supplement to the initial report that contains updated information, such as corrections, medication changes, and changes in adverse reactions. As a consequence, a follow-up closely resembles its initial report or other linked follow-ups. A poorly designed detection method that overlooks this phenomenon would yield incorrect results, wrongly identifying (initial, follow-up) or (follow-up, follow-up) record pairs as duplicate reports, i.e., (initial, initial). The action taken for duplicate reports differs from that for initial / follow-up or follow-up / follow-up cases. If two records are identified as duplicates, only one record should be retained, whereas a follow-up should be merged with its initial report or preceding follow-up to form a more accurate report. Confusing these two situations biases the case counts used in ADR signal detection and results in incorrect signals. To the best of our knowledge, no previous work on ADR duplicate detection differentiates follow-up linkage from real duplicates.

Table 4.2 FAERS data sample.

In accordance with the previous discussion, we view the problem of duplicate detection in the FAERS database as a tri-class classification problem. That is, given any two ADR records, we would like to classify the record pair into one of the following three scenarios (classes):

(1) Real duplicate (label is II): The two records are true duplicates of each other. In the FAERS data format, this case corresponds to records with identical CASE no., different ISR no., and both coded “I” in the I_F_COD field. For example, see the 12th and 13th records in Table 4.2. We use the label “II” to denote record pairs belonging to this scenario.

(2) Follow-up linkage (label is IF&FF): The relationship between the record pair is initial / follow-up or follow-up / follow-up. According to the FAERS coding format, the former corresponds to records with the same CASE no. but different ISR no. and different I_F_COD, one with “I” and the other with “F”. For example, the 13th and 14th records in Table 4.2 represent this case. The latter corresponds to records with identical CASE no. but different ISR no., both with an I_F_COD of “F”. Records 10 and 11 in Table 4.2 are an example of this case. We use the label “IF&FF” to denote this category.

(3) Others (label is OTHER): All cases other than real duplicate and follow-up linkage. We use the label “OTHER” to denote this category. Intuitively, real duplicates and follow-up linkages are rare. Most record pairs belong to the “OTHER” category.
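
The three labeling rules above translate directly into code. The following is a minimal sketch; the function name `label_pair` and the representation of a record as a dictionary with the keys ISR, CASE, and I_F_COD are assumptions made for illustration.

```python
def label_pair(a, b):
    """Assign the tri-class label to a pair of DEMO records.

    a, b -- dicts with the keys "ISR", "CASE" and "I_F_COD"
            (assumed representation of a FAERS record)
    """
    if a["CASE"] == b["CASE"] and a["ISR"] != b["ISR"]:
        codes = {a["I_F_COD"], b["I_F_COD"]}
        if codes == {"I"}:
            return "II"        # both initial reports: real duplicate
        if codes <= {"I", "F"}:
            return "IF&FF"     # initial / follow-up or follow-up / follow-up
    return "OTHER"
```

Note that these labels reflect only what FAERS has recorded; pairs that FAERS failed to link fall into OTHER even when they are in fact duplicates.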

Chapter 5 Proposed Duplicate Detection Method

In this chapter, we introduce a duplicate detection algorithm based on ensemble and co-training learning. We first describe the data preprocessing mechanisms in Section 5.1, training set construction in Section 5.2, and then detail our proposed method in Section 5.3.

5.1 Data Preparation

The first step of data preparation is the choice of attributes. We chose fifteen fields (EVENT_DT, MFR_DT, FDA_DT, REPT_COD, MFR_NUM, MFR_SNDR, AGE, GNDR_COD, WT, REPT_DT, OCCP_COD, REPORTER_COUNTRY, DRUGNAME, PT, OUTCOME) from the FAERS database. An attribute was excluded if it contains many missing values or is meaningless to the task of duplicate detection. The detailed reasons for each choice are shown in Table 5.1.

Secondly, we deleted all tuples containing a missing entry. This is because two records that are both missing the same field may be treated as having the same value during similarity calculation, thus severely biasing the analysis results. For example, the 07Q2 dataset originally contains 83977 records in FAERS; 12789 records remain after deleting those with null values.

Table 5.1 Detailed description of attribute selection.

| Field name | Null probability (07Q2) | | |
| --- | --- | --- | --- |
| … | | | … populated with data from 2010. |
| TO_MFR | 93.1 | Y | Too much null value will cause training data reduction. |
| CONFID | 93.1 | Y | Too much null value will cause training data reduction. |
| REPORTER_COUNTRY | 0 | | |

Finally, the data preparation ends with transforming the data file into labeled record pairs, each of which is represented by a similarity vector. Let F be a selected data file consisting of g records and h attributes taken from the FAERS database, and let F’ denote a copy of F. For any two records $r_i$ and $r'_j$, $r_i$ from F and $r'_j$ from F’, we construct the following new record:

$$\big(l,\ \mathrm{sim}(f_{i1}, f'_{j1}),\ \mathrm{sim}(f_{i2}, f'_{j2}),\ \ldots,\ \mathrm{sim}(f_{ih}, f'_{jh})\big)$$

where $l$ denotes the label of this record pair, $f_{ik}$ and $f'_{jk}$ are the values of attribute $A_k$ in F and F’, respectively, for $1 \le i, j \le g$ and $1 \le k \le h$, and $\mathrm{sim}(\cdot)$ is the similarity function.

Figure 5.1 illustrates the record pair transformation. After the transformation, every record is represented as a label attribute plus a vector of similarities. Figure 5.2 is a real sample of this transformed data set.
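
The transformation can be sketched as follows. The function name `to_similarity_records`, the dictionary representation of records, and the per-field similarity functions passed in as `sim_funcs` are assumptions made for illustration. The sketch enumerates each unordered pair once (i < j), as Figure 5.1 suggests; a full cross join of F with F’ would instead use `itertools.product`.

```python
from itertools import combinations


def to_similarity_records(records, sim_funcs, label_fn):
    """Turn g records into labeled similarity vectors, one per record pair.

    records   -- list of dicts, one per FAERS record (assumed representation)
    sim_funcs -- {field name: function(a, b) -> similarity in [0, 1]}
    label_fn  -- function(record_a, record_b) -> class label
    """
    fields = list(sim_funcs)
    return [
        (label_fn(a, b), [sim_funcs[f](a[f], b[f]) for f in fields])
        for a, b in combinations(records, 2)  # pairs (i, j) with i < j
    ]
```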

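| Label | $S_1$ | $S_2$ | … | $S_h$ |
| --- | --- | --- | --- | --- |
| $l_1$ | $\mathrm{sim}(f_{11}, f'_{21})$ | $\mathrm{sim}(f_{12}, f'_{22})$ | … | $\mathrm{sim}(f_{1h}, f'_{2h})$ |
| … | … | … | … | … |
| | $\mathrm{sim}(f_{11}, f'_{g1})$ | $\mathrm{sim}(f_{12}, f'_{g2})$ | … | $\mathrm{sim}(f_{1h}, f'_{gh})$ |
| | $\mathrm{sim}(f_{21}, f'_{31})$ | $\mathrm{sim}(f_{22}, f'_{32})$ | … | $\mathrm{sim}(f_{2h}, f'_{3h})$ |
| … | … | … | … | … |
| | $\mathrm{sim}(f_{21}, f'_{g1})$ | $\mathrm{sim}(f_{22}, f'_{g2})$ | … | $\mathrm{sim}(f_{2h}, f'_{gh})$ |
| … | … | … | … | … |
| $l_n$ | $\mathrm{sim}(f_{g-1,1}, f'_{g1})$ | $\mathrm{sim}(f_{g-1,2}, f'_{g2})$ | … | $\mathrm{sim}(f_{g-1,h}, f'_{gh})$ |

Figure 5.1 An illustration of the record pair transformation.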

Figure 5.2 A sample of transformed data.

Before we complete the discussion of data preparation, we will describe the methods adopted in the similarity calculation. The attributes are heterogeneous, which can be divided into two types: categorical and numerical. Different similarity measurements have to be used for different types of attributes.

The categorical fields include REPT_COD, MFR_NUM, MFR_SNDR, GNDR_COD, OCCP_COD, REPORTER_COUNTRY, DRUGNAME, PT, and OUTCOME. The following simple function was used,



The numeric fields are EVENT_DT, MFR_DT, FDA_DT, WT, AGE, and REPT_DT, which can be further divided into two types: interval and ratio.

Calendar dates, such as EVENT_DT, MFR_DT, FDA_DT, and REPT_DT, are of the interval type. Their similarity is calculated as follows:

On the other hand, the fields WT and AGE are of the ratio type. Their similarity is defined as follows:

sim(f, f’) = 1 / (1 + |f – f ’|).

However, the FAERS database allows different units for the fields WT and AGE, such as kilogram, gram, and pound for WT and year, month, week, day, and hour for AGE. Therefore, we have to convert them to a single unit of measurement, i.e., kilogram for WT and month for AGE. Table 5.2 shows an example of similarity calculation for the input record pair ((GLAXOSMITHKLINE, M, 20050809, CN, FRANCE, 68.4, 120) and (BRISTOL-MYERS SQUIBB COMPANY, M, 20051003, OT, FRANCE, 79.2, 60)).
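
The ratio similarity and the unit conversion can be sketched as below. Only `sim_ratio` follows a formula stated in the text, sim(f, f’) = 1 / (1 + |f – f’|). Everything else is an assumption made for illustration: the unit codes (`KG`, `GMS`, `LBS`, `YR`, `MON`, `WK`, `DY`, `HR`), the average month length of 30.4375 days, and the exact-match rule in `sim_categorical`. The similarity for calendar dates is left out rather than guessed.

```python
# Unit codes and conversion factors are assumptions for illustration.
KG_PER_UNIT = {"KG": 1.0, "GMS": 0.001, "LBS": 0.45359237}
DAYS_PER_MONTH = 30.4375  # assumed average month length
MONTHS_PER_UNIT = {
    "YR": 12.0,
    "MON": 1.0,
    "WK": 7 / DAYS_PER_MONTH,
    "DY": 1 / DAYS_PER_MONTH,
    "HR": 1 / (24 * DAYS_PER_MONTH),
}


def to_kg(value, unit):
    """Convert a WT value to kilograms."""
    return value * KG_PER_UNIT[unit]


def to_months(value, unit):
    """Convert an AGE value to months."""
    return value * MONTHS_PER_UNIT[unit]


def sim_ratio(a, b):
    """Ratio-type similarity from the text: 1 / (1 + |a - b|)."""
    return 1.0 / (1.0 + abs(a - b))


def sim_categorical(a, b):
    """Assumed exact-match similarity: 1 if equal, 0 otherwise."""
    return 1.0 if a == b else 0.0
```

For the example pair, the ratio formula gives 1 / (1 + |68.4 – 79.2|) = 1 / 11.8 ≈ 0.085 for WT and 1 / (1 + |120 – 60|) = 1 / 61 ≈ 0.016 for AGE.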

Table 5.2 An example similarity calculation.

Field name MFR_SNDR GNDR REPT_DT OCCP COUNTRY WT AGE

5.2 Training Set Construction

The construction of the training set is not trivial. Two main issues worthy of concern are imbalanced classes and uncertain labeling.

In the FAERS database, since most records have no duplicate or follow-up, only very few record pairs are labeled as II or IF&FF; thus the OTHER class dominates.

As shown in Table 5.3, there remain 17277 records in the 07Q2 dataset after pruning records with missing values, and we obtained 162409536 record pairs after cross joining all records, of which 24674 record pairs are II or IF&FF, accounting for only about 0.015% of the population. It is well known that imbalanced data significantly compromise the performance of standard classification algorithms, inducing rules that heavily favor the majority classes over the minority ones [21][46]. To alleviate the problem caused by imbalanced data, we adopted random undersampling, i.e., randomly choosing a subset of the majority class with size approximately equal to that of the minority classes.

Next, we address the second issue. As mentioned in Section 4.2, it is not uncommon for duplicate or follow-up reports to go unidentified by the FDA and therefore to be incorrectly coded in the FAERS database. This means that the OTHER class in fact contains record pairs that belong to II or IF&FF, and these pairs are precisely the target set that our algorithm aims to detect. In other words, record pairs of the OTHER class are uncertainly labeled. To alleviate the influence of this uncertain labeling, we added a further constraint to random undersampling: the pair similarity of the data sampled from OTHER must be no larger than 0.5. This is because highly similar record pairs tend to belong to II or IF&FF.
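
The two measures, random undersampling and the similarity cap on OTHER, can be combined as in the following sketch. The function name `build_training_set` and its parameters are hypothetical. The text does not say how the pair similarity of a whole record pair is computed, so the sketch assumes the mean of the similarity vector; the default sample size (the number of minority pairs) is likewise an assumption.

```python
import random


def build_training_set(pairs, n_other=None, cap=0.5, seed=0):
    """Undersample OTHER, keeping only pairs whose similarity is <= cap.

    pairs   -- list of (label, similarity_vector)
    n_other -- number of OTHER pairs to draw; defaults to the number of
               minority pairs (assumption)
    """
    def pair_sim(vec):
        # Assumption: overall pair similarity is the mean of the vector.
        return sum(vec) / len(vec)

    minority = [p for p in pairs if p[0] in ("II", "IF&FF")]
    candidates = [p for p in pairs
                  if p[0] == "OTHER" and pair_sim(p[1]) <= cap]
    if n_other is None:
        n_other = len(minority)
    n_other = min(n_other, len(candidates))
    return minority + random.Random(seed).sample(candidates, n_other)
```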

Table 5.3 An example of training set construction.

07Q2 Total

5.3 Ensemble and Co-training based Method

In this section, we present our algorithm for detecting duplicate record pairs as well as follow-up linkage.

Our algorithm is a hybrid of ensemble and co-training learning strategies. The idea is motivated by the fact that we have to build learning models from a data set with very few labeled instances and a large amount of unlabeled data.

We adopt multiple, diverse base classifiers, which as a whole determine the class of a given unknown record pair. Meanwhile, our algorithm utilizes the unlabeled data set: record pairs that the initial base classifiers agree, with high commitment, to classify as II or IF&FF are added to the training set, and the initial models are retrained on the updated training set.

As shown in Figure 5.3, our algorithm proceeds in three stages. First, the training set L (with assigned labels) constructed as described in Section 5.2 is used to build the initial ensemble classifier C. Next, some of the unlabeled data, namely record pairs not identified as II or IF&FF whose similarity exceeds the threshold, are added to the original training set. The choice follows the principle of majority teaching minority. That is, a record pair from the unlabeled set is chosen if it is classified as II or IF&FF by more than half of the base classifiers C1, C2, …, Ct. We consider only the II and IF&FF classes because these two classes are relatively small compared with the OTHER class. Each chosen unlabeled record pair is given a pseudo label of II or IF&FF, depending on the prediction of C. Finally, the updated training set L* is used to build the final ensemble classifier C*.

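Figure 5.3 A conceptual depiction of the proposed ensemble and co-training based detection algorithm.

[Figure: stage 1 builds the base classifiers C1, …, Ct from L; stage 2 applies them to the record pairs of U selected by the similarity threshold, predicting OTHER, II, or IF&FF; stage 3 adds the pairs predicted as II or IF&FF to L and builds the final classifiers C*1, …, C*t, which form C*.]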

Algorithm 5.1 details our ensemble and co-training based detection method. The inputs include a labeled data set L, an unlabeled data set U, a commitment threshold α, and a similarity threshold θ. Let y1, ..., yk denote the possible labels; in this problem, k = 3, denoting three different classes. C1, ..., Ct are the base classifiers. The commitment threshold α denotes the minimum number of classifiers that must agree on the prediction. In our implementation, we chose four different representative classifiers, namely a Bayesian classifier, an instance-based classifier, a rule-based classifier, and a decision tree, and set α = 3.

The prediction of an unlabeled instance x by the ensemble of the base classifiers $C_1, \ldots, C_t$ is implemented in the following way. The prediction results of $C_1, \ldots, C_t$ on x are stored as a binary vector $V = (v_1, \ldots, v_k)$, with $v_i = 0$ or $1$ for $1 \le i \le k$. The $\delta(\cdot)$ function is a true-or-false indicator of its input statement. For example, if $C_i(x) = y_1$, then $\delta(C_i(x) = y_1) = 1$, and so $v_1 = 1$. Label y represents the most agreed prediction of x among the t classifiers. This is described in step 5 as

$$y = \arg\max_{y_j} \sum_{i=1}^{t} \delta(C_i(x) = y_j),$$

and m is used to keep the number of agreements on y. So x can be added into U’ only if y = II or IF&FF and m ≥ α (step 7), meaning that at least α classifiers agree on the prediction, which implements the concept of majority teaching minority.

Algorithm 5.1: An ensemble and co-training based duplicate detection method

Input: labeled data set L, unlabeled data set U, commitment threshold α, and similarity threshold θ.

Output: classifier C*.

Method:

1. U’ = ∅;
2. for 1 ≤ i ≤ t do
3.     build base classifier C_i on L;
4. for each unlabeled example x ∈ U and sim(x) ≥ θ do
       // predict the class y of x by C = Ensemble(C_1, ..., C_t);
5.     y = argmax_{y_j} Σ_{i=1}^{t} δ(C_i(x) = y_j);
6.     m = Σ_{i=1}^{t} δ(C_i(x) = y);
7.     if m ≥ α and (y = II or IF&FF) then
8.         U’ = U’ ∪ {x};
9. endfor
10. L = L ∪ U’;
11. for 1 ≤ i ≤ t do
12.     build new classifier C*_i on L;
13. return C* = Ensemble(C*_1, ..., C*_t);
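
The steps of Algorithm 5.1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the two toy classifiers `NearestCentroid` and `KNearest` merely stand in for the Weka base classifiers, all function and class names are hypothetical, and `pair_sim` (the function computing sim(x) for a similarity vector) is supplied by the caller. Each selected pair is stored together with its pseudo label, as described in the text.

```python
from collections import Counter

MINORITY = ("II", "IF&FF")


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


class NearestCentroid:
    """Toy stand-in for a base classifier: the closest class mean wins."""
    def fit(self, X, y):
        groups = {}
        for xi, yi in zip(X, y):
            groups.setdefault(yi, []).append(xi)
        self.centroids = {lab: [sum(c) / len(rows) for c in zip(*rows)]
                          for lab, rows in groups.items()}
        return self

    def predict(self, x):
        return min(self.centroids,
                   key=lambda lab: sq_dist(x, self.centroids[lab]))


class KNearest:
    """Toy instance-based classifier: majority label of the k nearest pairs."""
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.data = list(zip(X, y))
        return self

    def predict(self, x):
        nearest = sorted(self.data, key=lambda d: sq_dist(x, d[0]))[:self.k]
        return Counter(lab for _, lab in nearest).most_common(1)[0][0]


def vote(classifiers, x):
    """Steps 5-6: most agreed label y and its number of agreements m."""
    return Counter(c.predict(x) for c in classifiers).most_common(1)[0]


def ensemble_co_training(factories, X, y, U, alpha, theta, pair_sim):
    base = [make().fit(X, y) for make in factories]        # steps 2-3
    added = []                                             # U'
    for x in U:                                            # step 4
        if pair_sim(x) < theta:
            continue
        label, m = vote(base, x)                           # steps 5-6
        if m >= alpha and label in MINORITY:               # step 7
            added.append((x, label))                       # step 8
    X2 = X + [x for x, _ in added]                         # step 10
    y2 = y + [lab for _, lab in added]
    final = [make().fit(X2, y2) for make in factories]     # steps 11-12
    return final, added                                    # step 13
```

The final ensemble classifies a new pair with `vote(final, x)[0]`. Passing classifier factories rather than fitted objects keeps the initial and final ensembles independent of each other.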

Chapter 6 Experiment

We conducted two experiments to evaluate the effectiveness and performance of our method. The first experiment focused on the accuracy of our method, compared with four representative classifiers. The second experiment inspected how removing identified duplicates and merging follow-ups affects the signals generated by some ADR detection methods. This chapter presents the results of these two experiments and our discussion.

6.1 Correctness Analysis of Duplicate Detection

A recent work by Cagliero and Garza [8] provides a comprehensive comparison of contemporary classification methods. Based on their study, we selected for our experiment four representatives with relatively high accuracy from different categories of classifiers: a Bayesian classifier (Bayesian Network [29][41]), an instance-based classifier (IBk [3]), a rule-based classifier (JRip [9]), and a decision tree (J48 [36]), all of which are available in the Weka package. We also implemented an ensemble of these four classifiers, serving as an additional comparator to our algorithm.

We chose seven quarters of datasets from FAERS, described as follows:

• 05Q4: 139 labeled and 140 unlabeled instances.

• 06Q3: 255 labeled and 255 unlabeled instances.

• 07Q2: 180 labeled and 180 unlabeled instances.

• 08Q1: 196 labeled and 197 unlabeled instances.

• 09Q4: 300 labeled and 300 unlabeled instances.

• 10Q3: 318 labeled and 318 unlabeled instances.

• 11Q2: 624 labeled and 624 unlabeled instances.

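Four commonly used measures were adopted in this experiment: accuracy, precision, recall, and F-measure. Below are the detailed definitions, where, for a given class, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

$$\text{accuracy} = \frac{\text{number of correctly classified instances}}{\text{total number of instances}}$$

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad \text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$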

As an illustration of computing these measures, we show in Table 6.1 the confusion matrix yielded by executing Bayesian Network on dataset 07Q2. One can observe that 129 instances are correctly classified and 51 are misclassified. The accuracy of Bayesian Network on this dataset is (19 + 61 + 49) / 180 = 0.7167, and the precision of class IF&FF is 19 / (19 + 9 + 1) = 0.6552.

Table 6.1 07Q2 class Confusion Matrix by Bayesian Network.

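Confusion matrix (rows: actual class; columns: classified as)

| Actual class | a | b | c |
| --- | --- | --- | --- |
| a = IF&FF | 19 | 39 | 2 |
| b = II | 9 | 61 | 0 |
| c = OTHER | 1 | 0 | 49 |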
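
The measures can be recomputed from the confusion matrix with a short script. The sketch below assumes the usual convention that rows are actual classes and columns are predicted classes, which is consistent with the precision computed in the text; the function name `evaluate` is hypothetical.

```python
def evaluate(matrix, labels):
    """Accuracy plus per-class precision, recall and F-measure.

    matrix[i][j] -- number of instances of actual class i classified as j
    """
    n = len(labels)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    per_class = {}
    for i, lab in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)   # TP + FP (column sum)
        actual = sum(matrix[i])                     # TP + FN (row sum)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        per_class[lab] = (precision, recall, f_measure)
    return accuracy, per_class


table_6_1 = [[19, 39, 2], [9, 61, 0], [1, 0, 49]]
accuracy, per_class = evaluate(table_6_1, ["IF&FF", "II", "OTHER"])
```

For Table 6.1 this gives an accuracy of 129 / 180 ≈ 0.7167 and, for class IF&FF, a precision of 19 / 29 ≈ 0.6552 and a recall of 19 / 60 ≈ 0.3167.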

Table 6.2 presents the results of all measures, including accuracy, precision, recall, and F-measure, on the 07Q2 dataset. An accuracy of 100% means that the predicted labels are exactly the same as the given labels. One can observe that our
