
Inspired by the recent success of the MapReduce computing framework on processing large data sets and computation-intensive work, some researchers have used the MapReduce framework to build OLAP data cubes, including Sergey and Yurk [37], You et al. [42], Nandi et al. [33], [34], and Lee et al. [29]. One of the prevailing methods is the CubeGen method proposed by Wang et al. [41].

Method CubeGen splits the cube lattice in a way similar to Agarwal et al.'s method and assigns a MapReduce job to handle each pipeline batch. The Mapper function chooses a dimension that exists in all cuboids to be computed in the pipeline and assigns that dimension as the output key, with the other dimensional values as the output value. The shuffle function then sends all outputs with the same key to the corresponding reducer. Each reducer is responsible for generating the final aggregation for each cuboid of the assigned pipeline batch. As shown in Figure 3.3, for each key value, e.g., s0, the reducer omits one dimensional value, e.g., d0, and groups all values in the value list to obtain the aggregated count. In this example, it yields (-, s0, r1) = 2 and (-, s0, r2) = 1. The process continues, omitting more dimensional values to obtain the aggregation results for the other cuboids.
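The mapper-shuffle-reduce flow described above can be sketched in Python. This is only a simulation using the toy values of Figure 3.3; the in-memory shuffle, the two-character dimension values, and the function names are assumptions for illustration, not the actual Hadoop implementation of CubeGen:

```python
from collections import defaultdict

# Toy records (Demo, Sex, Reaction) from Figure 3.3; each value is two characters.
records = [("d0", "s0", "r1"), ("d0", "s0", "r1"), ("d0", "s1", "r1"), ("d0", "s0", "r2")]

def mapper(demo, sex, reaction):
    # Sex appears in every cuboid of the batch, so it becomes the key;
    # the remaining dimension values are concatenated into the value.
    yield sex, demo + reaction

def reduce_batch(key, values):
    """Aggregate one key's value list for both cuboids of the pipeline batch."""
    counts = {}
    by_reaction = defaultdict(int)
    for v in values:
        by_reaction[v[2:]] += 1            # omit the Demo value (first 2 chars)
    for r, c in by_reaction.items():
        counts[("-", key, r)] = c          # cuboid (-, Sex, Reaction)
    counts[("-", key, "-")] = len(values)  # cuboid (-, Sex, -)
    return counts

# Simulate the shuffle phase: group mapper outputs by key.
shuffled = defaultdict(list)
for rec in records:
    for k, v in mapper(*rec):
        shuffled[k].append(v)

cube = {}
for k, vs in shuffled.items():
    cube.update(reduce_batch(k, vs))
```

Running this yields the same aggregates as the figure: the reducer for key s0 computes both (-, s0, r1) = 2 and (-, s0, -) = 3 from a single pass over its value list, which is the point of batching the two cuboids into one pipeline.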

Figure 3.3 An example of CubeGen for computing the pipeline batch consisting of (-, Sex, Reaction) and (-, Sex, -).

Mapper output:  <s0, d0r1>, <s0, d0r1>, <s1, d0r1>, <s0, d0r2>
After shuffle:  <s0, [d0r1, d0r1, d0r2]>, <s1, [d0r1]>
Reducer output for (-, Sex, Reaction): (-, s0, r1) = 2, (-, s0, r2) = 1, (-, s1, r1) = 1
Reducer output for (-, Sex, -): (-, s0, -) = 3, (-, s1, -) = 1


The Proposed MapReduce-based Contingency Cube Computation

In this chapter, we present our proposed MapReduce-based methods for computing the ADR contingency cube. We use the open data of the U.S. FDA Adverse Event Reporting System (FAERS) as the data source. Unfortunately, this data is not clean and contains many inconsistencies, so Section 4.1 first explains how we preprocess the data before building the contingency cube.

4.1 Data Preprocessing

The main problems of the FAERS data are duplicate reporting and non-standardized drug names. The former refers to the existence of duplicate records that correspond to the same ADR event, caused by different reporting sources and follow-up reviews of the event. The latter means that drug names are not standardized; different names may refer to the same drug. The reasons are complicated, so the detailed process for normalizing drug names is beyond the scope of this thesis; we give only a brief description. Interested readers can refer to [27] for the details.

The FAERS database releases new reports quarterly. Each time new quarterly data is released, the preprocessing workflow shown in Figure 4.1 is executed to generate the prepared data. In the following, we describe each of these steps.


Figure 4.1 The preprocessing workflow for generating the prepared data.

1. Data selection: The whole FAERS data consists of seven tables, as shown in Figure 4.2, but only some attributes, including Sex, Age, Drug, and PT (preferred term for symptom), are essential for ADR analysis. Figure 4.3 shows the essential part of the schema of the FAERS dataset.

FAERS → 1. Data selection → 2. Deduplication → 3. Drug name standardization → 4. Transformation → Prepared text file


Figure 4.2 The FAERS ADR data schema.

Figure 4.3 A depiction of the part of the FAERS schema needed for computation.


2. Deduplication: In the FAERS system, each patient's ADR event has a unique code, the CASEID. Since the same case may be reported many times, each reporting record is identified by a unique PRIMARYID. For each group of records that share the same CASEID, we keep only one record: the one with the latest FDA_DT (the date the report was received by the FDA).
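This deduplication rule can be sketched as a simple pass over the reports. The dictionary-based implementation and the sample PRIMARYID/CASEID values below are illustrative assumptions; in FAERS, FDA_DT is a YYYYMMDD string, so lexicographic comparison matches date order:

```python
def deduplicate(rows):
    """Keep, for each CASEID, only the report with the latest FDA_DT."""
    latest = {}
    for row in rows:
        case = row["CASEID"]
        # Replace the stored report if this one has a later FDA_DT.
        if case not in latest or row["FDA_DT"] > latest[case]["FDA_DT"]:
            latest[case] = row
    return sorted(latest.values(), key=lambda r: r["CASEID"])

reports = [
    {"PRIMARYID": "1001", "CASEID": "C1", "FDA_DT": "20230101"},
    {"PRIMARYID": "1002", "CASEID": "C1", "FDA_DT": "20230401"},  # follow-up report
    {"PRIMARYID": "2001", "CASEID": "C2", "FDA_DT": "20230215"},
]
deduped = deduplicate(reports)
```

Here the follow-up report 1002 supersedes 1001 for case C1, so only one record per CASEID survives.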

3. Drug name standardization: For various reasons, a drug may have many different names, so we need to normalize drug names before counting records. The U.S. National Library of Medicine (NLM), the largest biomedical library in the world, provides RxNorm [9], a normalized naming service, which we use to transform the DRUGNAME attribute into a unique identifier, the RXCUI.

4. Transformation: This step performs attribute transformation and stores the output data prepared for cube computation. First, we discretize the original numerical Age attribute into ten categorical ranges, as shown in Table 4.1. Next, because the FAERS database is relational, all the Drugs and PTs of a single event are normalized into separate records stored in different tables. Direct aggregation over such data may cause duplicate counting of event occurrences, so we join these records into a single record per event.

The prepared data is stored as a text file with four attributes, AGE, SEX, DRUG, and PT, delimited by the pound sign (#). Multiple values of the DRUG and PT attributes are delimited by the dollar sign ($). Figure 4.4 shows a simple example. This format also compresses the data for our MapReduce-based computation: as shown in Table 4.2, it reduces both the size and the number of records by about 90 percent.
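The packing format can be sketched as a pair of helper functions. The function names `pack` and `unpack` are hypothetical; only the `#` and `$` delimiters come from the format described above:

```python
def pack(age, sex, drugs, pts):
    """Pack one event into a single line: attributes joined by '#',
    multi-valued DRUG/PT lists joined by '$'."""
    return "#".join([age, sex, "$".join(drugs), "$".join(pts)])

def unpack(line):
    """Inverse of pack: recover the attribute values and value lists."""
    age, sex, drugs, pts = line.split("#")
    return age, sex, drugs.split("$"), pts.split("$")

# The example of Figure 4.4:
line = pack("a1", "s1", ["d1", "d2"], ["p1", "p2"])  # "a1#s1#d1$d2#p1$p2"
```

Because all DRUG and PT rows of an event collapse into one line, each event is counted exactly once, which is what removes the duplicate-counting problem of the relational layout.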


Table 4.1 Discretization of the AGE attribute.

Tag  Range            Meaning
0    -                Source value is missing or problematic, e.g., not numerical or larger than 116 years.
1    0                Infant, Newborn
2    1 to 23 months   Infant
3    2 to 5 years     Child, Preschool
4    6 to 12 years    Child
5    13 to 18 years   Adolescent
6    19 to 24 years   Young Adult
7    25 to 44 years   Adult
8    45 to 64 years   Middle Aged
9    65 to 79 years   Aged
10   over 80 years    Aged+
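The discretization of Table 4.1 can be sketched as a small mapping function. This is an illustrative assumption: the function name and the exact boundary handling (fractional ages, the 116-year cutoff) are not specified by the table:

```python
def age_tag(age):
    """Map a numerical age in years to a Table 4.1 tag (a sketch;
    boundary handling is an assumption)."""
    if not isinstance(age, (int, float)) or age < 0 or age > 116:
        return 0                  # missing or problematic source value
    if age < 1 / 12:  return 1    # newborn (reported age 0)
    if age < 2:       return 2    # 1 to 23 months
    if age <= 5:      return 3    # 2 to 5 years
    if age <= 12:     return 4    # 6 to 12 years
    if age <= 18:     return 5    # 13 to 18 years
    if age <= 24:     return 6    # 19 to 24 years
    if age <= 44:     return 7    # 25 to 44 years
    if age <= 64:     return 8    # 45 to 64 years
    if age <= 79:     return 9    # 65 to 79 years
    return 10                     # over 80 years (Aged+)
```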

Data file                     AGE  SEX  DRUG    PT
Cleaned data                  a1   s1   d1, d2  p1, p2
Source data for MapReduce     a1#s1#d1$d2#p1$p2

Figure 4.4 An example showing the data format for prepared text file.


Table 4.2 Number of records before and after packing.

          Size    Number of records
Unpacked  1.3 GB  49,540,834
Packed    150 MB  3,814,472

4.2 Problem of ADR Contingency Cube
