Formal Definition - Problem Description - Detection of Data Groups Prone to Specific Disguised Missing Values

Chapter 4 Problem Description

4.2 Formal Definition

Following the notation used in Chapter 3, let T~ denote the recorded table of T with attributes A = {A₁, A₂, …, A_n}, and let Dom(A_i) be the set of values of attribute A_i, 1 ≤ i ≤ n. Given a suspected disguise value v, where v ∈ Dom(D) and D ∈ A, we would like to discover, if v is indeed a disguise value, the data group of T~ that is most prone to using v as a disguise value. To facilitate the discussion, we first formalize the term data group.

Definition 4.1. A data group (G1 = g1, G2 = g2, …, Gp = gp) defined on an attribute subset {G1, G2, …, Gp} ⊆ A identifies the projection of T~ on G1 = g1, G2 = g2, …, Gp = gp. That is, the tuples in group (G1 = g1, G2 = g2, …, Gp = gp) all have the same values on attributes G1, G2, …, Gp. Hereafter, when it is clear from the context, we write (g1, g2, …, gp) instead of (G1 = g1, G2 = g2, …, Gp = gp).
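As an illustration of Definition 4.1, the following Python sketch projects a small table on a data group and then on a suspected disguise value. The list-of-dictionaries table, the function names `project_group` and `project_value`, and the toy rows are assumptions made for this sketch; they are not taken from the thesis implementation.

```python
from typing import Dict, List

Row = Dict[str, str]


def project_group(table: List[Row], group: Dict[str, str]) -> List[Row]:
    """Return the tuples that match the group on every grouping attribute."""
    return [row for row in table
            if all(row[attr] == val for attr, val in group.items())]


def project_value(table: List[Row], attribute: str, value: str) -> List[Row]:
    """Return the tuples that carry `value` on `attribute`."""
    return [row for row in table if row[attribute] == value]


# Toy rows (hypothetical, not copied from Table 4.1).
toy_table = [
    {"Gender": "Male", "Marital Status": "Married", "Literacy": "Literate"},
    {"Gender": "Male", "Marital Status": "Single", "Literacy": "Literate"},
    {"Gender": "Female", "Marital Status": "Married", "Literacy": "Illiterate"},
    {"Gender": "Male", "Marital Status": "Married", "Literacy": "Illiterate"},
]

# Projection on the group (Marital Status = "Married"); plays the role of T~_g.
t_g = project_group(toy_table, {"Marital Status": "Married"})
# Further projection on the suspected disguise value; plays the role of T~_{v,g}.
t_vg = project_value(t_g, "Gender", "Male")
# The empty group performs no projection at all.
t_all = project_group(toy_table, {})
```

With the empty group, `project_group` returns the whole table, which mirrors the special case in which no projection is performed.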

Now let G denote the set of all data groups induced by the attribute set A – {D}.

Note that we need at least one attribute other than the grouping attributes and the disguise attribute D to perform the EUS procedure. It is noteworthy that the empty group means that no projection is performed on the original table T~, so this case corresponds to the discovery of the maximal embedded unbiased sample on T~_v. In this sense, the problem discussed in [6] can be regarded as a special case of our problem.

Based on the concept of the maximal embedded unbiased sample, we can formalize the problem of detecting the data group most prone to the specific disguise value v as finding the best data group g* in G that maximizes Eq. (3.5). Since the search for the maximal embedded unbiased sample is performed on the projection of T~ on v associated with group g, i.e., T~_{v,g}, instead of T~_v, the problem is now formalized as



This value, however, is proportional to the cardinality of the table of concern: the larger the cardinality (number of value pairs) of table T~, the larger this value is. In order not to favor larger projections of T~, the value is normalized. Similarly, we introduce the normalized DV-score of v in T~_g, denoted as ndv(v, T~_g), which is given in Eq. (4.3).

The complexity of finding g* is immense. Let m_i be the cardinality of attribute A_i in T~, 1 ≤ i ≤ n. Without loss of generality, we choose A_1 as the suspected attribute D.

Each attribute A_j, 2 ≤ j ≤ n, can take one of m_j different values if it is involved in forming the data group, or the empty value if it is not, leading to at most (m_2 + 1) × (m_3 + 1) × … × (m_n + 1) different data groups. Note, however, that at least one attribute has to be excluded from the data group, so we have to discount all the cases in which all attributes are involved. The number of all possible data groups induced by the set {A_2, A_3, …, A_n} is therefore

\[ \prod_{j=2}^{n} (m_j + 1) \; - \; \prod_{j=2}^{n} m_j . \]
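The counting argument above can be checked with a few lines of Python. This is only a sketch: the function name `count_data_groups` is hypothetical, and the input is assumed to be the list of cardinalities of the non-disguise attributes.

```python
from math import prod
from typing import Sequence


def count_data_groups(cardinalities: Sequence[int]) -> int:
    """Number of data groups induced by the non-disguise attributes.

    Each attribute contributes m_j values plus the "not involved" option;
    the groups that involve every attribute are subtracted afterwards.
    """
    with_empty_option = prod(m + 1 for m in cardinalities)
    all_attributes_involved = prod(cardinalities)
    return with_empty_option - all_attributes_involved


# Three grouping attributes with two values each: 3^3 - 2^3 groups.
example_count = count_data_groups([2, 2, 2])
```

For three attributes with two values each, `example_count` evaluates to 27 − 8 = 19.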

Example 4.1. Let us consider Table 4.1. Suppose we choose “Male” on “Gender” as the suspected disguise value v. Then the number of data groups induced by attributes “Marital Status”, “Literacy”, and “Education” is (|Dom(Marital Status)| + 1) × (|Dom(Literacy)| + 1) × (|Dom(Education)| + 1) – (|Dom(Marital Status)| × |Dom(Literacy)| × |Dom(Education)|) = 3³ – 2³ = 19. Specifically, let us consider the group defined on Marital Status = “Married”. Table 4.2 shows the resulting projection T~_married on this group, wherein the shaded part (the rows with Gender = “Male”) corresponds to the further projection on “Male”, say T~_{male,married}. According to Eq. (4.3), we have to find the maximal subset of T~_{male,married} that resembles (an unbiased sample of) T~_married. This process continues for the projections defined on all other groups to determine the best data group g*.

Table 4.1 An example dataset.

| Gender | Marital Status | Literacy | Education |
|---|---|---|---|
| Male | Married | Literate | High school |
| Male | Single | Literate | High school |
| Male | Married | Illiterate | High school |
| Male | Single | Illiterate | High school |
| Male | Married | Literate | College |
| Male | Single | Literate | College |
| Male | Married | Illiterate | College |
| Male | Single | Illiterate | College |
| Male | Married | Literate | High school |
| Male | Single | Literate | High school |
| Female | Married | Illiterate | High school |
| Female | Single | Illiterate | High school |
| Female | Married | Literate | College |
| Female | Single | Literate | College |
| Female | Married | Illiterate | College |
| Female | Single | Illiterate | College |
| Male | Married | Literate | High school |
| Female | Single | Literate | College |
| Female | Married | Illiterate | College |
| Female | Single | Illiterate | High school |

Table 4.2 The resulting projection of Table 4.1 on “Married”.

| Gender | Marital Status | Literacy | Education |
|---|---|---|---|
| Male | Married | Literate | High school |
| Male | Married | Illiterate | High school |
| Male | Married | Literate | College |
| Male | Married | Illiterate | College |
| Male | Married | Literate | High school |
| Female | Married | Illiterate | High school |
| Female | Married | Literate | College |
| Female | Married | Illiterate | College |

Chapter 5 The Proposed GA-based Detection Method

In this chapter, we introduce a genetic-algorithm-based method for detecting the data group most prone to a specific disguise value. In Section 5.1, we first describe the general framework of our approach. The subsequent sections then detail the main components individually: the chromosome representation, the crossover, mutation, and selection operations, and the fitness function used to evaluate a chromosome. Candidate pruning conditions are presented in Section 5.5.

5.1 General Framework

Figure 5.1 shows the general framework of our proposed genetic-algorithm-based (GA-based) method. The input of our algorithm is a recorded table T~ and a suspected disguise value v for which we intend to detect the most prone data group.

Input: A table T~ and a suspected disguise value v
Output: The chromosome with the best degree of fitness
Method:
    Initialize the parameters;
    Generate a population P randomly;
    generation ← 1;
    while generation ≤ max_gener do
        Clear the new population P’;
        Evaluate the fitness of each individual in P;
        while |P’| < population_size do
            Select two parents from P;
            Perform crossover operation;
            Perform mutation operation;
            Put the offspring into P’;
        endwhile
        P ← P’;
        generation ← generation + 1;
    endwhile

Figure 5.1 A general framework of the GA-based method.

5.2 Chromosome Representation

The first and most important step in a GA is the chromosome representation, which is an encoding of a possible solution to the problem. In our study, we encode each solution as a vector of non-repeated decimal integers. A non-zero integer indicates that the corresponding attribute value is used for forming the data group, while a zero indicates that the attribute is not included in the group, i.e., it is used for evaluating the degree of fitness with the EUS heuristic.

For example, consider a four-attribute table T, whose attribute A₁ contains two … We choose v11 on A1 and v32 on A3 for grouping and leave A2 and A4 as the attributes for evaluating the degree of fitness of each chromosome. The chromosome can be represented as shown in Figure 5.2.

Figure 5.2 An example for chromosome representation.
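The encoding described in this section can be sketched in Python as follows. The sketch assumes that a non-zero gene is a 1-based index into the value list of its attribute; the function name `decode_chromosome` and the domain lists are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Sequence, Tuple


def decode_chromosome(chromosome: Sequence[int],
                      attributes: Sequence[str],
                      domains: Dict[str, List[str]]
                      ) -> Tuple[Dict[str, str], List[str]]:
    """Split a chromosome into a data group and the evaluation attributes.

    Assumption: gene k > 0 selects the k-th value (1-based) of the
    attribute's domain; gene 0 leaves the attribute for fitness evaluation.
    """
    group: Dict[str, str] = {}
    evaluation_attributes: List[str] = []
    for gene, attr in zip(chromosome, attributes):
        if gene == 0:
            evaluation_attributes.append(attr)
        else:
            group[attr] = domains[attr][gene - 1]
    return group, evaluation_attributes


# Hypothetical domains for a four-attribute table.
attributes = ["A1", "A2", "A3", "A4"]
domains = {
    "A1": ["v11", "v12"],
    "A2": ["v21", "v22"],
    "A3": ["v31", "v32", "v33"],
    "A4": ["v41", "v42"],
}
group, evaluation_attributes = decode_chromosome([1, 0, 2, 0], attributes, domains)
```

Under these assumptions, the chromosome (1, 0, 2, 0) groups on v11 of A1 and v32 of A3 and leaves A2 and A4 for evaluating the fitness.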

5.3 Evolutionary Operations

Before the evolutionary operations are applied, parent chromosomes must be chosen from the population; this step is called selection. According to the evolution principle, choosing the chromosomes with a higher degree of fitness can generate a better population. However, this approach restricts the possible solutions and may therefore reduce the diversity of the population: the population converges too quickly and may fail to find the optimal solution. In this study, we adopted the tournament selection method described by Mitchell [10] in 1996, which randomly chooses two chromosomes from the current population and draws a random number r between 0 and 1 that is compared with a predefined value, usually set to 0.75. If r is less than this value, we choose the chromosome with the higher fitness value; otherwise, we choose the chromosome with the lower degree of fitness. A chromosome that has been selected can be selected again later. We also adopted the elitism principle described by Mitchell [10], i.e., the best chromosome is preserved in the new population.

The crossover operation generates offspring in GAs by exchanging parts of the chromosomes of two parents chosen from the population. Our method adopts one-point crossover, one of the most common crossover operations. It works by first selecting a crossover point at random, dividing both parents at this point, and then exchanging the gene sequences after the point to form the offspring. The operation is shown in Figure 5.3. For example, crossing the parents (1, 0, 7, 0) and (2, 4, 0, 0) after the first gene yields the offspring (1, 4, 0, 0) and (2, 0, 7, 0).

Figure 5.3 One-point crossover.

The mutation operation is used to increase genetic diversity. In our method, the position for mutation is selected randomly, and the value of the selected gene mutates as follows. If the gene is zero, it is changed to a random non-zero integer. If the gene is non-zero, it is changed to another integer, which may be zero.
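The three operations of this section can be sketched in Python as shown below. This is a minimal illustration under stated assumptions, not the thesis implementation (which was written in C#): the function names, the `domain_sizes` list giving the number of values per gene position, and the use of `random.Random` are all choices made for this sketch.

```python
import random
from typing import List, Sequence, Tuple

Chromosome = List[int]


def tournament_select(population: Sequence[Chromosome],
                      fitness: Sequence[float],
                      rng: random.Random,
                      threshold: float = 0.75) -> Chromosome:
    """Draw two individuals; keep the fitter one if r < threshold, else the other."""
    a, b = rng.sample(range(len(population)), 2)
    better, worse = (a, b) if fitness[a] >= fitness[b] else (b, a)
    chosen = better if rng.random() < threshold else worse
    return list(population[chosen])


def one_point_crossover(p1: Chromosome, p2: Chromosome,
                        point: int) -> Tuple[Chromosome, Chromosome]:
    """Exchange the gene sequences that follow the crossover point."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


def mutate(chromosome: Chromosome, domain_sizes: Sequence[int],
           rng: random.Random) -> Chromosome:
    """Mutate one randomly chosen gene.

    A zero gene becomes a random non-zero value; a non-zero gene becomes
    any other value of its attribute, possibly zero.
    """
    child = list(chromosome)
    pos = rng.randrange(len(child))
    if child[pos] == 0:
        child[pos] = rng.randint(1, domain_sizes[pos])
    else:
        others = [g for g in range(domain_sizes[pos] + 1) if g != child[pos]]
        child[pos] = rng.choice(others)
    return child


# Cross two four-gene chromosomes after the first gene.
child_a, child_b = one_point_crossover([1, 0, 7, 0], [2, 4, 0, 0], point=1)
```

In a full GA loop, the crossover point would itself be drawn at random and the mutation would be applied with the configured mutation probability; both are left out here to keep the sketch short.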

5.4 Fitness Function

Intuitively, we can adopt the normalized DV-score ndv(v, T~_g) described in Eq. (4.3) as the fitness function to measure the possibility that v is used as a disguise in the projection T~_g induced by the data group g represented by the chromosome. Note that ndv(v, T~_g) requires computing the normalized CBSQS for each subset U of T~_{v,g}, which takes computation proportional to the number of different attribute value pairs in U. Although it is not easy to reduce the complexity of the normalized CBSQS, we can simplify the denominator term to C(k, 2), where k denotes the number of attributes in T~ that serve neither as the disguise attribute nor as grouping attributes.

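Lemma 5.1. Consider a subset U of the projected table T~_{v,g} and any two attributes of U, say A_p and A_q. Intuitively, the total probability of the value pairs from any two attributes should be equal to 1. Therefore, we have

\[ \sum_{x \in Dom(A_p)} \sum_{y \in Dom(A_q)} P(x, y) = 1, \]

where P(x, y) denotes the probability of the value pair (x, y) in U.

Since U consists of k attributes, selecting two of these k attributes at a time yields C(k, 2) combinations in total, so the probabilities summed over all attribute pairs add up to C(k, 2). It follows that the fitness function for evaluating a chromosome in T~_g is defined using the simplified normalized DV-score, in which the denominator term is replaced by C(k, 2).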
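The counting argument of Lemma 5.1 can be verified numerically. The Python sketch below assumes that the probability of a value pair is its relative frequency in U; the function name `total_pair_probability` and the toy rows are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb
from typing import Dict, List, Sequence


def total_pair_probability(rows: List[Dict[str, str]],
                           attributes: Sequence[str]) -> Fraction:
    """Sum the relative frequencies of all value pairs over all attribute pairs."""
    n = len(rows)
    total = Fraction(0)
    for a_p, a_q in combinations(attributes, 2):
        pair_counts = Counter((row[a_p], row[a_q]) for row in rows)
        # For one attribute pair the frequencies sum to exactly 1.
        total += sum(Fraction(count, n) for count in pair_counts.values())
    return total


# Toy subset U with k = 3 evaluation attributes (hypothetical values).
toy_u = [
    {"Literacy": "Literate", "Education": "College", "Marital": "Married"},
    {"Literacy": "Literate", "Education": "High school", "Marital": "Single"},
    {"Literacy": "Illiterate", "Education": "College", "Marital": "Married"},
    {"Literacy": "Illiterate", "Education": "College", "Marital": "Single"},
]
toy_attributes = ["Literacy", "Education", "Marital"]
total = total_pair_probability(toy_u, toy_attributes)
```

For k = 3 the total equals C(3, 2) = 3 regardless of the values in U, which is why the denominator can be replaced by a constant.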



5.5 Candidate Pruning

As shown in Section 4.2, the search space of candidate data groups is on the order of O(m^(n−2)), an exponential function of m and n. To avoid unnecessary exploration of the search space, we developed several optimization techniques to prune unqualified candidates: attribute-based pruning, value-based pruning, record-based pruning, and hierarchy-based pruning.

Optimization 1 (attribute-based pruning): Any data group with cardinality larger than n-2 should be pruned, where n is the number of attributes in T~. This is because the EUS-based fitness function requires at least two attributes that are neither grouping attributes nor the disguise attribute in order to calculate the correlation between value pairs, as illustrated in Eq. (4.2). This means that any chromosome with fewer than two zero genes should not be generated. In our approach, we first make sure that the chromosomes generated in the initial population are qualified, and then enforce this rule in the crossover and mutation operations. Specifically, if any offspring generated by a crossover contains fewer than two zero genes, the operation is regarded as a failed mating and we carry the parents over to the next generation.

Similarly, if the chromosome undergoing mutation contains exactly two zero genes, the mutation operation selects a non-zero gene and changes its value.

Optimization 2 (value-based pruning): Any data group resulting in a projection T~_g that contains only one value on the disguise attribute D should be excluded, no matter whether that value is v or not. If T~_g contains only one value on attribute D, then T~_{v,g} will be empty (if the value is not v) or equal to T~_g (if the value is v), and in both cases the correlation computation is meaningless. In our approach, we ensure that the initial population excludes such candidates, and we assign a very bad fitness to any chromosome generated by a crossover or mutation operation that results in only one value on attribute D.

Optimization 3 (record-based pruning): Any data group resulting in a projection T~_g that contains too few records will be pruned. This is because a smaller subset tends to represent the original dataset poorly. Therefore, we avoid creating such candidates when generating the initial population, and we assign a very bad fitness to any chromosome that exhibits this problem after crossover or mutation.
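The first three optimizations amount to simple checks on a chromosome and on the projection it induces. The Python sketch below illustrates them under stated assumptions: the function names are hypothetical, the projection is a list of dictionaries, and `min_records` is a placeholder threshold because the text does not state the value that was used.

```python
from typing import Dict, List, Sequence

Row = Dict[str, str]


def has_enough_evaluation_attributes(chromosome: Sequence[int],
                                     min_zero_genes: int = 2) -> bool:
    """Attribute-based check: at least two genes must be zero."""
    return sum(1 for gene in chromosome if gene == 0) >= min_zero_genes


def has_multiple_disguise_values(projection: List[Row],
                                 disguise_attribute: str) -> bool:
    """Value-based check: the projection must hold more than one value on D."""
    return len({row[disguise_attribute] for row in projection}) > 1


def has_enough_records(projection: List[Row], min_records: int) -> bool:
    """Record-based check: the projection must not be too small."""
    return len(projection) >= min_records


def is_qualified(chromosome: Sequence[int], projection: List[Row],
                 disguise_attribute: str, min_records: int) -> bool:
    """Combine the three checks; unqualified candidates get a very bad fitness."""
    return (has_enough_evaluation_attributes(chromosome)
            and has_multiple_disguise_values(projection, disguise_attribute)
            and has_enough_records(projection, min_records))
```

A fitness routine could call `is_qualified` first and return a very low score when it fails, which avoids the costly correlation computation for pruned candidates.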

Optimization 4 (hierarchy-based pruning): This optimization prunes candidates by exploiting the hierarchy information existing between attribute values. Take the school information shown in Figure 5.4 as an example. The department of “computer science” is a descendant of the college of engineering. In the context of relational databases, there is a functional dependency between attributes “College” and “Department”, i.e., “College” is functionally dependent on “Department”: “Department” → “College”.

University: NUK, NTU
College: Engineering, Management
Department: CS, EE, … (Engineering); MS, FM, … (Management)

Figure 5.4 A hierarchical relation of school information.

We utilize this information to devise another type of pruning. Consider a group g = (g1, g2, …, gp). If there exist two values gi and gj such that gi is a descendant of gj in the value hierarchy, then g can be pruned and replaced by g’ = g – {gj}, i.e., g’ = (g1, g2, …, gj-1, gj+1, …, gp). This is because the resulting projections T~_g and T~_{g’} have exactly the same tuple values in every non-grouping attribute.
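The replacement of g by g’ can be sketched in Python as follows. The sketch assumes that the value hierarchy is a tree given as a child-to-parent dictionary and that group values are plain strings; `parent_of`, `is_descendant`, and `drop_redundant_ancestors` are hypothetical names, and the mapping covers only the college and department levels.

```python
from typing import Dict, List


def is_descendant(value: str, ancestor: str, parent_of: Dict[str, str]) -> bool:
    """Walk up the hierarchy from `value` and look for `ancestor`."""
    current = parent_of.get(value)
    while current is not None:
        if current == ancestor:
            return True
        current = parent_of.get(current)
    return False


def drop_redundant_ancestors(group: List[str],
                             parent_of: Dict[str, str]) -> List[str]:
    """Remove every value that is an ancestor of another value in the group."""
    return [g for g in group
            if not any(is_descendant(other, g, parent_of)
                       for other in group if other != g)]


# Child -> parent links for the college and department levels (assumed tree).
parent_of = {
    "CS": "Engineering",
    "EE": "Engineering",
    "MS": "Management",
    "FM": "Management",
}
reduced = drop_redundant_ancestors(["Engineering", "CS"], parent_of)
```

Here `reduced` keeps only "CS", since grouping on the department already fixes the college.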

Chapter 6 Experiments and Analysis

In this chapter, we describe the experiments performed to evaluate the proposed GA-based method. All experiments were conducted on a personal computer running the Microsoft Windows 7 Professional operating system, with an Intel Core i7-2600 3.4 GHz CPU, 8 GB of main memory, and a 500 GB hard disk. We used Microsoft SQL Server 2008 R2 as the database system, and all programs were coded in C#.

The order q in Eq. (3.2) was set to 1. The following parameter settings were used in our method.

max generation: 100
population size: 30
crossover probability: 0.75
mutation probability: 0.033

Our experiments consist of two parts: the first part focuses on the performance of our method, and the second part examines the correctness of the solution. To evaluate the efficiency and the correctness of our GA-based method, we compared it with an exhaustive method that simply evaluates all possible candidates in the set G to find the best solution.
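An exhaustive baseline of this kind can be sketched in Python by enumerating every chromosome. The sketch is an assumption about how such a baseline might look, not the code used in the experiments: the function names are hypothetical, `domain_sizes` lists the number of values of each non-disguise attribute, and `fitness` is any caller-supplied scoring function.

```python
from itertools import product
from typing import Callable, Iterator, Sequence, Tuple

Genes = Tuple[int, ...]


def enumerate_chromosomes(domain_sizes: Sequence[int],
                          min_zero_genes: int = 1) -> Iterator[Genes]:
    """Yield every chromosome with at least `min_zero_genes` zero genes.

    Gene value 0 means "attribute not used for grouping"; values 1..m select
    one of the m values of the attribute.
    """
    for genes in product(*(range(m + 1) for m in domain_sizes)):
        if genes.count(0) >= min_zero_genes:
            yield genes


def exhaustive_search(domain_sizes: Sequence[int],
                      fitness: Callable[[Genes], float],
                      min_zero_genes: int = 1) -> Genes:
    """Return the chromosome with the highest fitness among all candidates."""
    return max(enumerate_chromosomes(domain_sizes, min_zero_genes), key=fitness)


# Three attributes with two values each.
all_groups = list(enumerate_chromosomes([2, 2, 2]))
with_two_zeros = list(enumerate_chromosomes([2, 2, 2], min_zero_genes=2))
```

With one required zero gene the enumeration yields the 19 groups counted in Section 4.2; requiring two zero genes, as attribute-based pruning does, leaves 7 candidates.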

6.1 Experimental Results on Execution Time

For the experiment on execution time, we used the Pima Indians Diabetes dataset [15]. Table 6.1 and Table 6.2 summarize its characteristics and the attributes included in the dataset. We evaluated our method with respect to the size of the dataset and compared it with the exhaustive method. To obtain larger datasets, we duplicated the Pima Indians Diabetes dataset up to 5 times. The experimental results are shown in Figure 6.1.

Table 6.1 The characteristics of the Pima Indians Diabetes dataset.

| Data set | Number of records | Number of attributes |
|---|---|---|
| Pima Indians Diabetes dataset | 768 | 9 |

Table 6.2 The descriptions of the attributes included in the Pima Indians Diabetes dataset.

| No. | Attribute | Description |
|---|---|---|
| 1 | NPG | Number of times pregnant |
| 2 | PGL | Plasma glucose concentration |
| 3 | DIA | Diastolic blood pressure |
| 4 | TSF | Triceps skin fold thickness |
| 5 | INS | Serum insulin concentration |
| 6 | BMI | Body mass index |
| 7 | DPF | Diabetes pedigree function |
| 8 | AGE | Age in years |
| 9 | Class | Indicates whether the patient has diabetes or not |

Figure 6.1 Comparison of execution time between the exhaustive method and the GA-based method.

As the results demonstrate, our GA-based method is faster than the exhaustive method regardless of the number of tuples in the dataset, and the difference becomes more significant as the dataset grows. Note that the main factor in the execution time is the process of evaluating the chromosomes (data groups), i.e., computing the normalized CBSQS. Our method can significantly prune the number of candidate data groups, leading to fewer fitness evaluations.

We also conducted an experiment to observe how the average fitness evolves over the generations. As illustrated in Figure 6.2, the average fitness improves as the number of generations increases. The results converge to the best solution at the 8th generation when “0” on attribute “NPG” is chosen as the suspected disguise value, and at the 94th generation when “0” on attribute “PGL” is chosen.

Figure 6.2 Average fitness vs. number of generations for different d_ij.

6.2 Experimental Results on Solution Correctness

In the second part of our experiments, we tested the solution correctness of our method on the FDA Adverse Event Reporting System (FAERS) dataset [3] covering January 1 to March 31, 2004 (2004Q1 for short), which is released by the U.S. Food and Drug Administration (FDA) and contains adverse drug reaction reports. Because several attributes contain a large number of missing values, we chose the 4 attributes with the fewest missing data, namely “EVENT_DT”, “GNDR_COD”, “AGE”, and “WT”. We divided the attribute “EVENT_DT” into three attributes, “Year”, “Month”, and “Day”, so that we obtained 6 attributes in total. Since the values of attributes “AGE” and “WT” are continuous, we divided them into nine age levels and ten weight levels according to the Gaubius method and the body surface area method [4], as shown in Table 6.3 and Table 6.4. Finally, the details of the resulting dataset are given in Table 6.5 and Table 6.6.

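Table 6.3 Age level divided by Gaubius method.

| Level | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Age | <1 | 1~2 | 2~3 | 3~4 | 4~7 | 7~14 | 14~20 | 20~60 | >60 |

Table 6.4 Weight level divided by body surface area method.

| Level | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Weight | <2.5 | 2.5~3.2 | 3.2~4.5 | 4.5~10 | 10~15 | 15~23 | 23~30 | 30~40 | 40~54 | >54 |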
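The preprocessing described above can be sketched in Python as follows. Several details are assumptions of this sketch rather than facts from the text: the cut points are read off Table 6.3 and Table 6.4, a value equal to a cut point is assigned to the higher level, and “EVENT_DT” is assumed to be a complete eight-character YYYYMMDD string. The names `AGE_BOUNDS`, `WEIGHT_BOUNDS`, `level`, and `split_event_date` are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence, Tuple

# Cut points between consecutive levels, taken from Tables 6.3 and 6.4.
AGE_BOUNDS = [1, 2, 3, 4, 7, 14, 20, 60]                   # 9 levels
WEIGHT_BOUNDS = [2.5, 3.2, 4.5, 10, 15, 23, 30, 40, 54]    # 10 levels


def level(value: float, bounds: Sequence[float]) -> int:
    """Map a continuous value to its 1-based level.

    Assumption: a value equal to a cut point falls into the higher level.
    """
    return bisect_right(bounds, value) + 1


def split_event_date(event_dt: str) -> Tuple[int, int, int]:
    """Split an assumed YYYYMMDD string into (year, month, day)."""
    if len(event_dt) != 8 or not event_dt.isdigit():
        raise ValueError("expected a complete YYYYMMDD date string")
    return int(event_dt[:4]), int(event_dt[4:6]), int(event_dt[6:8])


age_level = level(30, AGE_BOUNDS)
weight_level = level(70, WEIGHT_BOUNDS)
year, month, day = split_event_date("20040101")
```

Incomplete dates are rejected here for simplicity; how partially recorded dates were treated in the experiments is not stated in the text.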

Table 6.5 The characteristics of the 2004 Q1 FAERS dataset.

| Data set | Number of records | Number of attributes |
|---|---|---|
| FAERS 2004 Q1 | 18140 | 6 |

Table 6.6 The descriptions of the attributes included in the 2004 Q1 FAERS dataset.

| No. | Attribute | Description |
|---|---|---|
| 1 | Year | Year in which the adverse event occurred or began. |
| 2 | Month | Month in which the adverse event occurred or began. |
| 3 | Day | Day on which the adverse event occurred or began. |
| 4 | AGE | Numeric value of patient's age at event. |
| 5 | WT | Numeric value of patient's weight. |
| 6 | GNDR_COD | Code for patient's sex. |

According to the study by Pearson [13], “January 1” is a common disguise value used as a surrogate for “data unknown” when entering event date data into the FAERS system. Similarly, the first day of other months, such as “February”, “March”, and “April”, is also very likely to be used as a disguise. For this reason, we expect that, when “January 1” and “February 1” are selected as the disguise values v_d, the best data groups discovered will contain “January” and “February” on attribute “Month”, respectively. Table 6.7 shows the experimental results of both the GA-based method and the exhaustive method. Zero values mean that the attributes are used for evaluating the degree of fitness.

Table 6.7 Comparison of the solutions found by the GA-based method and the exhaustive method.

| v_d | Method | Gender | Year | Month | AGE | WT | Fitness |
|---|---|---|---|---|---|---|---|
| January 1 | GA-based method | male | 2003 | January | 0 | 0 | 0.8863 |
| January 1 | Exhaustive method (pruning) | male | 2003 | January | 0 | 0 | 0.8863 |
| January 1 | Exhaustive method | male | 1998 | January | 0 | 0 | 0.9595 |
| February 1 | GA-based method | male | 2004 | February | 0 | 0 | 0.4182 |
| February 1 | Exhaustive method (pruning) | male | 2004 | February | 0 | 0 | 0.4182 |
| February 1 | Exhaustive method | male | 2000 | February | 0 | 0 | 0.6358 |

In this experiment, we ran the exhaustive approach in two variants, with and without the record-based pruning presented in Section 5.5. The solutions generated by these two variants were different for both “January 1” and “February 1”. Specifically, the exhaustive method without record-based pruning produced solutions with significantly higher fitness than the one with pruning. A further inspection showed that the solutions found by the exhaustive method without record-based pruning contain only 18 and 20 of the 18140 records in table T~_g. Since a small T~_g loses good representation of the original dataset and leads to biased results, we chose to use the exhaustive method with record-based pruning when comparing with our GA-based method.

For “January 1” as the disguise value, the system returned g* composed of “January” on “Month” and “male” on “Gender”, and for “February 1” it returned “February” on “Month” and “male” on “Gender”. Our proposed GA-based approach thus finds the same solutions as the exhaustive method. However, the results do not exactly match the analysis conducted by Pearson. This is because the statistical analysis in [13] considered only the day of the month, not the whole dataset including other attributes such as “Gender”.

Chapter 7 Conclusions and Future Work

7.1 Conclusions

The problem of detecting the data group most prone to a specific disguise value is a novel issue in the detection of disguised missing data that has not been addressed before. In this thesis, we proposed a genetic-algorithm-based approach that can effectively find the optimal solution to the problem. We use and modify the CBSQS-based method proposed by Hua and Pei [6] to devise a fitness function that can successfully evaluate the fitness of candidate chromosomes. We also developed some effective optimization techniques to avoid unnecessary exploration of the candidate space. Experimental results showed that our method can discover the same optimal results as those generated by the exhaustive method, and the discovered data group is the same as that derived by previous work that relied on tedious statistics-based manual
