
Chapter 5 The Proposed GA-based Detection Method

5.5 Candidate Pruning

As shown in Section 4.2, the search space of candidate data groups is on the order of O(m^(n-2)), which grows exponentially with n. To avoid unnecessary exploration of the search space, we developed several optimization techniques to prune unqualified candidates: attribute-based pruning, value-based pruning, record-based pruning, and hierarchy-based pruning.

Optimization 1 (attribute-based pruning): Any data group with cardinality larger than n-2 should be pruned, where n is the number of attributes in T~. This is because the EUS-based fitness function requires at least two nongrouping and disguise attributes to calculate the correlation between value pairs, as illustrated in Eq. (4.2). This means that any chromosome with fewer than two zero genes should not be generated. In our approach, we first make sure that the chromosomes generated in the initial population are qualified, and then enforce this rule in the crossover and mutation operations. Specifically, if any offspring generated by a crossover contains fewer than two zero genes, the operation is regarded as a failed mating and the parents are kept for the next generation.

Similarly, if the chromosome undergoing mutation has exactly two zero genes, the mutation operation selects a nonzero gene and changes its value.
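The zero-gene rule above can be sketched as follows. This is a minimal Python illustration rather than the C# implementation used in the thesis. The encoding (a list of integers in which 0 marks an attribute left out of the group and 1..k indexes a value of that attribute), the single-point crossover, and all function names are assumptions made for the example.

```python
import random

MIN_ZERO_GENES = 2  # Optimization 1: at least two zero genes per chromosome


def zero_count(chromosome):
    """Number of genes equal to 0 (attributes not used for grouping)."""
    return sum(1 for gene in chromosome if gene == 0)


def is_qualified(chromosome):
    return zero_count(chromosome) >= MIN_ZERO_GENES


def crossover(parent_a, parent_b, rng=random):
    """Single-point crossover (assumed operator).

    If any offspring has fewer than two zero genes, the mating fails
    and copies of the parents are returned instead.
    """
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    if is_qualified(child_a) and is_qualified(child_b):
        return child_a, child_b
    return parent_a[:], parent_b[:]


def mutate(chromosome, domain_sizes, rng=random):
    """Change one gene; with exactly two zero genes only a nonzero gene may change."""
    child = chromosome[:]
    if zero_count(child) == MIN_ZERO_GENES:
        positions = [i for i, g in enumerate(child) if g != 0]
    else:
        positions = list(range(len(child)))
    if not positions:
        return child
    i = rng.choice(positions)
    # Pick a new value different from the current one (0 is allowed).
    choices = [v for v in range(domain_sizes[i] + 1) if v != child[i]]
    if choices:
        child[i] = rng.choice(choices)
    return child
```

Because a zero gene is never altered when only two remain, a qualified chromosome stays qualified after mutation.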

Optimization 2 (value-based pruning): Any data group resulting in a projection T~_g that contains only one value on the disguise attribute D should be excluded, no matter whether that value is the disguise value v or not. This is because if T~_g contains only one value on attribute D, then T~_{v,g} will be either empty (if the value is not v) or equal to T~_g (if the value is v), both of which make the correlation computation meaningless. In our approach, we ensure that the initial population excludes such candidates, and we assign a very bad fitness to any chromosome generated by crossover or mutation that results in only one value on attribute D.
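The degenerate case behind value-based pruning can be checked directly. The sketch below is a simplified Python illustration: records are dictionaries, a group is a mapping from attribute to value, and the "projection" is reduced to a row filter. The function names and the toy records are hypothetical. An empty projection is also treated as prunable, which is an assumption beyond the rule stated above.

```python
def project(records, group):
    """Rows matching every (attribute, value) pair of the group, standing in for T~_g."""
    return [r for r in records if all(r[a] == v for a, v in group.items())]


def restrict_to_value(projection, disguise_attr, v):
    """Rows of the projection whose disguise attribute equals v, standing in for T~_{v,g}."""
    return [r for r in projection if r[disguise_attr] == v]


def should_prune_by_value(records, group, disguise_attr):
    """True when the projection holds at most one distinct value on the disguise attribute."""
    distinct = {r[disguise_attr] for r in project(records, group)}
    return len(distinct) <= 1


# Toy data (hypothetical): every "January" record has Day == 1.
records = [
    {"Month": "January", "Day": 1},
    {"Month": "January", "Day": 1},
    {"Month": "February", "Day": 1},
    {"Month": "February", "Day": 15},
]

january = project(records, {"Month": "January"})
same_as_projection = restrict_to_value(january, "Day", 1) == january  # value is v
empty = restrict_to_value(january, "Day", 15) == []                   # value is not v
```

In both cases the comparison between T~_{v,g} and T~_g carries no information, which is why such a group is excluded.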

Optimization 3 (record-based pruning): Any data group resulting in a projection T~_g that contains too few records will be pruned. This is because a smaller subset tends to be less representative of the original dataset. Therefore, we avoid creating such candidates when generating the initial population, and also assign a very bad fitness to any chromosome with this problem after crossover and mutation.
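One way to realize the "very bad fitness" for undersized projections is a guard around the fitness function, sketched below in Python. The threshold min_records is hypothetical, since the text does not state the value used. The sketch also assumes that a larger fitness is better, so the penalty is negative infinity.

```python
PENALTY_FITNESS = float("-inf")  # assumed: larger fitness is better


def guarded_fitness(records, group, fitness_fn, min_records):
    """Return the penalty when the projection is too small, else the real fitness.

    `fitness_fn` receives the matching rows; `min_records` is a hypothetical threshold.
    """
    projection = [r for r in records if all(r[a] == v for a, v in group.items())]
    if len(projection) < min_records:
        return PENALTY_FITNESS
    return fitness_fn(projection)


# Toy usage with a placeholder fitness that just counts rows.
toy = [{"Gender": "male"}] * 4 + [{"Gender": "female"}]
small = guarded_fitness(toy, {"Gender": "female"}, len, min_records=3)
large = guarded_fitness(toy, {"Gender": "male"}, len, min_records=3)
```

A penalized chromosome can still appear after crossover or mutation, but it is unlikely to survive selection.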

Optimization 4 (hierarchy-based pruning): This optimization prunes candidates by exploiting the hierarchy information existing between attribute values. Take the school information shown in Figure 5.4 as an example. The department of “computer science” is a descendant of the college of engineering. In the context of relational databases, there is a functional dependency between attributes “College” and “Department”, i.e., “College” is dependent on “Department” (“Department” → “College”).

Figure 5.4 A hierarchical relation of school information.

We utilize this information to devise another type of pruning. Consider a group g = (g_1, g_2, …, g_p). If there exist two values g_i and g_j such that g_i is a descendant of g_j in the value hierarchy, then g can be pruned and replaced by g' = g - {g_j}, i.e., g' = (g_1, g_2, …, g_{j-1}, g_{j+1}, …, g_p). This is because the resulting projections T~_g and T~_{g'} have exactly the same tuple values in every nongrouping attribute.
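The replacement of g by g' can be sketched in Python as below. The parent map is read off the Department and College levels of Figure 5.4 and is only illustrative. Representing a group as a tuple of bare values (without their attribute names) is a simplification, and the hierarchy is assumed to be a tree.

```python
# Child -> parent links (illustrative, following the layout of Figure 5.4).
PARENT = {
    "CS": "Engineering",
    "EE": "Engineering",
    "MS": "Management",
    "FM": "Management",
}


def ancestors(value, parent):
    """All ancestors of a value, nearest first (assumes the hierarchy has no cycles)."""
    found = []
    while value in parent:
        value = parent[value]
        found.append(value)
    return found


def prune_ancestors(group, parent):
    """Drop every value that is an ancestor of another value in the same group."""
    redundant = set()
    for value in group:
        redundant.update(a for a in ancestors(value, parent) if a in group)
    return tuple(v for v in group if v not in redundant)


pruned = prune_ancestors(("Engineering", "CS", "male"), PARENT)
# "Engineering" is implied by "CS", so only ("CS", "male") remains.
```

A group that contains only the ancestor, such as ("Engineering", "male"), is left unchanged.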

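[Figure 5.4 shows a three-level value hierarchy: University (NUK, NTU), College (Engineering, Management), and Department (CS, EE, … under Engineering; MS, FM, … under Management).]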

Chapter 6 Experiments and Analysis

In this chapter, we describe the experiments performed to evaluate the proposed GA-based method. All experiments were conducted on a personal computer running the Microsoft Windows 7 Professional operating system, with an Intel Core i7-2600 3.4 GHz CPU, 8 GB of main memory, and a 500 GB hard disk. We used Microsoft SQL Server 2008 R2 as the database system, and all programs were coded in C#.

The order q in Eq. (3.2) was set to 1. The following parameter settings were used in our method.

max generation: 100
population size: 30
crossover probability: 0.75
mutation probability: 0.033
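To show where these four parameters enter a genetic algorithm, the following Python skeleton may help. It is a generic sketch, not the thesis's implementation: binary tournament selection, single-point crossover, per-gene mutation, and the names GAConfig and run_ga are all assumptions, and the pruning rules of Section 5.5 are omitted for brevity.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GAConfig:
    max_generations: int = 100
    population_size: int = 30
    crossover_prob: float = 0.75
    mutation_prob: float = 0.033  # assumed to apply per gene


def tournament(population, fitness, rng):
    """Binary tournament (assumed selection scheme); returns a copy of the winner."""
    a, b = rng.sample(population, 2)
    return list(a if fitness(a) >= fitness(b) else b)


def run_ga(fitness, domain_sizes, config=GAConfig(), seed=0):
    """Maximize `fitness` over integer chromosomes with gene i in 0..domain_sizes[i]."""
    rng = random.Random(seed)
    population = [[rng.randint(0, k) for k in domain_sizes]
                  for _ in range(config.population_size)]
    best = max(population, key=fitness)
    for _ in range(config.max_generations):
        offspring = []
        while len(offspring) < config.population_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            if len(p1) > 1 and rng.random() < config.crossover_prob:
                point = rng.randrange(1, len(p1))
                p1, p2 = p1[:point] + p2[point:], p2[:point] + p1[point:]
            for child in (p1, p2):
                for i, k in enumerate(domain_sizes):
                    if rng.random() < config.mutation_prob:
                        child[i] = rng.randint(0, k)
                offspring.append(child)
        population = offspring[:config.population_size]
        best = max(population + [best], key=fitness)  # remember the best seen so far
    return best
```

With a population of 30 over 100 generations, at most 30 + 100 × 30 = 3030 chromosomes are generated per run, which bounds the number of fitness evaluations needed if scores are cached per chromosome.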

Our experiments consist of two parts. The first part focuses on the performance of our method, and the second part examines the correctness of the solution. To evaluate the efficiency and the correctness of our GA-based method, we compared it with an exhaustive method, which simply evaluates all possible candidates in set G to find the best solution.
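The exhaustive baseline amounts to enumerating every chromosome and keeping the best one. A minimal Python sketch is given below. The integer encoding (0 for an unused attribute), the optional is_candidate filter that stands in for membership in G, and the toy fitness function are assumptions for illustration only.

```python
from itertools import product


def exhaustive_search(fitness, domain_sizes, is_candidate=lambda genes: True):
    """Evaluate every chromosome with gene i in 0..domain_sizes[i]; return (best, score)."""
    best, best_score = None, float("-inf")
    for genes in product(*(range(k + 1) for k in domain_sizes)):
        if not is_candidate(genes):
            continue
        score = fitness(genes)
        if score > best_score:
            best, best_score = genes, score
    return best, best_score


def toy_fitness(genes):
    """Placeholder fitness, not the CBSQS-based function of the thesis."""
    return 2 * genes[0] + genes[1] - genes[2]


unfiltered = exhaustive_search(toy_fitness, [2, 2, 1])
at_least_two_zeros = exhaustive_search(
    toy_fitness, [2, 2, 1], is_candidate=lambda g: sum(x == 0 for x in g) >= 2
)
```

The number of evaluations is the product of (k + 1) over all attributes, which is what makes this baseline expensive as the number of attributes grows.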

6.1 Experimental Results on Execution Time

For the experiment on execution time, we used the Pima Indians Diabetes dataset from the UCI Machine Learning Repository [15]. We evaluated our method with respect to the size of the dataset and compared it with an exhaustive method. In order to obtain larger datasets, we duplicated the Pima Indians Diabetes dataset up to 5 times. The experimental result is shown in Figure 6.1.
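The scaling procedure can be sketched as a small timing harness in Python. The function names are hypothetical, and the sketch assumes that "up to 5 times" means dataset sizes of one to five copies of the original 768 records.

```python
import time


def replicate(records, copies):
    """Concatenate `copies` copies of the dataset (each row is copied)."""
    return [dict(r) for _ in range(copies) for r in records]


def time_method(method, records, max_copies=5):
    """Run `method` on 1..max_copies copies; return (number of records, seconds) pairs."""
    timings = []
    for copies in range(1, max_copies + 1):
        data = replicate(records, copies)
        start = time.perf_counter()
        method(data)
        timings.append((len(data), time.perf_counter() - start))
    return timings
```

Calling time_method once with the GA-based search and once with the exhaustive search on the same records gives the two series to compare.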

Table 6.1 The characteristics of the Pima Indians Diabetes dataset.

Data set                       Number of records  Number of attributes
Pima Indians Diabetes dataset  768                9

Table 6.2 The descriptions of the attributes included in the Pima Indians Diabetes dataset.

No.  Attribute  Description
1    NPG        Number of times pregnant
2    PGL        Plasma glucose concentration
3    DIA        Diastolic blood pressure
4    TSF        Triceps skin fold thickness
5    INS        Serum insulin concentration
6    BMI        Body mass index
7    DPF        Diabetes pedigree function
8    AGE        Age in years
9    Class      Whether the patient has diabetes

Figure 6.1 Comparison of execution time between the exhaustive method and the GA-based method.

As the results demonstrate, our GA-based method is faster than the exhaustive method regardless of the number of tuples in the dataset, and the difference becomes more significant as the dataset grows. Note that the dominant factor in execution time is the evaluation of the chromosomes (data groups), i.e., computing the normalized CBSQS. Our method significantly prunes the number of candidate data groups, which leads to fewer fitness evaluations.

We also conducted an experiment to observe how the average fitness evolves over the generations. As illustrated in Figure 6.2, the average fitness improves as the number of generations increases. The results converge to the best solution at the 8th generation when “0” on attribute “NPG” is chosen, and at the 94th generation when “0” on attribute “PGL” is chosen.

Figure 6.2 Average fitness vs. number of generations for different d_ij.

6.2 Experimental Results on Solution Correctness

In the second part of our experiments, we tested the solution correctness of our method on the FDA Adverse Event Reporting System (FAERS) dataset [3] for January 1 to March 31, 2004 (2004 Q1 for short), which is released by the U.S. Food and Drug Administration (FDA) and contains adverse drug reaction reports. Because several attributes contain a large number of missing values, we chose the 4 attributes with the fewest missing data: “EVENT_DT”, “GNDR_COD”, “AGE”, and “WT”. We divided the attribute “EVENT_DT” into three attributes, namely “Year”, “Month”, and “Day”, so that the dataset has 6 attributes in total. Since the values of attributes “AGE” and “WT” are continuous, we divided them into nine age levels and ten weight levels according to the Gaubius method and the body surface area method [4], as shown in Table 6.3 and Table 6.4, respectively. Finally, the details of the resulting dataset are given in Table 6.5 and Table 6.6.
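The preprocessing just described can be sketched in Python as follows. The cut points are taken from Table 6.3 and Table 6.4. The handling of values that fall exactly on a boundary and the 'YYYYMMDD' string format assumed for “EVENT_DT” are assumptions, as are the function names.

```python
from bisect import bisect_right

AGE_BOUNDS = [1, 2, 3, 4, 7, 14, 20, 60]             # Table 6.3: nine levels
WT_BOUNDS = [2.5, 3.2, 4.5, 10, 15, 23, 30, 40, 54]  # Table 6.4: ten levels


def level(value, bounds):
    """1-based level of a value.

    A value on an interior boundary goes to the upper level (assumed); a value
    equal to the last boundary stays below the open-ended top level.
    """
    if value == bounds[-1]:
        return len(bounds)
    return bisect_right(bounds, value) + 1


def split_event_date(event_dt):
    """Split a 'YYYYMMDD' string (assumed format) into (year, month, day)."""
    return int(event_dt[:4]), int(event_dt[4:6]), int(event_dt[6:8])


year, month, day = split_event_date("20040101")
age_level = level(35, AGE_BOUNDS)   # falls in the 20~60 level
wt_level = level(70.0, WT_BOUNDS)   # falls in the >54 level
```

Records with partial or missing dates would need separate handling, which this sketch does not attempt.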

Table 6.3 Age level divided by Gaubius method.

<1 1~2 2~3 3~4 4~7 7~14 14~20 20~60 >60

Table 6.4 Weight level divided by body surface area method.

<2.5 2.5~3.2 3.2~4.5 4.5~10 10~15 15~23 23~30 30~40 40~54 >54

Table 6.5 The characteristics of the FAERS 2004 Q1 dataset.

Data set       Number of records  Number of attributes
FAERS 2004 Q1  18140              6

Table 6.6 The descriptions of the attributes included in the FAERS 2004 Q1 dataset.

No.  Attribute  Description
1    Year       Year in which the adverse event occurred or began
2    Month      Month in which the adverse event occurred or began
3    Day        Day on which the adverse event occurred or began
4    AGE        Numeric value of patient's age at event
5    WT         Numeric value of patient's weight
6    GNDR_COD   Code for patient's sex

According to the study by Pearson [13], “January 1” is a common disguise value used as a surrogate for “data unknown” when entering Event Date data into the FAERS system. Similarly, the first day of other months, such as February, March, and April, is also very likely to be used as a disguise. For this reason, we expect that when “January 1” and “February 1” are selected as the disguise value v_d, the best data groups discovered will contain “January” and “February” on attribute “Month”, respectively. Table 6.7 shows the experimental results of both the GA-based method and the exhaustive method. Zero values mean that the attributes are used for evaluating the degree of fitness.

Table 6.7 Comparison of the solutions found by the GA-based method and the exhaustive method.

v_d         Method                       Gender  Year  Month     AGE  WT  Fitness
January 1   GA-based method              male    2003  January   0    0   0.8863
January 1   Exhaustive method (pruning)  male    2003  January   0    0   0.8863
January 1   Exhaustive method            male    1998  January   0    0   0.9595
February 1  GA-based method              male    2004  February  0    0   0.4182
February 1  Exhaustive method (pruning)  male    2004  February  0    0   0.4182
February 1  Exhaustive method            male    2000  February  0    0   0.6358

In this experiment, we performed two different exhaustive approaches, with and without the record-based pruning presented in Section 5.5. The solutions generated by these two exhaustive methods were different for both “January 1” and “February 1”. Specifically, the exhaustive method without record-based pruning found significantly better solutions than the one with pruning. A further inspection showed that, for the solutions found by the exhaustive method without record-based pruning, T~_g contains only 18 and 20 of the 18140 records. Since a small T~_g is a poor representation of the original dataset and leads to biased results, we chose the exhaustive method with record-based pruning for the comparison with our GA-based method.

For “January 1” as the disguise value, the system returned g* composed of “January” on “Month” and “male” on “Gender”, and for “February 1” it returned “February” on “Month” and “male” on “Gender”. Our proposed GA-based approach thus finds the same solutions as the exhaustive method. However, the results do not exactly match the analysis conducted by Pearson. This is because the statistical analysis in [13] considered only the day of the month, not the whole dataset including other attributes such as “Gender”.

Chapter 7 Conclusions and Future Work

7.1 Conclusions

The problem of detecting the data group most prone to a specific disguise value is a novel issue in the detection of disguised missing data and has not been addressed before. In this thesis, we proposed a genetic algorithm based approach that can effectively find the optimal solution to the problem. We modified the CBSQS-based method proposed by Hua and Pei [6] to devise a fitness function that evaluates the fitness of candidate chromosomes. We also developed several optimization techniques to avoid unnecessary exploration of the candidate space. Experimental results showed that our method discovers the same optimal results as the exhaustive method, and that the discovered data group is consistent with that derived by previous work, which relied on tedious, statistics-based manual examination.

7.2 Future Work

Although our genetic algorithm based method has shown its effectiveness and correctness in finding the data group most prone to a specific disguise value, some improvements are worthy of further investigation in the future, which are summarized as follows:

1. A recent work by Belen has shown the benefit of replacing the CBSQS function with a chi-square test based function to measure the similarity of two tables. We will adopt the chi-square test based function as the fitness function in our method for evaluating whether a data group is good.

2. Although our genetic algorithm based method can effectively discover the optimal solutions to the problem, it requires a large amount of computation. In the future, we will develop more efficient methods, such as a greedy approach.

References

[1] R. Belen, “Detecting disguised missing data,” Master Thesis, The Middle East Technical University, February 2009.

[2] R. Belen and T. T. Temizel, “A framework to detect disguised missing data,” in Knowledge Discovery Practices and Emerging Applications of Data Mining: Trends and New Domains, A. V. Senthil Kumar, Ed. USA: IGI Global, 2010, pp. 1-22.

[3] FDA Adverse Event Reporting System, Available: http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm083765.htm, [Jun. 19, 2013].

[4] Hsin Chu General Hospital, Department of Health, Executive Yuan, R.O.C., Available: https://dss.hch.gov.tw/other8.asp, [Jun. 19, 2013].

[5] J. Holland, Adaptation in Natural and Artificial Systems. Cambridge, MA: MIT Press, 1992.

[6] M. Hua and J. Pei, “Cleaning disguised missing data: a heuristic approach,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 950-958.

[7] M. Hua and J. Pei, “DiMaC: A system for cleaning disguised missing data,” in International Conference on Management of Data, Vancouver, BC, Canada, June 2008, pp. 9-12.

[8] R. Little and D. Rubin, Statistical Analysis with Missing Data. New York: Wiley Publishers, 1987.

[9] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs. Berlin: Springer, 1994.

[10] M. Mitchell, An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press, 1996.

[11] H. Mühlenbein, “How genetic algorithms really work: I. Mutation and hillclimbing,” in Parallel Problem Solving from Nature 2, Reinhard Manner and Bernard Manderick, Eds. Brussels, Belgium, September 1992, pp. 15-25.

[12] K. Natarajan, J. Li, and A. Koronios, “Detecting mis-entered values in large data sets,” in Proceedings of the 4th World Congress on Engineering Asset Management, Athens, Greece, 2009, pp. 805-812.

[13] R. K. Pearson, “The problem of disguised missing data,” ACM SIGKDD Explorations Newsletter, Vol. 8, No. 1, pp. 83-92, June 2006.

[14] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Boston, MA: Pearson Education, Inc., 2006.

[15] UCI Machine Learning Repository: Pima Indians Diabetes Data Set, Available: http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes, [Jun. 19, 2013].
