
Chapter 2 Background and Related Work

2.4 Related Work

Disguised missing data was first defined by Pearson [13] in 2006, in a study that analyzed the problems of both missing data and disguised missing data. Pearson described the sources of disguised missing data and illustrated its influence on simple statistics, hypothesis tests, correlations and regression models, and classification trees, then discussed whether the affected records should be ignored. In that study, disguise values were identified semi-manually, by looking for unusual values or patterns in the dataset.

Hua and Pei [6] proposed the first automatic approach to detecting disguised missing data in 2007, called the EUS heuristic, which is based on the concept of embedded unbiased sampling. This method finds the unbiased sample using the correlation-based sample quality score (CBSQS) and finally outputs the suspected disguised missing values on each attribute. The method primarily aims at detecting the

[Figure: flowchart of a genetic algorithm. Randomly generate the initial parent population; calculate the degree of fitness for each chromosome; if the termination condition is satisfied, output the result; otherwise select parents, perform the crossover and mutation operations, generate the new population, and repeat.]

also be applied to detecting the second type of disguised missing value, no mechanism has been developed to locate the data group to which their method is best applied. In other words, users have to test all possible data groups with their method to find the most suspect group.

In 2009, Belen modified the EUS heuristic by replacing CBSQS with a chi-square two-sample test for evaluating unbiased samples. The chi-square two-sample test checks whether two samples come from the same distribution, without requiring that distribution to be specified. This approach addresses the deficiency that data dependency may exist between pairs of attribute values. That is, it can be applied to disguised missing data that is missing completely at random as well as to data that is missing at random.

Natarajan et al. proposed another approach for detecting disguised missing data in large datasets [12]. Their method detects and corrects disguised entries using heuristic approaches, partial domain knowledge, and univariate methods, relying on the association rules between attribute values. As a result, this approach is not fully automatic.

In previous studies, the detection of disguised missing data has moved from manual to automatic systems, but these systems usually focus on obtaining the suspected disguised missing values on each attribute. Although some of these approaches can be applied to find different types of suspected disguise values, they provide no mechanism to determine which subgroup of the dataset is most likely to hold the second type of disguised missing data, i.e., data missing at random. Therefore, in this thesis we propose a method based on genetic algorithms to search for the group most prone to a specific disguise value.

Chapter 3

Embedded Unbiased Sample Based Detection of Disguise Value

3.1 Embedded Unbiased Sample Heuristic

Hua and Pei [6] proposed a heuristic method to detect suspicious disguised missing data. The method is based on the concept of embedded unbiased sampling. An unbiased sample is a subset that presents characteristics and a distribution similar to those of the original dataset. Before introducing the unbiased sampling based heuristic method, we first present some definitions and assumptions made by Hua and Pei.

First, they assume that on an attribute there often exist only a small number of values that are frequently used as disguises by the disguised missing data. Those values are called the frequently used disguises. Second, they assume that the disguised tuples are randomly distributed in the whole dataset.

Let $T$ be the truth table and $\tilde{T}$ be the recorded table. $T_{A=v}$ is called the projected database of $v$: all the tuples in $T_{A=v}$ contain value $v$ on attribute $A$. For simplicity, we denote $T_{A=v}$ as $T_v$; the projected database taken from the recorded table is written $\tilde{T}_v$ accordingly. The basic concept behind embedded unbiased sampling is best explained with an example.

Consider Example 1. Let $T_{\text{single}}$ be the projected database of “single” on attribute “Marital Status”. Conceptually, $T_{\text{single}}$ can be divided into two exclusive subsets $R_{\text{single}}$ and $S_{\text{single}}$, where $R_{\text{single}}$ contains all tuples that have value “single” on attribute “Marital Status” and whose values are not missing in the truth table, while $S_{\text{single}}$ contains those tuples whose values on attribute “Marital Status” are disguised missing and for which the value “single” is used as the disguise value. Figure 3.1 shows the relationship between $T_{\text{single}}$, $R_{\text{single}}$, and $S_{\text{single}}$.

Figure 3.1 The EUS heuristic.

Based on the assumption that disguised tuples are randomly distributed in the whole dataset, if the value “single” is frequently used as a disguise in these tuples, the subset $S_{\text{single}}$ will be an unbiased sample of the fact table except for the attribute “Marital Status”. Likewise, the subset $T_{\text{married}}$, which contains “married” on attribute “Marital Status”, can also be divided into $R_{\text{married}}$ and $S_{\text{married}}$. If the value “single” is used more frequently than “married” as a disguise value on attribute “Marital Status”, then $S_{\text{single}}$ from $\tilde{T}_{\text{single}}$ should be larger than $S_{\text{married}}$ from $\tilde{T}_{\text{married}}$.

Hua and Pei define the embedded unbiased sample heuristic (EUS heuristic for short) as follows: If $v$ is frequently used as a disguise value on attribute $A$, then there exists a large subset $S_v \subseteq \tilde{T}_{A=v}$ such that $S_v$ is an unbiased sample of $\tilde{T}$ except for attribute $A$.

According to the EUS heuristic, $S_v$ is an unbiased sample of $\tilde{T}$. The larger $S_v$ is, the more frequently $v$ is used as a disguise. If value $v$ is frequently used as a disguise, it is called a frequent disguise value.

Unfortunately, $S_v$ is unknown and hard to compute from $\tilde{T}$. To find frequently used disguises, the EUS heuristic therefore suggests a heuristic approach: on each attribute, find a small number of attribute values whose projected databases contain a large subset that is an unbiased sample of the whole data table. Those attribute values are suspects of frequently used disguise values.

The larger the unbiased sample subset, the more likely the value is a disguise value. So it is required to find the maximal embedded unbiased sample Mv, called MEUS for short. The relationship between Tv, Mv, and Sv is shown in Figure 3.2.

Figure 3.2 The relationship between Tv, Mv, and Sv [6] .

3.2 CBSQS: Measurement of Unbiased Sample

Here comes an important technical challenge: how can we measure whether a subset is an unbiased sample? The table in question has multiple attributes, and measuring whether two multidimensional datasets have a similar distribution is a complex problem.

Observing that correlation usually captures the distribution of a dataset well, Hua and Pei propose a correlation-based approach to measure whether $\tilde{T}'$ is a good sample of $\tilde{T}$. The idea is: if the values correlated in $\tilde{T}$ are also correlated in $\tilde{T}'$, and vice versa, values correlated in $\tilde{T}'$ are also correlated in $\tilde{T}$, then $\tilde{T}$ and $\tilde{T}'$ are of similar distribution.

Because computing the correlations among all possible combinations of values is too costly, they compute only the correlation between pairs of values $v_i$ and $v_j$. The correlation between $v_i$ and $v_j$ is given by:

The similarity between $\tilde{T}'$ and $\tilde{T}$ is then measured by the correlation-based sample quality score (CBSQS for short), denoted as $CBSQS(\tilde{T}')$, in which the parameter $q$ imitates the Minkowski distances [14]. Note that the score obtained by CBSQS is a non-negative number. The higher the score of a subset $\tilde{T}'$, the better $\tilde{T}'$ is as an unbiased sample of $\tilde{T}$.
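As a rough illustration of the idea, the following Python sketch scores how well a subset preserves the pairwise value correlations of the full table. Two things are assumptions made only for this sketch: the correlation is taken to be the ratio P(vi, vj) / (P(vi) P(vj)), and each pair is penalized by |Corr - Corr'|^q in the denominator. The exact definitions are those given in [6] and may differ. The names `pair_correlations` and `cbsqs_sketch` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_correlations(rows):
    """Correlation of every co-occurring value pair on two different attributes.

    Assumed form: Corr(vi, vj) = P(vi, vj) / (P(vi) * P(vj)).
    Values are keyed by (attribute index, value), so equal strings on
    different attributes stay distinct.
    """
    n = len(rows)
    single = Counter()
    pair = Counter()
    for row in rows:
        items = list(enumerate(row))
        single.update(items)
        pair.update(combinations(items, 2))
    return {
        (a, b): (count / n) / ((single[a] / n) * (single[b] / n))
        for (a, b), count in pair.items()
    }

def cbsqs_sketch(sample, table, q=2):
    """Assumed score: sum, over the value pairs present in the sample, of
    Corr_table / (1 + |Corr_table - Corr_sample| ** q).

    `sample` must be a subset of `table`, so every pair is known in both.
    """
    corr_table = pair_correlations(table)
    corr_sample = pair_correlations(sample)
    return sum(
        corr_table[p] / (1 + abs(corr_table[p] - corr_sample[p]) ** q)
        for p in corr_sample
    )
```

Under these assumptions the score is non-negative, and a subset that keeps the value pairs in the same proportions as the full table scores higher than one that keeps only a single pair.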

Now the kernel step to fulfill the EUS heuristic is to find the maximal embedded unbiased sample Mv corresponding to a value v of attribute A. For this purpose, Hua

maximizing the DV-score. That is, candidates of frequently used disguises, and the second one is the postprocessing phase, during which the results from phase one are forwarded to domain experts or other data cleaning algorithms for validation.

Phase 1: Mining candidates of frequent disguise values

Input: A table T and a threshold k on the number of candidates
Output: k candidates of frequent disguises on each attribute
Method:
1. for each attribute A do
2.     // applicability test
       check whether the projected databases of most (frequent) values on A are unbiased samples of T; if so, break;
3.     for each value v on A do derive Mv;
4.     find the top k value(s) with the best and largest Mv's;
   end for

Phase 2: Postprocessing: verify the candidates of frequent disguise values

In general, deriving $M_v$ in step 3 of phase one is costly when the database $\tilde{T}$ is large. Thus, a greedy method is adopted in [6] for deriving $M_v$. The basic idea is depicted in Figure 3.4.

Figure 3.4 An illustration of the greedy approach.

Step 1: Given table $\tilde{T}$ and value $v$ on attribute $A$, obtain the projected database $\tilde{T}_v$ of value $v$, consisting of tuples $t_0, t_1, \ldots, t_n$.
Step 2: Compute the dv-score gain, with respect to the original dataset, of every subset of $\tilde{T}_v$ obtained by removing one tuple: $\{t_1, t_2, \ldots, t_n\}$, $\{t_0, t_2, \ldots, t_n\}$, $\{t_0, t_1, t_3, \ldots, t_n\}$, ..., $\{t_0, t_1, \ldots, t_{n-1}\}$.
Step 3: Preserve the subtable whose dv-score gain is positive and the largest. If the largest dv-score gain is positive, continue; when no dv-score gain is positive, terminate and output the final subset $M_v$.

The projected database of $v$ is used as the initial sample. In each iteration, every tuple in the current sample is removed in turn, and the dv-score gain after removing that tuple is calculated. The subset with the largest positive dv-score gain replaces the current sample. The iteration continues and terminates when the dv-score cannot be improved any further. The sample at the end is output as the approximate $M_v$.

The greedy approach generates an approximate MEUS for every value $v$ on attribute $A$. The EUS algorithm then only has to compare the sizes of the $M_v$'s to find the top $k$ candidate values. These candidates are then verified by domain experts or other algorithms, as in the second phase shown in Figure 3.3.
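The greedy derivation described above can be sketched in Python as follows. The scoring function is passed in as a parameter because the sketch does not fix a particular dv-score; `greedy_meus` and `score` are hypothetical names, and any concrete score supplied by the caller is only a stand-in for the one used in [6].

```python
def greedy_meus(projected, score):
    """Greedily shrink the projected database while the score improves.

    projected: list of tuples (the projected database of v).
    score: callable mapping a list of tuples to a number
           (a stand-in for the dv-score).
    Returns an approximate maximal embedded unbiased sample.
    """
    current = list(projected)
    current_score = score(current)
    while len(current) > 1:
        best_gain, best_subset = 0.0, None
        # Try removing each tuple in turn and keep the largest positive gain.
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            gain = score(candidate) - current_score
            if gain > best_gain:
                best_gain, best_subset = gain, candidate
        if best_subset is None:  # no removal improves the score: stop
            break
        current = best_subset
        current_score = score(current)
    return current
```

Each iteration evaluates one candidate per remaining tuple, so the sketch calls `score` a quadratic number of times in the size of the projected database in the worst case.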

Chapter 4

Problem Description

4.1 Preliminary

According to the study by Little and Rubin [8], missing data can be classified into three types according to their distribution in the dataset: missing completely at random, missing at random, and missing not at random.

As described in Section 2.2, disguised missing data is a special kind of missing data; therefore, disguise values can also be divided into three types. In our study, we focus on disguised missing data that is missing at random. That is, a disguise value is randomly distributed in a specific subset of the whole database. For example, when customers fill in an application form on the internet, they may not want to reveal private information such as birth date, age, or country. Consider a man whose birth date is February 29th and who, after entering “February” for “Month”, does not want to disclose his true “Birth date”. He therefore chooses the default value, say “1”, for “Day”.

Similarly, other customers born in February may also choose “February 1st” as a disguise. As a result, “February 1st” becomes a disguise value on the subset containing “February” on attribute “Month”, even though it is usually not a disguise value on the whole dataset. This is a typical scenario of disguised data missing at random.

To our knowledge, all previous work on detecting disguised missing data focuses on the first type, i.e., data missing completely at random; no study has been devoted to finding the data group most prone to a specific disguise value. In the following section, we will

solution presented in Chapter 5.

4.2 Formal Definition

Following the notation used in Chapter 3, let $\tilde{T}$ denote the recorded table of $T$ with attributes $A = \{A_1, A_2, \ldots, A_n\}$, and let $Dom(A_i)$ be the set of values for attribute $A_i$, $1 \le i \le n$. Given a suspected disguise value $v$, with $v \in Dom(D)$ and $D \in A$, we would like to discover, if $v$ is indeed a disguise value, the data group of $\tilde{T}$ that is most prone to using $v$ as a disguise value. To facilitate the discussion, we first formalize the term data group.

Definition 4.1. A data group $(G_1 = g_1, G_2 = g_2, \ldots, G_p = g_p)$ defined on an attribute subset $\{G_1, G_2, \ldots, G_p\} \subseteq A$ identifies the projection of $\tilde{T}$ on $G_1 = g_1, G_2 = g_2, \ldots, G_p = g_p$. That is, the tuples in group $(G_1 = g_1, G_2 = g_2, \ldots, G_p = g_p)$ all have the same values on attributes $G_1, G_2, \ldots, G_p$. Hereafter, when it is clear from the context, we use $(g_1, g_2, \ldots, g_p)$ instead of $(G_1 = g_1, G_2 = g_2, \ldots, G_p = g_p)$.

Now let $G$ denote the set of all data groups induced by the attribute set $A - \{D\}$. Note that we need at least one attribute other than the grouping attributes and the disguise attribute $D$ to perform the EUS procedure. It is noteworthy that the empty group means that no projection is performed on the original table $\tilde{T}$, so this case corresponds to the discovery of the maximal embedded unbiased sample on $\tilde{T}_v$. In this context, the problem discussed in [6] can be regarded as a special case of our problem.

Based on the concept of the maximal embedded unbiased sample, we can formalize the problem of detecting the data group most prone to the specific disguise value $v$ as finding the best data group $g^*$ in $G$ that maximizes Eq. (3.5). Since the search for the maximal embedded unbiased sample is performed on the projection of $\tilde{T}$ on $v$ associated with group $g$, i.e., $\tilde{T}_{v,g}$, instead of $\tilde{T}_v$, the problem is now formalized as



This value, however, is proportional to the cardinality of the table of concern: the larger the cardinality (number of value pairs) of table $\tilde{T}$, the larger this value is. In order not to favor larger projections of $\tilde{T}$

Similarly, we introduce the normalized DV-score of $v$ in $\tilde{T}$, denoted as $ndv(v, \tilde{T})$

normalized DV-score $ndv(v, \tilde{T}_g)$, i.e.,

which can be rewritten as

The complexity of finding $g^*$ is immense. Let $m_i$ be the cardinality of attribute $A_i$ in $\tilde{T}$, $1 \le i \le n$. Without loss of generality, we choose $A_1$ as the suspected attribute $D$. Each attribute $A_j$, $2 \le j \le n$, can take either one of $m_j$ different values if it is involved in forming the data group, or the empty value if it is not, leading to at most $(m_2 + 1) \times (m_3 + 1) \times \cdots \times (m_n + 1)$ different data groups. Note, however, that at least one attribute has to be excluded from forming the data group, so we have to discount all the cases in which every attribute is involved. The number of all possible data groups induced by the set $\{A_2, A_3, \ldots, A_n\}$ is therefore

$$\prod_{j=2}^{n} (m_j + 1) - \prod_{j=2}^{n} m_j$$
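The counting argument can be checked with a short Python sketch. The function names are hypothetical; `None` is used here to mark an attribute that is left out of the group, which is a representation chosen only for this illustration.

```python
from itertools import product
from math import prod

def count_data_groups(domain_sizes):
    """Number of data groups over the non-disguise attributes.

    Each attribute takes one of its m_j values or is left out (the +1);
    the cases in which every attribute is used for grouping are subtracted.
    """
    return prod(m + 1 for m in domain_sizes) - prod(domain_sizes)

def enumerate_data_groups(domains):
    """Yield each group as a tuple; None marks an attribute left out."""
    for combo in product(*[[None] + list(d) for d in domains]):
        if any(value is None for value in combo):
            yield combo

# Three binary attributes: 3 * 3 * 3 - 2 * 2 * 2 groups.
print(count_data_groups([2, 2, 2]))  # 19
```

The count grows multiplicatively with the number of attributes, which is why exhaustive enumeration quickly becomes impractical.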

Example 4.1. Let us consider Table 4.1. Suppose we choose “male” on “Gender” as the suspected disguise value $v$. Then the number of data groups induced by attributes “Marital Status”, “Literacy”, and “Education” is $(|Dom(\text{Marital Status})| + 1) \times (|Dom(\text{Literacy})| + 1) \times (|Dom(\text{Education})| + 1) - (|Dom(\text{Marital Status})| \times |Dom(\text{Literacy})| \times |Dom(\text{Education})|) = 3^3 - 2^3 = 19$. Specifically, let us consider the group defined on Marital Status = “married”. Table 4.2 shows the resulting projection $\tilde{T}_{\text{married}}$ on this group, wherein the shaded part corresponds to the further projection on “Male”, say $\tilde{T}_{\text{male,married}}$. According to Eq. (4.3), we have to find the maximal subset of $\tilde{T}_{\text{male,married}}$ that resembles (is an unbiased sample of) $\tilde{T}_{\text{married}}$. This process continues for the projections defined on all other groups to determine the best data group $g^*$.

Table 4.1 An example of dataset.

| Gender | Marital Status | Literacy   | Education   |
|--------|----------------|------------|-------------|
| Male   | Married        | Literate   | High school |
| Male   | Single         | Literate   | High school |
| Male   | Married        | Illiterate | High school |
| Male   | Single         | Illiterate | High school |
| Male   | Married        | Literate   | College     |
| Male   | Single         | Literate   | College     |
| Male   | Married        | Illiterate | College     |
| Male   | Single         | Illiterate | College     |
| Male   | Married        | Literate   | High school |
| Male   | Single         | Literate   | High school |
| Female | Married        | Illiterate | High school |
| Female | Single         | Illiterate | High school |
| Female | Married        | Literate   | College     |
| Female | Single         | Literate   | College     |
| Female | Married        | Illiterate | College     |
| Female | Single         | Illiterate | College     |
| Male   | Married        | Literate   | High school |
| Female | Single         | Literate   | College     |
| Female | Married        | Illiterate | College     |
| Female | Single         | Illiterate | High school |

Table 4.2 The resulting projection of Table 4.1 on “Married”.

| Gender | Marital Status | Literacy   | Education   |
|--------|----------------|------------|-------------|
| Male   | Married        | Literate   | High school |
| Male   | Married        | Illiterate | High school |
| Male   | Married        | Literate   | College     |
| Male   | Married        | Illiterate | College     |
| Male   | Married        | Literate   | High school |
| Male   | Married        | Literate   | High school |
| Female | Married        | Illiterate | High school |
| Female | Married        | Literate   | College     |
| Female | Married        | Illiterate | College     |
| Female | Married        | Illiterate | College     |
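The two projections used in Example 4.1 can be reproduced with a few lines of Python. The helper `project` and the tuple layout are hypothetical choices made for this sketch; the rows are those of Table 4.1.

```python
COLUMNS = ("Gender", "Marital Status", "Literacy", "Education")

# Rows of Table 4.1, in the order listed there.
TABLE_4_1 = [
    ("Male", "Married", "Literate", "High school"),
    ("Male", "Single", "Literate", "High school"),
    ("Male", "Married", "Illiterate", "High school"),
    ("Male", "Single", "Illiterate", "High school"),
    ("Male", "Married", "Literate", "College"),
    ("Male", "Single", "Literate", "College"),
    ("Male", "Married", "Illiterate", "College"),
    ("Male", "Single", "Illiterate", "College"),
    ("Male", "Married", "Literate", "High school"),
    ("Male", "Single", "Literate", "High school"),
    ("Female", "Married", "Illiterate", "High school"),
    ("Female", "Single", "Illiterate", "High school"),
    ("Female", "Married", "Literate", "College"),
    ("Female", "Single", "Literate", "College"),
    ("Female", "Married", "Illiterate", "College"),
    ("Female", "Single", "Illiterate", "College"),
    ("Male", "Married", "Literate", "High school"),
    ("Female", "Single", "Literate", "College"),
    ("Female", "Married", "Illiterate", "College"),
    ("Female", "Single", "Illiterate", "High school"),
]

def project(rows, group):
    """Keep the rows that match every attribute -> value pair in `group`."""
    index = {name: COLUMNS.index(name) for name in group}
    return [r for r in rows if all(r[index[a]] == v for a, v in group.items())]

t_married = project(TABLE_4_1, {"Marital Status": "Married"})  # Table 4.2
t_male_married = project(t_married, {"Gender": "Male"})        # its male rows
print(len(t_married), len(t_male_married))  # 10 6
```

An empty group leaves the table unchanged, which matches the remark that the empty group corresponds to no projection.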

Chapter 5

The Proposed GA-based Detection Method

In this chapter, we introduce a genetic-algorithm-based method for detecting the data group most prone to a specific disguise value. In Section 5.1, we first describe the general framework of our approach, and then detail the main components individually in the subsequent sections, including the chromosome representation, the selection, crossover, and mutation operations, and the fitness function used to evaluate each chromosome. Candidate pruning conditions are presented in Section 5.5.

5.1 General Framework

Figure 5.1 shows the general framework of our proposed genetic-algorithm-based (GA-based) method. The input of our algorithm is a recorded table $\tilde{T}$ and a suspected disguise value $v$ for which we intend to detect the most prone data group.
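A minimal Python sketch of such a generational loop is given below. It is only an outline under stated assumptions: the fitness function and the three operators are supplied by the caller, the name `run_ga` and its parameters are hypothetical, elitism is omitted for brevity, and returning the fittest chromosome seen in any generation is a choice made for this sketch.

```python
import random

def run_ga(population, fitness, select, crossover, mutate, max_gener=50, rng=None):
    """Generational loop: evaluate, select parents, cross over, mutate.

    population: list of chromosomes (lists of integers).
    fitness:    callable chromosome -> number.
    select:     callable (scored_population, rng) -> chromosome, where
                scored_population is a list of (fitness, chromosome) pairs.
    crossover:  callable (parent1, parent2, rng) -> iterable of offspring.
    mutate:     callable (chromosome, rng) -> chromosome.
    """
    rng = rng or random.Random()
    population_size = len(population)
    best = max(population, key=fitness)
    for _generation in range(max_gener):
        # Evaluate the fitness of each individual in the current population.
        scored = [(fitness(c), c) for c in population]
        new_population = []
        while len(new_population) < population_size:
            parent1, parent2 = select(scored, rng), select(scored, rng)
            for child in crossover(parent1, parent2, rng):
                new_population.append(mutate(child, rng))
        population = new_population[:population_size]
        best = max(population + [best], key=fitness)
    return best
```

Passing the operators in as callables keeps the loop independent of any particular selection, crossover, or mutation scheme.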

Figure 5.1 A general framework of the GA-based method.

Input: A table $\tilde{T}$ and a suspected disguise value $v$
Output: The chromosome with the best degree of fitness
Method:
    Initialize the parameters;
    Generate a population P randomly;
    generation ← 1;
    while generation ≤ max_gener do
        Clear the new population P';
        Evaluate the fitness of each individual in P;
        while |P'| < population_size do
            Select two parents from P;
            Perform crossover operation;
            Perform mutation operation;
            Put the offspring into P';
        endwhile
        P ← P';
        generation ← generation + 1;
    endwhile

5.2 Chromosome Representation

The first step and the most important part of GAs is the chromosome representation. A chromosome is an encoding of a possible solution to the problem. In our study, we encode each solution as a vector of non-repeated decimal integers. A non-zero integer indicates that the corresponding attribute value is used for forming the data group, while a zero indicates that the attribute is not included in forming the group, i.e., it is used for evaluating the degree of fitness using the EUS heuristic.

For example, consider a four-attribute table $T$, whose attribute $A_1$ contains two

We choose $v_{11}$ on $A_1$ and $v_{32}$ on $A_3$ for grouping and leave $A_2$ and $A_4$ as the attributes for evaluating the degree of fitness of each chromosome. The chromosome can be represented as shown in Figure 5.2.

Figure 5.2 An example for chromosome representation.
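One way to read this encoding is sketched below in Python. The mapping is an assumption made for illustration: a gene k > 0 is taken to select the k-th value of the attribute's domain, and the domain contents listed here are hypothetical placeholders. The function name `decode_chromosome` is likewise hypothetical.

```python
def decode_chromosome(chromosome, domains):
    """Map a chromosome to (grouping conditions, evaluation attributes).

    Assumption: gene k > 0 selects the k-th value (1-based) of that
    attribute's domain; gene 0 leaves the attribute for fitness evaluation.
    """
    group, evaluation = {}, []
    for attr, gene in zip(domains, chromosome):
        if gene == 0:
            evaluation.append(attr)
        else:
            group[attr] = domains[attr][gene - 1]
    return group, evaluation

# Hypothetical domains for a four-attribute table.
domains = {
    "A1": ["v11", "v12"],
    "A2": ["v21", "v22"],
    "A3": ["v31", "v32", "v33"],
    "A4": ["v41", "v42"],
}
group, evaluation = decode_chromosome([1, 0, 2, 0], domains)
print(group, evaluation)  # {'A1': 'v11', 'A3': 'v32'} ['A2', 'A4']
```

Under this reading, the chromosome (1, 0, 2, 0) groups on $v_{11}$ and $v_{32}$ and leaves $A_2$ and $A_4$ for evaluation.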

5.3 Evolutionary Operations

Before the evolutionary operations are applied, parent chromosomes must be chosen from the population; this step is called selection. According to the principle of evolution, choosing the chromosomes with a higher degree of fitness can generate a better population. However, always doing so restricts the possible solutions and may reduce the diversity of the population: the population then converges too quickly and may fail to find the optimal solution. In this study, we adopt the tournament selection method described by Mitchell [10] in 1996, which randomly chooses two chromosomes from the current population and draws a random number $r$ between 0 and 1, which is then compared with a predefined value, usually set to 0.75. If $r$ is less than this value, the chromosome with the higher fitness is chosen; otherwise, the chromosome with the lower fitness is chosen. Both chromosomes remain in the population and can be selected again. We also adopt the elitism principle described by Mitchell [10], whereby the best chromosome is preserved in the new population.
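The selection step can be sketched in Python as follows. The function names and the parameter `k` (the predefined threshold, 0.75 by default) are hypothetical names chosen for this sketch.

```python
import random

def tournament_select(population, fitness, k=0.75, rng=random):
    """Pick two distinct chromosomes at random; return the fitter one with
    probability k, otherwise the less fit one. The population is not
    modified, so either chromosome can be drawn again later."""
    a, b = rng.sample(population, 2)
    fitter, weaker = (a, b) if fitness(a) >= fitness(b) else (b, a)
    return fitter if rng.random() < k else weaker

def elite(population, fitness):
    """Elitism: the best chromosome, to be copied into the new population."""
    return max(population, key=fitness)
```

Keeping a non-zero chance of returning the weaker chromosome is what preserves diversity in the population.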

The crossover operation generates offspring in GAs by exchanging gene segments between two parents chosen from the population. Our method adopts one-point crossover, which is one of the most common crossover operations. This operation works by first selecting a crossover point randomly, dividing both parents at this point, and then exchanging the gene sequences to form the offspring. The operation is shown in Figure 5.3.

Figure 5.3 One-point crossover.
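A minimal Python sketch of one-point crossover is shown below. The function name is hypothetical, and restricting the cut point so that both segments are non-empty is a choice made for this sketch.

```python
import random

def one_point_crossover(parent1, parent2, rng=random):
    """Cut both parents at one random point and swap the tails."""
    if len(parent1) != len(parent2) or len(parent1) < 2:
        raise ValueError("parents must have equal length of at least 2")
    # The point lies in 1 .. len-1, so both segments are non-empty.
    point = rng.randrange(1, len(parent1))
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2
```

Each offspring keeps the head of one parent and the tail of the other, so every gene of the offspring comes from the same position in one of the parents.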

Mutation operation is used to increase genetic diversity. In our method, the position for mutation is selected randomly. The value of the selected gene mutates in the following way. If the gene is zero, it is changed to a random non-zero integer. On the other hand, if the gene is non-zero, it changes to another integer including zero.
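The mutation rule can be sketched in Python as follows. The argument `domain_sizes`, which bounds the range of each gene by the number of values of the corresponding attribute, is an assumption made for this sketch, as is the function name.

```python
import random

def mutate(chromosome, domain_sizes, rng=random):
    """Mutate one randomly chosen gene.

    A zero gene becomes a random non-zero value; a non-zero gene becomes a
    different value, possibly zero. domain_sizes[i] is the number of values
    of attribute i (an assumption about how gene ranges are bounded).
    """
    child = list(chromosome)
    pos = rng.randrange(len(child))
    if child[pos] == 0:
        child[pos] = rng.randint(1, domain_sizes[pos])
    else:
        choices = [g for g in range(domain_sizes[pos] + 1) if g != child[pos]]
        child[pos] = rng.choice(choices)
    return child
```

The sketch does not check whether the mutated chromosome still leaves at least one attribute with a zero gene for fitness evaluation; such a check would have to be added separately.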
