Chapter 1 Introduction
1.3 Thesis Organization
The other chapters of this thesis are organized as follows.
In Chapter 2, we describe the related background, including missing data, disguised missing data, and the concept of genetic algorithms, and review previous work on detecting disguised missing data.
In Chapter 3, we focus on the EUS heuristic. The chapter is divided into three parts: the concept of EUS sampling, the measurement of unbiased samples, and the framework of the EUS algorithm.
Chapter 4 lists some scenarios to help the reader better understand the problem, and gives the formal definition of the problem.
The details of our solution based on genetic algorithms are presented in Chapter 5, including the representation of chromosomes, the selection mechanism, the crossover and mutation operations, and the fitness function for evaluating chromosomes.
Chapter 6 presents the results of the experiments we performed on the Pima Indians Diabetes dataset and the FDA Adverse Event Reporting System (FAERS) dataset. The experiments consist of two parts: the first compares the execution time of our genetic-algorithm-based method with that of an exhaustive method, and the second evaluates the effectiveness of the proposed method.
Finally, we conclude the thesis and describe future work in Chapter 7.
Chapter 2
Background and Related Work
2.1 Missing Data
Missing data is one of the important issues in data cleansing. A data table contains missing data when some of its entries are empty.
Missing data can arise from poorly designed questionnaires, questions omitted by interviewers, or non-response by interviewees. Most importantly, a dataset with missing data may lead to biased analysis results. According to the distribution of the missing entries, Little and Rubin [8] classified missing data into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
Data missing completely at random refers to missing entries that are randomly distributed over any attribute of the dataset. For example, an online application form may be disconnected while users are filling it in.
Data missing at random refers to missing entries that are randomly distributed within some specific subsets. For example, people with less education may not be able to answer all the questions in the system. That is, this type of missing data occurs because of some other non-missing attributes in the dataset.
Data missing not at random refers to missing entries that are not randomly distributed in the dataset. For example, people with high income may prefer not to reveal their information on the attribute “Salary” or “Tax”. The occurrence of this type of missing data is related to the unrevealed values themselves, so it cannot be explained by the other non-missing attributes alone.
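The three mechanisms can be made concrete with a small simulation. The following is only an illustrative sketch: the function names, the attribute names "Education" and "Salary", and the use of None to mark a missing entry are assumptions introduced here, not part of the classification by Little and Rubin [8].

```python
import random

def apply_mcar(rows, attr, p, rng):
    """MCAR: every entry of `attr` is blanked with the same probability p."""
    return [dict(r, **{attr: None}) if rng.random() < p else dict(r) for r in rows]

def apply_mar(rows, attr, observed_attr, p_by_value, rng):
    """MAR: the chance of blanking `attr` depends on another, observed attribute."""
    out = []
    for r in rows:
        p = p_by_value.get(r[observed_attr], 0.0)
        out.append(dict(r, **{attr: None}) if rng.random() < p else dict(r))
    return out

def apply_mnar(rows, attr, threshold, p, rng):
    """MNAR: the chance of blanking `attr` depends on the hidden value itself."""
    out = []
    for r in rows:
        hide = r[attr] > threshold and rng.random() < p
        out.append(dict(r, **{attr: None}) if hide else dict(r))
    return out

# Hypothetical toy data: two respondents.
rows = [{"Education": "low", "Salary": 20}, {"Education": "high", "Salary": 90}]
rng = random.Random(0)
mar_rows = apply_mar(rows, "Salary", "Education", {"low": 1.0}, rng)
mnar_rows = apply_mnar(rows, "Salary", 50, 1.0, rng)
```

In the MAR call only the respondent with low education loses the "Salary" entry, whereas in the MNAR call it is the high earner who hides it, mirroring the two examples in the text.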
2.2 Disguised Missing Data
Disguised missing data is a special kind of missing data. Briefly, a disguised missing entry is in fact missing, but the entry recorded in the dataset is not null and does not reflect the truth. In other words, the entry is filled with a fake value. The following cases may give rise to disguised missing data.
Case 1: A poorly designed questionnaire. Traditional questionnaires usually offer very limited options for each question, so they may fail to cover all cases. For example, consider an American-born Chinese guest who booked a hotel in Taiwan.
He found that, on the hotel reservation form, the options for the field “City” listed only cities in Taiwan. As a result, the clerk suggested that he enter the city of the hotel instead.
Case 2: The user does not intend to provide a correct value. Most questionnaires ask for several pieces of sensitive information, such as birth date, salary, and tax.
Many users may not want to reveal the true data for such sensitive attributes, so they may provide incorrect values for these entries. These entries are not missing in the table, but the values do not reflect the truth.
Case 3: The user does not provide a value. In this case, the data is explicitly missing, and two situations can arise. First, the lack of a standard representation for missing data may lead to disguised missing data. Many conventions have been used to deal with missing data, such as filling the missing entry with 0 or coding it as “NS” (not specified) or “UNK” (unknown). The entry is then no longer missing, but the data still does not reflect the facts of the real world.
Second, an online system may have some default value for some attributes. The following example illustrates the situation.
Example 1: Consider a user filling in an online application form. The form may ask for private information such as the attributes “Marital Status”, “Gender”, “Height”, and “Weight”. Two choices exist for the attribute “Marital Status”: “single” or “married”. The system may have a default value, say “single” in this example. Many users do not want to reveal their true information or simply skip these attributes. As a result, the default value “single” is recorded on attribute “Marital Status”, which clearly is a disguise.
2.3 Genetic Algorithms
The genetic algorithm (GA) was first proposed by Holland in 1975 [5]. It is an approach for finding optimal or near-optimal solutions to problems. The process starts with a randomly generated population of chromosomes, each of which corresponds to a candidate solution. It then chooses parents and generates offspring by imitating the operations of crossover and mutation. All chromosomes are evaluated using a fitness function to determine their fitness values, which are used to decide whether each chromosome should be eliminated or retained. The better-performing chromosomes are preserved and the worse ones are discarded. The new population replaces the old one, and the process repeats until the termination condition is satisfied. The chromosome with the highest fitness value becomes the solution. A general process of genetic algorithms is illustrated in Figure 2.1.
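The generic process described above can be sketched in a few lines. This is a minimal illustration only, not the algorithm developed later in this thesis: the bit-string encoding, tournament selection, one-point crossover, fixed generation count as the termination condition, and all parameter names and default values are assumptions chosen for brevity.

```python
import random

def genetic_algorithm(fitness, length, pop_size=20, generations=50,
                      crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Toy GA over bit strings of `length` >= 2; returns the best chromosome seen."""
    rng = random.Random(seed)
    # Randomly generate the initial parent population.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=fitness)

    def select(pop):
        # Binary tournament: the fitter of two random chromosomes becomes a parent.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):  # assumed termination condition: fixed budget
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = select(population), select(population)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, length - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            offspring.append(child)
        population = offspring  # the new population replaces the old one
        best = max(population + [best], key=fitness)
    return best

# Hypothetical fitness: the number of 1-bits in the chromosome.
solution = genetic_algorithm(sum, length=12)
```

Keeping track of `best` across generations ensures that the returned chromosome is never worse than the best member of the initial population, even though each generation fully replaces the previous one.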
Figure 2.1 An illustration of genetic algorithms.
[Flowchart: Randomly generate the initial parent population. Calculate the degree of fitness for each chromosome. Is the termination condition satisfied? Yes: output. No: select parents, perform the crossover operation, perform the mutation operation, generate the new population, and evaluate the fitness again.]
2.4 Related Work
Disguised missing data was first defined by Pearson [13] in 2006, who analyzed the problems of both missing data and disguised missing data. In that study, Pearson described the sources of disguised missing data, illustrated its influence on simple statistics, hypothesis tests, correlations and regression models, and classification trees, and then discussed whether the affected records should be ignored. The disguise values are found semi-manually by looking for unusual values or patterns in the dataset.
Hua and Pei [6] proposed the first automatic approach for detecting disguised missing data in 2007, called the EUS heuristic, which is based on the concept of embedded unbiased sampling. This method finds the unbiased sample based on the correlation-based sample quality score (CBSQS) and outputs the suspected disguised missing values on each attribute. The method primarily aims at detecting the first type of disguised missing data, that is, data missing completely at random. Although it can also be applied to detecting the second type of disguised missing values, no mechanism has been developed to locate the data group to which the method best applies. In other words, users have to test all possible data groups with the method to find the most suspicious group.
In 2009, Belen modified the EUS heuristic, replacing CBSQS as the measure of an unbiased sample with a chi-square two-sample test. The chi-square two-sample test checks whether two samples come from the same distribution, without the need to specify what that common distribution is. This approach remedies the deficiency that data dependency may exist between pairs of attribute values. That is, it can also be applied to disguised missing data that is missing completely at random and to disguised missing data that is missing at random.
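For readers unfamiliar with the test, the sketch below computes the standard binned chi-square two-sample statistic for two histograms over the same bins. It is a generic textbook formulation and only an assumption about what such a test looks like; the source does not state the exact variant Belen used, and the function name and inputs are hypothetical. Converting the statistic into a p-value (via the chi-square distribution) is omitted.

```python
import math

def chi_square_two_sample(counts_r, counts_s):
    """Chi-square two-sample statistic for two histograms over the same bins.

    The scaling constants k1 and k2 compensate for unequal sample sizes.
    """
    if len(counts_r) != len(counts_s):
        raise ValueError("both samples must use the same bins")
    total_r, total_s = sum(counts_r), sum(counts_s)
    if total_r == 0 or total_s == 0:
        raise ValueError("both samples must be non-empty")
    k1 = math.sqrt(total_s / total_r)
    k2 = math.sqrt(total_r / total_s)
    stat = 0.0
    for r, s in zip(counts_r, counts_s):
        if r + s == 0:
            continue  # a bin that is empty in both samples contributes nothing
        stat += (k1 * r - k2 * s) ** 2 / (r + s)
    return stat

# Hypothetical counts of two attribute values in two samples.
same_shape = chi_square_two_sample([10, 20], [20, 40])
different = chi_square_two_sample([10, 20], [20, 10])
```

A statistic close to zero, as for the two proportional histograms, indicates that the samples are compatible with a common distribution; larger values indicate a discrepancy.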
Natarajan et al. proposed another approach for detecting disguised missing data in large datasets [12]. Their method detects and corrects disguised entries using a heuristic approach, partial domain knowledge, and univariate methods, relying on the association rules between attribute values. This approach, however, is not fully automatic.
In previous studies, the detection of disguised missing data has moved from manual to automatic approaches, but these usually focus on obtaining the suspected disguised missing values on each attribute. Although some of these approaches can be applied to find different types of suspected disguise values, they provide no mechanism to determine which subgroup of the dataset is most likely to hold the second type of disguised missing data, that is, data missing at random. Therefore, in this thesis we propose a method based on genetic algorithms to search for the group most prone to a specific disguise value.
Chapter 3
Embedded Unbiased Sample Based Detection of Disguise Value
3.1 Embedded Unbiased Sample Heuristic
Hua and Pei [6] proposed a heuristic method to detect suspicious disguised missing data. The method is based on the concept of embedded unbiased sampling. An unbiased sample is a subset that presents characteristics and a distribution similar to those of the original dataset. Before introducing the heuristic, we first present some definitions and assumptions made by Hua and Pei.
First, they assume that, on an attribute, there often exist only a small number of values that are frequently used as disguises. Those values are called the frequently used disguises. Second, they assume that the disguised tuples are randomly distributed in the whole dataset.
Let $T$ be the truth table and $\tilde{T}$ be the recorded table. $T_{A=v}$ is called the projected database of $v$, in which every tuple contains value $v$ on attribute $A$. For simplicity, we denote $T_{A=v}$ as $T_v$. The basic concept behind embedded unbiased sampling is best explained with an example.
Consider Example 1. Let $T_{single}$ be the projected database of “single” on attribute “Marital Status”. Conceptually, $T_{single}$ can be divided into two exclusive subsets $R_{single}$ and $S_{single}$, where $R_{single}$ contains all tuples that have value “single” on attribute “Marital Status” and are not missing in the truth table, while $S_{single}$ contains the tuples whose values on attribute “Marital Status” are disguised missing, with “single” used as the disguise value. Figure 3.1 shows the relationship between $T_{single}$, $R_{single}$, and $S_{single}$.
Figure 3.1 The EUS heuristic.
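The partition of the projected database can be written down directly when both tables are available. The sketch below is purely conceptual, since the truth table is unknown in practice. It assumes that both tables are lists of dictionaries aligned by position and that a missing entry in the truth table is represented by None; the function name is hypothetical.

```python
def partition_projected(truth, recorded, attr, value):
    """Return index lists (T_v, R_v, S_v) for `value` on attribute `attr`."""
    # T_v: tuples whose recorded entry equals the value.
    t_v = [i for i, row in enumerate(recorded) if row[attr] == value]
    # R_v: the value is genuine, i.e. it also appears in the truth table.
    r_v = [i for i in t_v if truth[i][attr] == value]
    # S_v: the entry is missing in the truth table, so the value is a disguise.
    s_v = [i for i in t_v if truth[i][attr] is None]
    return t_v, r_v, s_v

# Hypothetical tables: tuples 1 and 3 did not answer "Marital Status".
truth = [{"Marital Status": "single"}, {"Marital Status": None},
         {"Marital Status": "married"}, {"Marital Status": None}]
recorded = [{"Marital Status": "single"}, {"Marital Status": "single"},
            {"Marital Status": "married"}, {"Marital Status": "married"}]
t_single, r_single, s_single = partition_projected(
    truth, recorded, "Marital Status", "single")
```

Here tuple 0 is genuinely “single” and falls into $R_{single}$, while tuple 1 only carries “single” as a disguise and falls into $S_{single}$.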
Under the assumption that disguised tuples are randomly distributed in the whole dataset, if value “single” is frequently used as a disguise in these tuples, the subset $S_{single}$ will be an unbiased sample of the truth table except on the attribute “Marital Status”. Likewise, the subset $T_{married}$, which contains the tuples with “married” on attribute “Marital Status”, can also be divided into $R_{married}$ and $S_{married}$. If value “single” is used more frequently than “married” as a disguise value on attribute “Marital Status”, then $S_{single}$ from $\tilde{T}_{single}$ should be larger than $S_{married}$ from $\tilde{T}_{married}$.
Hua and Pei define the embedded unbiased sample heuristic (EUS heuristic for short) as follows: if $v$ is frequently used as a disguise value on attribute $A$, then there exists a large subset $S_v \subseteq \tilde{T}_{A=v}$ such that $S_v$ is an unbiased sample of $\tilde{T}$ except for attribute $A$.
According to the EUS heuristic, $S_v$ is an unbiased sample of $\tilde{T}$. The larger $S_v$ is, the more frequently $v$ is used as a disguise. If value $v$ is frequently used as a disguise, it is called a frequent disguise value.
Unfortunately, $S_v$ is unknown and hard to compute from $\tilde{T}$. To find frequently used disguises, the EUS heuristic suggests the following approach: on each attribute, find a small number of attribute values whose projected databases contain a large subset that is an unbiased sample of the whole data table. Those attribute values are suspects of frequently used disguise values.
The larger the unbiased sample subset, the more likely the value is a disguise value. We therefore need to find the maximal embedded unbiased sample $M_v$, called MEUS for short. The relationship between $T_v$, $M_v$, and $S_v$ is shown in Figure 3.2.
Figure 3.2 The relationship between Tv, Mv, and Sv [6] .
3.2 CBSQS: Measurement of Unbiased Sample
This raises an important technical challenge: how can we measure whether a subset is an unbiased sample? The table in question has multiple attributes, and measuring whether two multidimensional datasets have a similar distribution is a complex problem.
Observing that correlation can usually capture the distribution of a dataset nicely, Hua and Pei propose a correlation-based approach to measure whether $\tilde{T}'$ is a good sample of $\tilde{T}$. The idea is that if the values correlated in $\tilde{T}$ are also correlated in $\tilde{T}'$ and, vice versa, the values correlated in $\tilde{T}'$ are also correlated in $\tilde{T}$, then $\tilde{T}$ and $\tilde{T}'$ have similar distributions.
Because computing the correlation over all possible combinations of values is too costly, they compute only the correlation between pairs of values $v_i$ and $v_j$. Based on this pairwise correlation, the similarity between $\tilde{T}'$ and $\tilde{T}$ is measured by the correlation-based sample quality score, CBSQS for short, in which a parameter $q$ is used to imitate the Minkowski distances [14]. Note that the score obtained by CBSQS is a non-negative number. The higher the score of a subset $\tilde{T}'$, the better $\tilde{T}'$ is as an unbiased sample of $\tilde{T}$.
Now the kernel step in fulfilling the EUS heuristic is to find the maximal embedded unbiased sample $M_v$ corresponding to a value $v$ of attribute $A$. For this purpose, Hua and Pei [6] search for the subset maximizing the DV-score. The EUS framework consists of two phases: the first is the mining phase, which finds candidates of frequently used disguises, and the second is the postprocessing phase, during which the results from phase one are forwarded to domain experts or other data cleaning algorithms for validation.
Phase 1: Mining candidates of frequent disguise values
Input: A table T and a threshold k on the number of candidates
Output: k candidates of frequent disguises on each attribute
Method:
1. for each attribute A do
2.     // applicability test
       check whether the projected databases of most (frequent) values on A are unbiased samples of T; if so, break;
3.     for each value v on A do derive Mv;
4.     find the top k value(s) with the best and largest Mv's;
   end for
Phase 2: Postprocessing: verify the candidates of frequent disguise values
In general, deriving $M_v$ in step 3 of phase one is costly when the database $\tilde{T}$ is large. Thus, a greedy method is adopted in [6] for deriving $M_v$. The basic idea is depicted in Figure 3.4.
Figure 3.4 An illustration of the greedy approach.
[Flowchart: Step 1: From table $\tilde{T}$ and value $v$ on attribute $A$, obtain the projected database $\tilde{T}_v$ of value $v$, consisting of tuples $t_0, t_1, \ldots, t_n$. Step 2: Compute the DV-score gain, with respect to the original dataset, of every subset of $\tilde{T}_v$ obtained by removing one tuple ($t_1, t_2, \ldots, t_n$; $t_0, t_2, \ldots, t_n$; $t_0, t_1, t_3, \ldots, t_n$; ...; $t_0, t_1, \ldots, t_{n-1}$). Step 3: Preserve the subtable whose DV-score gain is positive and the largest. Is the largest DV-score gain positive? Yes: continue. No: terminate and output the final subset $M_v$.]
The projected database of $v$ is used as the initial sample. In each iteration, every tuple in the sample is tentatively removed from the current sample, and the DV-score gain after removing that tuple is calculated. The subset with the largest positive DV-score gain replaces the current sample. The iteration continues and terminates when the DV-score cannot be improved any further. The sample at the end is output as the approximate $M_v$.
The greedy approach generates an approximate MEUS for every value $v$ on attribute $A$. The EUS algorithm then only has to compare the sizes of the $M_v$'s to find the top $k$ candidate values. These candidates are then verified by domain experts or other algorithms, as in the second phase shown in Figure 3.3.
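The greedy removal loop can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring function is passed in as a parameter named `score` because the DV-score formula is not reproduced here, the function name `greedy_meus` is hypothetical, and the toy score used in the example is not the DV-score of [6].

```python
def greedy_meus(projected, score):
    """Greedily drop one tuple at a time while the score gain stays positive."""
    sample = list(projected)  # the projected database is the initial sample
    current = score(sample)
    while len(sample) > 1:
        best_gain, best_subset = 0.0, None
        for i in range(len(sample)):
            candidate = sample[:i] + sample[i + 1:]  # sample without tuple i
            gain = score(candidate) - current
            if gain > best_gain:  # keep only strictly positive, largest gain
                best_gain, best_subset = gain, candidate
        if best_subset is None:  # no removal improves the score: terminate
            break
        sample = best_subset
        current = score(sample)
    return sample

def toy_score(sample):
    # Hypothetical stand-in score: the closer the mean is to 5, the better.
    return -abs(sum(sample) / len(sample) - 5)

approx_mv = greedy_meus([5, 5, 5, 50], toy_score)
```

With the toy score, the outlier 50 is removed in the first iteration and no further removal yields a positive gain, so the loop stops with the three remaining tuples. Each iteration evaluates the score once per remaining tuple, which is why the quality of the approximation depends on the cost of the scoring function.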
Chapter 4
Problem Description
4.1 Preliminary
According to the study by Little and Rubin [8], missing data can be classified into three types according to their distribution in the dataset: missing completely at random, missing at random, and missing not at random.
As described in Section 2.2, disguised missing data is a special kind of missing data; therefore, disguise values can also be divided into three types. In our study, we focus on disguised missing data that is missing at random. That is, a disguise value is randomly distributed within a specific subset of the whole database. For example, when customers fill in an application form on the Internet, they may not want to reveal private information such as birth date, age, or country. A man whose birth date is February 29th, for example, after entering “February” for “Month”, may not want to disclose his true birth date. So he keeps the default value, say “1”, for “Day”.
Similarly, other customers born in February may also choose “February 1st” as a disguise. As a result, “February 1st” becomes a disguise value on the subset containing “February” on attribute “Month”, though it is usually not a disguise value on the whole dataset. This is a typical scenario of disguised data missing at random.
To our knowledge, all previous work on detecting disguised missing data focuses on the first type, i.e., missing completely at random; no study has been devoted to finding the data group most prone to a specific disguise value. In the following section, we formally define the problem; the solution is presented in Chapter 5.
4.2 Formal Definition
Following the notation used in Chapter 3, let $\tilde{T}$ denote the recorded table of $T$ with attributes $A = \{A_1, A_2, \ldots, A_n\}$, and let $Dom(A_i)$ be the set of values for attribute $A_i$, $1 \le i \le n$. Given a suspected disguise value $v$, with $v \in Dom(D)$ and $D \in A$, we would like to discover, if $v$ is indeed a disguise value, the data group of $\tilde{T}$ that is most prone to using $v$ as a disguise value. To facilitate the discussion, we first formalize the term data group.
Definition 4.1. A data group $(G_1 = g_1, G_2 = g_2, \ldots, G_p = g_p)$ defined on an attribute subset $\{G_1, G_2, \ldots, G_p\} \subseteq A$ identifies the projection of $\tilde{T}$ on $G_1 = g_1, G_2 = g_2, \ldots, G_p = g_p$. That is, the tuples in group $(G_1 = g_1, G_2 = g_2, \ldots, G_p = g_p)$ all have the same values on attributes $G_1, G_2, \ldots, G_p$. Hereafter, when it is clear from the context, we use $(g_1, g_2, \ldots, g_p)$ instead of $(G_1 = g_1, G_2 = g_2, \ldots, G_p = g_p)$.
Now let $G$ denote the set of all data groups induced by the attribute set $A - \{D\}$.
Note that we need at least one attribute other than the grouping attributes and the disguise attribute $D$ to perform the EUS procedure. It is noteworthy that the empty group means that no projection is performed on the original table $\tilde{T}$, so this case corresponds to the discovery of the maximal embedded unbiased sample on $\tilde{T}_v$. In this context, the problem discussed in [6] can be regarded as a special case of our problem.
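Definition 4.1 and the set of candidate groups can be illustrated with a short sketch. It assumes that the table is a list of dictionaries, that a data group is a dictionary mapping grouping attributes to values (the empty dictionary being the empty group), and that None never occurs as a genuine attribute value; the function names are hypothetical. For simplicity, the sketch does not enforce the requirement that at least one attribute be left outside the grouping attributes and $D$.

```python
from itertools import product

def enumerate_groups(domains):
    """Yield every data group over the given attribute domains.

    Each attribute either contributes one of its values or is left out,
    so the empty dict (no projection at all) is among the results.
    """
    attrs = list(domains)
    choices = [[None] + list(domains[a]) for a in attrs]  # None = not involved
    for combo in product(*choices):
        yield {a: v for a, v in zip(attrs, combo) if v is not None}

def project(table, group):
    """Return the tuples of `table` that agree with `group` on its attributes."""
    return [row for row in table if all(row[a] == v for a, v in group.items())]

# Hypothetical recorded table; "Day" plays the role of the disguise attribute D.
table = [
    {"Month": "February", "Country": "Taiwan", "Day": 1},
    {"Month": "February", "Country": "Taiwan", "Day": 14},
    {"Month": "January", "Country": "Taiwan", "Day": 1},
]
domains = {"Month": ["January", "February"], "Country": ["Taiwan"]}
groups = list(enumerate_groups(domains))
february = project(table, {"Month": "February"})
```

Even in this tiny example the two grouping attributes already induce six candidate groups, which hints at how quickly the search space grows with the number of attributes and the sizes of their domains.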
Based on the concept of the maximal embedded unbiased sample, we can formalize the problem of detecting the data group most prone to the specific disguise value $v$ as finding the best data group $g^*$ in $G$ that maximizes Eq. (3.5). Since the search for the maximal embedded unbiased sample is performed on the projection of $\tilde{T}$ on $v$ associated with group $g$, i.e., $\tilde{T}_{v,g}$, instead of $\tilde{T}_v$, the problem is now formalized as
This value, however, is proportional to the cardinality of the table of concern: the larger the cardinality (number of value pairs) of table $\tilde{T}$, the larger this value is. In order not to favor larger projections of $\tilde{T}$, the score is normalized. Similarly, we introduce the normalized DV-score of $v$ in $\tilde{T}$, denoted as $ndv(v, \tilde{T})$, and the corresponding normalized DV-score $ndv(v, \tilde{T}_g)$ for a data group $g$.
The complexity of finding $g^*$ is immense. Let $m_i$ be the cardinality of attribute $A_i$ in $\tilde{T}$, $1 \le i \le n$. Without loss of generality, we choose $A_1$ as the suspected attribute $D$.
Each attribute $A_j$, $2 \le j \le n$, can either take one of $m_j$ different values if it is involved in forming the data group or take the empty value if it is not, leading to at