It can help companies and researchers make public use of information without the hassle of obtaining real data. In this research, we concentrate our efforts on generating realistic synthetic data that is statistically sound and good enough for application purposes, e.g., machine learning tasks.

In this study, we perform data synthesis in two different domains: electronic health records (EHRs) in healthcare, and crime data from the Los Angeles Police Department.

Patients’ electronic health records (EHRs) contribute considerably to the medical industry and to research on topics such as developing medical software, developing new drugs, investigating diseases, and inventing cures and preventive measures for advancing medical informatics and healthcare. However, EHR data often consist of highly sensitive or regulated medical information about patients. In general, patients are not comfortable disclosing their personal data. Owing to the legal, privacy, and security concerns surrounding medical data and the limited access to those data, the healthcare sector lags behind other sectors in terms of employing information technology, data exchange, and interoperability [10]. Hence, the primary focus of this research is synthesizing realistic EHR data. Crime data also contain a great deal of confidential information about the crime incidents committed in a certain region or city. They are therefore also a good candidate for data synthesis, and for demonstrating that the proposed method is generic and can be applied to a wide range of applications. The idea behind the proposed method is discussed later, just after the related works of this study.

1.2 Related Works

In this section, we start with a discussion of the history of synthetic data generation. This is followed by a literature survey of recent works on generating synthetic data in the healthcare domain. To our knowledge, there has not been any study on synthesizing crime data.

1.2.1 History of synthetic data generation

The history of synthetic data generation (SDG) begins in 1993, when DB Rubin proposed the idea of creating synthetic micro-data sets for public use based on the concepts of multiple imputation [3]. This has the advantage of completely protecting individual confidentiality, as well as providing users with access to data wherever they wish, but it imposes substantial data-producer costs and has been resisted by the user community because of data quality concerns, as pointed out by Abowd et al. in [11]. In the same year, Little [12], in a general discussion of the analysis of masked data, presented the possibility of simulating only the variables that are potential identifiers; he used a likelihood-based method to synthesize the sensitive values on the public-use file. In 1994, Gray et al. [13] proposed quickly generating large synthetic databases using parallel algorithms and execution. In 1998, Fienberg et al. [14] refined the idea described earlier in [15–17] and used sample cumulative distribution functions and bootstrapping to construct synthetic categorical data. Later, other important contributors to the development of synthetic data generation were TE Raghunathan, JP Reiter, and DB Rubin; in 2003, they jointly presented a solution for treating partially synthetic data with missing data, along with the technique of Sequential Regression Multivariate Imputation [18]. In 2006, Pei and Zaïane presented a distribution-based and transformation-based approach to synthetic data generation for clustering and outlier analysis [19]. They were able to systematically produce testing datasets based on users’ requirements, such as the number of points, the number of clusters, the size, shapes, and locations of clusters, and the density level of either the cluster data or the noise/outliers in a dataset. Houkjær et al. developed a generic, DBMS-independent, and highly extensible relational data generation tool [20]; the tool used a graph model to generate realistic test data for OLTP, OLAP, and data-streaming applications. In 2009, Christen and Pudjijono presented a data generator [21] that allows flexible creation of synthetic data containing personal information with realistic characteristics. In 2011, Bozkurt and Harman proposed a novel automated solution [22] for generating test data from web services; in their experimental analysis, their prototype tool achieved success rates between 93% and 100% in generating realistic data using service compositions.
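To make the resampling idea concrete, below is a minimal sketch in Python of synthesizing categorical data by drawing from the sample distribution of each column. It is an illustration under simplifying assumptions (hypothetical column names and values, columns treated independently), not a reconstruction of the exact procedure in [14].

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical "real" categorical data (placeholder fields, not actual EHR data).
real = pd.DataFrame({
    "blood_type": rng.choice(["A", "B", "AB", "O"], size=1000,
                             p=[0.40, 0.10, 0.05, 0.45]),
    "smoker": rng.choice(["yes", "no"], size=1000, p=[0.2, 0.8]),
})

def synthesize_column(col: pd.Series, n: int) -> np.ndarray:
    """Draw n values from the empirical (sample) distribution of one column."""
    freqs = col.value_counts(normalize=True)  # estimated category probabilities
    return rng.choice(freqs.index.to_numpy(), size=n, p=freqs.to_numpy())

# Bootstrap each column independently: the marginal distributions are
# preserved, but cross-column correlations are not.
synthetic = pd.DataFrame({c: synthesize_column(real[c], 1000) for c in real})
print(synthetic.head())
```

Note that resampling each column independently preserves only the marginals; approaches such as [14] must additionally account for the joint structure among categories.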

1.2.2 Recent works in healthcare

Synthetic data generation (SDG) has recently attracted the interest of both academia and industry. Since our main research focus is synthesizing EHR data in healthcare, we survey recent works in this domain here. Some notable works on SDG across a wide range of healthcare domains can be found in [10, 23–26]. However, many of these methods are disease-specific, not sufficiently realistic, operate on only a few variables of the EHR data, or still raise privacy concerns. For example, EMERGE, an early innovative method developed by Lombardo and Moniz [23] and later improved by Buczak et al. [24], generates synthetic EHR data for an outbreak illness of interest (tularemia) and was potentially susceptible to re-identification.

McLachlan et al. developed an approach [25] that generates synthetic EHRs using a CareMap based on health incidence statistics (HIS) and clinical practice guidelines (CPGs). The main problem with this approach is that it does not use any real EHR data, and hence further experiments are needed to guarantee realistic properties. Park et al. conducted a solid study [26] related to our research, but their method can handle only a few dimensions of binary data.

Very recently, an excellent SDG framework named Synthea [10] has been developed to provide risk-free EHR data suited to industrial, research, and educational uses, but it has not yet been validated on diverse diseases and treatment modules. McLachlan [27] also performed a comprehensive domain analysis and validation of different SDG approaches.

However, generating realistic synthetic EHR data remains a challenging problem. In addition to preserving the statistical features of the real data, synthetic data should also demonstrate their functionality for the relevant applications. For instance, as Choi et al. investigated in [28], in practice the resulting synthetic EHR data are often not sufficiently realistic for machine learning tasks.
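One common way to check this kind of functional utility is a "train on synthetic, test on real" comparison: fit the same model once on real training data and once on synthetic data, then evaluate both on held-out real records. The sketch below illustrates this idea with hypothetical placeholder data and a toy label; it is not the evaluation protocol used in [28].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=1)

# Hypothetical binary EHR-style matrices (rows: patients, cols: medical codes),
# with a toy label derived from the first few columns.
X_real = (rng.random((2000, 50)) < 0.1).astype(float)
y_real = (X_real[:, :5].sum(axis=1) > 0).astype(int)
X_synth = (rng.random((2000, 50)) < 0.1).astype(float)
y_synth = (X_synth[:, :5].sum(axis=1) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.3, random_state=0)

# Train once on real data, once on synthetic data; evaluate both on real data.
model_real = LogisticRegression(max_iter=1000).fit(X_train, y_train)
model_synth = LogisticRegression(max_iter=1000).fit(X_synth, y_synth)
auc_real = roc_auc_score(y_test, model_real.predict_proba(X_test)[:, 1])
auc_synth = roc_auc_score(y_test, model_synth.predict_proba(X_test)[:, 1])

print(f"train-on-real AUC:      {auc_real:.3f}")
print(f"train-on-synthetic AUC: {auc_synth:.3f}")
```

If the synthetic data are realistic for the task, the two scores should be close; a large gap indicates that the synthetic data have lost task-relevant structure.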
