Realistic Data Synthesis Using Enhanced Generative Adversarial Networks - NCCU Academic Hub


Department of Computer Science, National Chengchi University (國立政治大學資訊科學系)
Doctoral Thesis

Realistic Data Synthesis Using Enhanced Generative Adversarial Networks
(以進階生成對抗網路合成擬真資料)

Student: Mrinal Kanti Baowaly (包諾克)
Advisors: Dr. Sheng-Wei Chen (陳昇瑋), Dr. Chao-Lin Liu (劉昭麟)

May 2019
DOI:10.6814/DIS.NCCU.TIGP.002.2019.B02

A thesis submitted to the Department of Computer Science, National Chengchi University, in partial fulfillment of the requirements for the degree of Ph.D. in Social Networks and Human-Centered Computing (SNHCC).


I dedicate this thesis to my loving grandmother, the late Puspa Rani Sarkar, who raised me and contributed the most in my life to reach this stage. The gracious lady, of a blessed soul, passed away during my Ph.D. study in Taiwan...


Declaration

This work was performed under the supervision of Prof. Sheng-Wei Chen and Prof. Chao-Lin Liu. I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other university.

An article with a significant part of the content used in this dissertation has been published in the Journal of the American Medical Informatics Association (JAMIA) [1]. Based on the proposed work in this study, another paper has been accepted for presentation at the Doctoral Consortium of the IEEE AIKE 2019 Conference [2]. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the dissertation.

Mrinal Kanti Baowaly
May 2019

Acknowledgements

First of all, I would like to thank the Almighty God for giving me the strength, courage, and patience to perform this research. I would like to express my heartfelt gratitude to my advisor Prof. Sheng-Wei Chen for admitting me to his research lab and for providing the opportunity to conduct this research under his supervision. Prof. Chen believed in me and provided an outstanding work environment, which ultimately led me to successfully complete this study. I am grateful for his generous financial support for my academic and research activities as well as my living expenses in Taiwan. I would also like to express my sincere gratitude to Prof. Chao-Lin Liu for co-advising this research. During my time in Taiwan, I have been very fortunate to have both professors as my advisors, and I acknowledge their constant guidance, motivation, and advice with the greatest appreciation.

I would also like to thank the Taiwan International Graduate Program for providing me with the Ph.D. fellowship in Social Networks and Human-Centered Computing (SNHCC) and for supporting my research. I am also grateful to the SNHCC program and the Institute of Information Science (IIS) at Academia Sinica, Taiwan. I offer my gratitude and thanks to National Chengchi University, Taiwan, and its Department of Computer Science for their continuous guidance and assistance during my study there. Special thanks to Dr. Mark Liao for reviewing my dissertation carefully and helping me to improve it. I also thank all the faculty members, researchers, and staff at IIS, Academia Sinica, for their ceaseless encouragement, guidance, and concern. I am thankful to my labmates from the Data Insights Research Lab and to my SNHCC friends for all their support and the wonderful moments that we experienced together in Taiwan. My sincere thanks to Ms. Chia-Chien of the SNHCC program, as she was always in touch during this Ph.D.

I am forever thankful to my beloved wife, Lipika Sarder, for the sacrifices she made and the inspiration and support she gave so that I could complete this degree and return to my country. I would like to offer my immense gratitude and respect to my parents, other family members, and relatives for their eternal love, encouragement, and prayers. I also thank my teachers, colleagues, students, and friends, who were always there with good wishes and cheered me up throughout this whole journey.

Abstract

There are many situations in which real data are not available or are too expensive to obtain in terms of both time and money, because those data may raise privacy and confidentiality concerns. In these situations, synthetic data are a good alternative. The primary objective of this study is to generate realistic synthetic electronic health records (EHRs) so that people can use them freely to advance research in healthcare or related fields. We propose two synthetic data generation models, designated medical Wasserstein GAN with gradient penalty (medWGAN) and medical boundary-seeking GAN (medBGAN), and compare their performance with an existing method, medical GAN (medGAN). The proposed models are based on two enhanced variants of generative adversarial networks (GANs), namely Wasserstein GAN with gradient penalty (WGAN-GP) and boundary-seeking GAN (BGAN). We perform data synthesis on three aggregated EHR datasets with discrete features (e.g., binary and count) in the medical domain: MIMIC-III, extended MIMIC-III, and the National Health Insurance Research Database (NHIRD), Taiwan. First, we train the models and generate synthetic EHR data using the trained models. We then analyze and compare the models' performance by applying statistical methods (dimension-wise average and the Kolmogorov–Smirnov test) and two machine learning tasks (association rule mining and prediction). The comprehensive analysis shows that the proposed models are more effective than medGAN in generating realistic synthetic EHR data.

Our models can be applied to generate realistic synthetic data beyond the medical domain. To demonstrate the generality of our models, we also investigate an aggregated crime dataset from the City of Los Angeles Police Department, which confirms our models' capability to work in a wide range of applications. We show that the proposed models are suitable for producing high-quality synthetic data with discrete features that are statistically sound and good enough for machine learning tasks. We believe the proposed models will be useful in industry and research for providing better services in generating realistic synthetic data. This study will help to eliminate barriers, including limited access to confidential data, and thus accelerate the development of medical informatics, healthcare, and related fields.

Keywords: Electronic Health Records, Synthetic Data Generation, Data Synthesis, Generative Adversarial Networks, Wasserstein GANs with Gradient Penalty, Boundary-seeking GANs

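摘要 (Abstract, translated from the Chinese)

In many situations real data cannot be obtained, or are too expensive in terms of both time and money, because the data may raise privacy and confidentiality concerns. In such situations, using synthetic data is a viable option. The main purpose of this study is to generate near-real synthetic electronic health records (EHRs) that people can use freely for research in healthcare or related fields. We propose two synthetic data generation models, called the medical Wasserstein GAN with gradient penalty (medWGAN) and the medical boundary-seeking GAN (medBGAN), and compare their performance with the existing medical GAN (medGAN). The proposed models are based on two enhanced variants of generative adversarial networks (GANs), namely the Wasserstein GAN with gradient penalty (WGAN-GP) and the boundary-seeking GAN (BGAN). We perform data synthesis on three aggregated EHR datasets with discrete features (e.g., binary and count) in the medical domain: MIMIC-III, extended MIMIC-III, and the National Health Insurance Research Database (NHIRD) of Taiwan. First, we train the models and generate synthetic EHR data. We then apply statistical methods (dimension-wise average and the Kolmogorov–Smirnov test) and two machine learning tasks (association rule mining and prediction) to analyze and compare the models' performance. The overall analysis shows that the proposed models are more effective than medGAN at generating near-real synthetic EHR data.

Our models can be used to generate any near-real synthetic data and are not limited to the medical domain. To demonstrate their generality, we also study, outside the medical domain, an aggregated crime dataset from the Los Angeles Police Department, which further confirms the capability of the proposed models across a wide range of applications. We demonstrate that the proposed models can generate high-quality synthetic data with discrete features that are statistically sound and adequate for machine learning tasks. We believe that, from the viewpoint of providing better services for generating near-real synthetic data, the proposed models will be useful in industry and academic research. This study will help remove barriers such as restricted access to confidential data, and thereby accelerate the development of medical informatics, healthcare, and related fields.

Keywords: electronic health records, synthetic data generation, data synthesis, generative adversarial networks, Wasserstein GAN with gradient penalty, boundary-seeking GAN.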

Table of contents

List of figures
List of tables

1 Introduction
  1.1 Background and Motivation
  1.2 Related Works
    1.2.1 History of synthetic data generation
    1.2.2 Recent works in healthcare
  1.3 Developing Idea of the Proposed Method
  1.4 Objective and Contribution of this Research
  1.5 Dissertation Organization

2 Materials and Methods
  2.1 Overview
  2.2 Synthetic Data and Its Applications
  2.3 The Importance of Synthetic Data
  2.4 AI-based Synthetic Data Generation
    2.4.1 Generative adversarial networks (GANs)
    2.4.2 Wasserstein GAN with gradient penalty (WGAN-GP)
    2.4.3 Boundary-Seeking GAN (BGAN)
  2.5 Medical GAN (medGAN)
  2.6 The Proposed Models

3 Synthesizing Electronic Health Records
  3.1 Data Collection, Processing and Analysis
    3.1.1 Data collection
    3.1.2 Convert to aggregated (count) data
    3.1.3 Convert to binary data
    3.1.4 Statistics of datasets
  3.2 Experiments
    3.2.1 Experimental setup
    3.2.2 Training the models
    3.2.3 Methods for evaluating synthetic data
  3.3 Evaluation Results on Synthetic Data
    3.3.1 Dimension-wise average for binary data
    3.3.2 Dimension-wise average for count data
    3.3.3 K–S test results
    3.3.4 Association rule mining
    3.3.5 Dimension-wise prediction performance
  3.4 Discussion
    3.4.1 Summary of the results
    3.4.2 MIMIC-III versus extended MIMIC-III
    3.4.3 medWGAN versus medBGAN
    3.4.4 Privacy consideration

4 Synthesizing Crime Data
  4.1 Data Collection, Processing and Analysis
    4.1.1 Data collection
    4.1.2 Data processing
    4.1.3 Statistics of the Crime Dataset
  4.2 Experiments on Crime Data
    4.2.1 Experimental setup
    4.2.2 Training the models
    4.2.3 Methods for evaluating synthetic data
  4.3 Evaluation Results on Synthetic Crime Data
    4.3.1 Dimension-wise average for binary data
    4.3.2 Dimension-wise average for count data
    4.3.3 K–S test results
    4.3.4 Association rule mining
    4.3.5 Dimension-wise prediction performance
  4.4 Discussion
    4.4.1 Summary of the results
    4.4.2 medWGAN versus medBGAN

5 Concluding Remarks
  5.1 Limitations and Future Works
  5.2 Conclusion
  5.3 Funding
  5.4 Competing Interests
  5.5 Contributors

References

Appendix A How to install the models to generate synthetic data


List of figures

2.1 The conceptual idea of GAN architecture.
2.2 Gradient penalty in WGANs does not exhibit undesired behavior like weight clipping.
2.3 Qualitative comparison between the conventional GAN and the proposed BGAN in 1-D examples.
2.4 Original GAN and the baseline medGAN architecture.
2.5 Comparative design components of the three generative models.
3.1 ECDFs of ICD codes and patients for MIMIC-III, extended MIMIC-III and NHIRD datasets.
3.2 Scatterplots of dimension-wise average count results on real binary data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
3.3 Scatterplots of dimension-wise average count results on real count data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
3.4 Scatterplots of dimension-wise prediction results (F1-scores) of logistic regression model trained on real binary data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
3.5 Scatterplots of dimension-wise prediction results (F1-scores) of random forests model trained on real binary data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
3.6 Scatterplots of dimension-wise prediction results (F1-scores) of SVM model trained on real binary data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
3.7 Scatterplots of dimension-wise prediction results (F1-scores) of logistic regression model trained on real count data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
3.8 Scatterplots of dimension-wise prediction results (F1-scores) of random forests model trained on real count data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
3.9 Scatterplots of dimension-wise prediction results (F1-scores) of SVM model trained on real count data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
3.10 a,b: Sensitivity and precision when varying the number of known attributes. c,d: Sensitivity and precision when varying the size of the synthetic dataset.
4.1 Total crimes reported by year.
4.2 Scatterplots of dimension-wise average count results on real binary data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
4.3 Scatterplots of dimension-wise average count results on real count data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
4.4 Scatterplots of dimension-wise prediction results (F1-scores) of logistic regression model trained on real data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
4.5 Scatterplots of dimension-wise prediction results (F1-scores) of random forests model trained on real data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
4.6 Scatterplots of dimension-wise prediction results (F1-scores) of SVM model trained on real data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.


List of tables

3.1 A portion of sample count dataset
3.2 A portion of sample binary dataset
3.3 Basic statistics of datasets
3.4 Top frequent ICD codes of MIMIC-III
3.5 Top frequent ICD codes of NHIRD, Taiwan
3.6 Top patients' data of MIMIC-III and NHIRD datasets
3.7 Experimental settings
3.8 Dimension-wise average for binary data
3.9 Dimension-wise average for count data
3.10 K–S test results
3.11 Association rule mining results
3.12 Prediction performances of the three generative models
3.13 Summary of prediction performances
3.14 Results summary
3.15 All-zero dimensions
4.1 A portion of sample crime count dataset
4.2 A portion of sample crime binary dataset
4.3 Basic statistics of the crime dataset
4.4 Top frequent crime codes
4.5 Top crime locations' data of crime dataset
4.6 Experimental settings for crime dataset
4.7 Dimension-wise average for crime data
4.8 K–S test results for crime data
4.9 Association rule mining results for crime data
4.10 Prediction performances for crime data
4.11 Summary of the prediction performances for crime data
4.12 Results summary for crime data
4.13 All-zero dimensions in crime data


Chapter 1
Introduction

1.1 Background and Motivation

Data are the most fundamental and vital aspect of any research study. Whatever the methods and techniques, the use of datasets is a common requirement for every research project. But real data are not always freely available, or they are expensive. The main reason is that data may contain individuals' sensitive or private information, and there may be risks of violating privacy and confidentiality [3]. When real data are not available or access is limited, organizations usually generate anonymized data by using de-identification methods [4]. Anonymized data may enable the wider use of confidential data. However, de-identification techniques such as k-anonymity, l-diversity, and t-closeness used to create anonymized data are not robust against re-identification attacks [5, 6]. Multiple occurrences of re-identification of such anonymized records have already been witnessed and publicized [5, 7–9]. To circumvent this challenge, an alternative is to generate synthetic data. An advantage of synthetic data is that they are artificially created, and hence there is no explicit mapping of individuals between real and synthetic data. For this reason, unlike anonymized data, synthetic data stay resistant to re-identification and therefore protect privacy and confidentiality. If synthetic data can carry the attributes of real data well,

they should help companies and researchers make public use of information without the hassle of obtaining real data. In this research, we concentrate our effort on generating realistic synthetic data that are statistically sound and good enough for application purposes, e.g., machine learning tasks. In this study, we perform data synthesis in two different domains: electronic health records (EHRs) in healthcare and crime data from the City of Los Angeles Police Department.

Patients' EHRs contribute considerably to the medical industry and to research on topics such as developing medical software, developing new drugs, investigating diseases, and inventing cures and preventive measures for advancing medical informatics and healthcare. However, EHR data often consist of highly sensitive or regulated medical information about patients. In general, patients are not comfortable disclosing their personal data. Owing to the legal, privacy, and security concerns surrounding medical data and the limited access to those data, the healthcare sector lags behind other sectors in terms of employing information technology, data exchange, and interoperability [10]. Hence, the primary focus of this research is to synthesize realistic EHR data. Crime data also contain a lot of confidential information about the crime incidents committed in a certain region or city. Therefore, they are also a good choice for data synthesis, as well as for ensuring that the proposed method is generic and can be applied to a wide range of applications. The idea behind the proposed method is discussed in a later section, just after the related works of this study.

1.2 Related Works

In this section, we start with a discussion of the history of synthetic data generation. This is followed by a literature survey of recent works on generating synthetic data in the healthcare domain. To our knowledge, there has not been any previous study on synthesizing crime data; this study is the first to introduce it.

1.2.1 History of synthetic data generation

The history of synthetic data generation (SDG) starts in 1993, when D. B. Rubin proposed the idea of creating synthetic micro-data sets for public use based on the concepts of multiple imputation [3]. This has the advantage of completely protecting individual confidentiality, as well as providing users with access to data wherever they wish, but it imposes substantial costs on data producers and has been resisted by the user community because of data quality concerns, as pointed out by Abowd et al. [11]. In the same year, Little [12], in a general discussion of the analysis of masked data, presented the possibility of simulating only variables that are potential identifiers. Little used a likelihood-based method to synthesize the sensitive values on the public use file. In 1994, Gray et al. [13] came up with the idea of quickly generating large synthetic databases by using parallel algorithms and execution. In 1998, Fienberg et al. [14] refined the idea they had described earlier [15–17] and used sample cumulative distribution functions and bootstrapping to construct synthetic, categorical data. Later, other important contributors to the development of synthetic data generation were T. E. Raghunathan, J. P. Reiter, and D. B. Rubin. Collectively, in 2003, they came up with a solution for how to treat partially synthetic data with missing data and the technique of Sequential Regression Multivariate Imputation [18]. In 2006, Pei and Zaïane presented a distribution-based and transformation-based approach [19] to synthetic data generation for clustering and outlier analysis. They were able to systematically produce testing datasets based on the user's requirements, such as the number of points, the number of clusters, the size, shapes, and locations of clusters, and the density level of either cluster data or noise/outliers in a dataset. Houkjær et al. developed a generic, DBMS-independent, and highly extensible relational data generation tool [20]. The tool used a graph model to generate realistic test data for OLTP, OLAP, and data streaming applications. In 2009, Christen and Pudjijono developed a data generator [21] that allows flexible creation of synthetic data containing personal information with realistic characteristics. In 2011, Bozkurt and Harman

proposed a novel automated solution [22] for generating test data from web services. In the experimental analysis, their prototype tool achieved between 93% and 100% success rates in generating realistic data using service compositions.

1.2.2 Recent works in healthcare

SDG has recently attracted the interest of both academia and industry. Since our main research focus is to synthesize EHR data in healthcare, we survey the recent works in this domain here. Some notable works on SDG across a wide range of healthcare domains can be found in [10, 23–26]. However, many such methods are disease-specific, not realistic, work on only a few variables of EHR data, or still raise privacy concerns. For example, an early innovative method, EMERGE, developed by Lombardo and Moniz [23] and later improved by Buczak et al. [24], generates synthetic EHR data for an outbreak illness of interest (tularemia) and was potentially susceptible to re-identification. McLachlan et al. developed an approach [25] that uses a CareMap based on health incidence statistics (HIS) and clinical practice guidelines (CPG) to generate synthetic EHRs. The main problem with this approach is that it did not use any real EHR data and hence needs further experiments to guarantee realistic properties. Park et al. conducted work [26] related to our research, but it can handle only a few dimensions of binary data. Very recently, an SDG framework named Synthea [10] has been developed to provide risk-free EHR data suited to industrial, research, and educational uses, but it has not yet been validated on diverse diseases and treatment modules. McLachlan [27] also performed a comprehensive domain analysis and validation of different SDG approaches. However, generating realistic synthetic EHR data remains a challenging problem. In addition to preserving the statistical features of the real data, synthetic data should prove functional for the relevant applications. For instance, Choi et al. observed in [28] that, in practice, the resulting synthetic EHR data are often not sufficiently realistic for machine learning tasks,

e.g., predictive modeling. The goal of our research is to address all the issues mentioned above and propose a general model that does not depend on any specific disease, number of dimensions, or size of data. The model should be suitable for generating realistic synthetic EHR data that are statistically sound as well as good enough for machine learning tasks.

1.3 Developing Idea of the Proposed Method

Most recently, generative adversarial networks (GANs) [29], a type of neural network, have become a hot research topic for both researchers and developers because of their remarkable performance in generating high-quality synthetic images in an adversarial manner that may mislead a person into accepting such images as original images. A GAN comprises two neural networks: a generator (G) for generating fake but realistic images, and a discriminator (D) for predicting (distinguishing) whether the input image is real or fake. Through the two competing networks G and D, a GAN can generate synthetic images that are nearly indistinguishable from real images. Leveraging this power of creating realistic synthetic images, GANs have been successfully applied in many applications such as image generation [30–33], text-to-image synthesis [34, 35], image-to-image translation [36–38], video generation [39, 40], and music generation [41]. All these works indicate that GANs are well suited to producing realistic synthetic samples. Since our objective in this research is to create realistic synthetic data, we were motivated by this power of GANs and set out to improve on it. Note that a GAN exhibits remarkable performance in generating real-valued continuous data, but it has limitations in generating discrete data [42–44].
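The limitation with discrete data noted above can be illustrated with a small numerical experiment. The following is a minimal sketch of our own, not code from medGAN or from this thesis: it compares the finite-difference slope of a continuous sigmoid output with that of the same output after thresholding into a 0/1 feature. The function names `sigmoid`, `hard_binary`, and `finite_difference`, the 0.5 threshold, and the step size `h` are all assumptions made for the illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Continuous output in (0, 1), as a generator might emit for one feature."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_binary(x: float) -> float:
    """Discrete output: threshold the sigmoid at 0.5 (illustrative choice)."""
    return 1.0 if sigmoid(x) >= 0.5 else 0.0


def finite_difference(f, x: float, h: float = 1e-4) -> float:
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    # Away from the threshold, the discrete output does not react to small
    # changes of the input, while the continuous output always does.
    for x in (-2.0, -0.5, 0.3, 1.5):
        print(x, finite_difference(sigmoid, x), finite_difference(hard_binary, x))
    # Exactly at the threshold the discrete output jumps, so the estimate
    # grows without bound as h shrinks.
    print("at threshold:", finite_difference(hard_binary, 0.0))
```

In this toy setting the thresholded output has zero slope everywhere except at the jump, so a loss computed on it would pass no useful gradient back to the parameters that produced `x`, whereas the continuous output always yields a nonzero slope.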
A major reason is that a GAN fails to learn the distribution of discrete data in their original form during the gradient update process in training. The discrete generation process has zero gradient nearly everywhere (and an infinite gradient elsewhere), so back-propagation alone cannot train the generator [44]. To overcome this limitation, Choi et al. proposed an approach called medical GAN (medGAN) [28] for synthesizing discrete EHR data. They incorporated an autoencoder with the original GAN to learn the distribution of discrete data. Moreover, they incorporated the minibatch averaging method into the adversarial framework to prevent “mode collapse”, a problem in which a GAN tends to generate data with low diversity. Within the healthcare domain, the medGAN framework focuses on patients’ aggregated discrete features (e.g., binary and count features) derived from longitudinal EHRs for use in machine learning tasks. The authors achieved performance comparable to real data in many experiments, including distribution statistics and predictive modeling tasks. Like medGAN, we also aim to synthesize discrete data, and to achieve better performance than medGAN by using enhanced generative adversarial networks.

1.4 Objective and Contribution of this Research

We can now summarize the objectives of this study. The aim is to create more realistic synthetic data than those generated by medGAN. The resulting synthetic data should be not only statistically sound but also good enough for application purposes, e.g., machine learning tasks. The proposed method should be generic and applicable to datasets from different domains. For this reason, we investigate EHR data in the healthcare domain and a crime dataset from the City of Los Angeles. To achieve these aims and objectives, we applied two improved design concepts of the original GAN, namely Wasserstein GAN with gradient penalty (WGAN-GP) [45] and boundary-seeking GAN (BGAN) [44], as alternatives to the GAN in the medGAN framework. We call the resulting approaches medWGAN and medBGAN, respectively. The main contributions of the present study are as follows:

• We introduce two effective models, medWGAN and medBGAN, to generate more realistic synthetic EHR data than those generated by the existing medGAN method. In addition to the EHR data, we generate synthetic crime data for the City of Los Angeles.

• We evaluate, compare, and analyze the performance of the proposed models against the baseline model using statistical and machine learning methods. We observe that the proposed models are capable of learning the discrete attributes of real data accurately.
• We show empirically that the proposed medWGAN and medBGAN outperform medGAN statistically as well as in machine learning tasks (association rule mining and prediction).
• We also show that the proposed models are generic and can be applied to produce high-quality synthetic discrete data with both count and binary features in different domains.

1.5 Dissertation Organization

This dissertation is organized as follows. Chapter 1 discusses the background and significance, related works, motivation, objective and contribution of this research. Chapter 2 describes synthetic data and its applications, the importance of synthetic data, synthetic data generation, the baseline medGAN, and the proposed models. Chapter 3 investigates the synthesis of electronic health records (EHRs), covering data collection, processing and statistical analysis (section 3.1), experiments (section 3.2), evaluation results (section 3.3) and discussion (section 3.4). Similarly, chapter 4 investigates the synthesis of crime data. Chapter 5 presents concluding remarks and future work, along with funding, competing interests and the contributors. Appendix A describes how to install the models to generate synthetic data.


Chapter 2
Materials and Methods

2.1 Overview

This chapter discusses synthetic data and its applications, its importance, and approaches to synthetic data generation, followed by several generative techniques: generative adversarial networks (GANs), Wasserstein GAN with gradient penalty (WGAN-GP) and boundary-seeking GAN (BGAN). It also describes the baseline model, medical GAN (medGAN), and finally presents the proposed models of this study.

2.2 Synthetic Data and Its Applications

Synthetic data are computer-generated data that mimic real data. The generation process may involve human actions to some extent or be entirely automated. Synthetic data are often created with the help of algorithms and are used for a wide range of activities, including as test data for new products and tools, for model validation, and for AI needs, e.g., to train machine learning models [46]. Synthetic data are generated to meet specific needs or conditions that may not be found in the original, real data, or when real data are not freely available. This can be useful when designing and testing any type of system because the synthetic data are used as a simulation or as a theoretical value, situation, etc. This allows us to take unexpected results into account and have a basic solution or remedy if the results prove to be unsatisfactory. Synthetic data are often generated to represent the authentic data and allow a baseline to be set [47]. Another use of synthetic data is to protect the privacy and confidentiality of a set of authentic data. Real data contain personal, private, or confidential information that a programmer, software creator or research project may not want to be disclosed [11]. Because synthetic records do not correspond to real individuals, their use helps maintain confidentiality and privacy.

Business functions that can benefit from synthetic data include [46]:

• Machine learning: Synthetic data are gaining traction within the machine learning domain. Machine learning algorithms are trained using an immense amount of data, and collecting the necessary amount of labeled training data can be cost prohibitive. Synthetically generated data can help companies and researchers build the data repositories needed to train and even pre-train machine learning models, a technique referred to as transfer learning. Research efforts to advance the use of synthetic data in machine learning are underway. Computer vision, image recognition and robotics are additional applications that benefit from the use of synthetic data. Self-driving car simulation is a leading example.
• Agile development and DevOps: For software testing and quality assurance, artificially generated data, often called ‘test data’ in this setting, are frequently the better choice because they eliminate the need to wait for ‘real data’. This can ultimately lead to decreased test time and increased flexibility and agility during development.
• Clinical and scientific trials: Synthetic data can be used as a baseline for clinical and scientific studies and testing when no real data yet exist.

• Research: Synthetic data help researchers better understand the format of real data not yet recorded, develop an understanding of its specific statistical properties, tune parameters for related algorithms, or build preliminary models.

Since synthetic data are manufactured with attributes similar to those of actual sensitive or regulated data, data professionals can use and share them more freely. Some industries that can benefit from synthetic data are healthcare and financial services [46, 48]:

• Healthcare: Synthetic data enable healthcare data professionals to allow the public use of record data while still maintaining the privacy and confidentiality of the patients.
• Financial services: In the financial sector, synthetic data such as debit and credit card payments that look and act like typical transaction data can help expose fraudulent activity. Data scientists can use synthetic data to test or evaluate fraud detection systems as well as develop new fraud detection methods.

2.3 The Importance of Synthetic Data

High-quality original data are always expensive to access, in both time and money, and smaller organizations cannot afford to spend liberally on them. Synthetic data, on the other hand, are inexpensive or freely available, and even allow one to explore niche applications where data would normally be extremely challenging to acquire, such as the health or satellite imaging fields. Synthetic data have a number of benefits over real data [46]:

• Real data may have usage constraints due to privacy rules or other regulations. Synthetic data can capture the statistical properties of real data without exposing the real records, thereby mitigating the issue.

• Synthetic data are either free or inexpensive in terms of time and money. Once the synthetic environment is ready, producing synthetic data is faster and more cost-effective than collecting real data.
• Where real data do not exist, typically when training and testing new systems, synthetic data are the only solution.
• Where real data are not sufficient to ensure system performance, synthetic data can be generated to cover the situations of interest.
• Synthetic data aim to preserve the multivariate relationships between variables instead of specific statistics alone.
• Synthetic data can have perfectly accurate labels, including labeling that may be very expensive or impossible to obtain by hand.

2.4 AI-based Synthetic Data Generation

A variety of synthetic data generation (SDG) methods and enterprise-level tools have been developed across a wide range of domains. In section 1.2 (Related Works), we pointed out some earlier and recent research on SDG. The next subsections describe the AI-based SDG methods, namely generative adversarial networks (GANs), Wasserstein GAN with gradient penalty (WGAN-GP), and boundary-seeking GAN (BGAN), together with the baseline method, medical GAN (medGAN). These are the foundation frameworks of our proposed models.

2.4.1 Generative adversarial networks (GANs)

The generative adversarial network (GAN) framework was introduced by Ian J. Goodfellow et al. at the NIPS 2014 conference [29]. Yann LeCun, Director of AI Research at Facebook and Professor at NYU, said the following in his Quora session [49]: “(GANs), and the variations that are now being proposed is the most interesting idea in the last 10 years in ML, in my opinion.” The conceptual idea of a GAN architecture is shown in Figure 2.1. The main idea of GANs, as indicated by the authors, is to train two neural networks: a generator G that generates synthetic or fake samples from random noise, and a discriminator D that classifies whether a sample originates from the original data (real) or from the generator (fake). The training goal of G is to fool D into believing that the generated samples are real. On the other hand, D is trained with both real and fake samples so that it can identify the samples from G as fake. Through the competition between these two networks (hence “adversarial”), a GAN framework can produce realistic synthetic samples. This framework resembles a two-player minimax game [29, 50].

Fig. 2.1 The conceptual idea of GAN architecture.

A commonly used analogy is that the generator (G) is akin to a forger (criminal) trying to produce counterfeit money and that the discriminator (D) is akin to the police attempting to detect the counterfeit money. The objective of the criminal is to counterfeit money such that the police cannot discriminate the counterfeit money from real money. By contrast, the police want to detect the counterfeit money as well as possible. Formally, the minimax game between G and D with the value function $V(D, G)$ is as follows:

\[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{2.1} \]

where $p_{data}$ is the data distribution and $p_z$ is a simple noise distribution (e.g., a uniform distribution or a spherical Gaussian distribution). Initially, G accepts a random prior $z \sim p_z$ and generates synthetic samples for D to judge. G is then updated by using the error signal from D through back-propagation. D and G alternate in optimizing their respective parameters $\theta_d$ and $\theta_g$ as follows:

\[ \theta_d \leftarrow \theta_d + \alpha \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x_i) + \log(1 - D(G(z_i))) \right] \tag{2.2} \]

\[ \theta_g \leftarrow \theta_g + \alpha \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log D(G(z_i)) \tag{2.3} \]

where $m$ is the size of the mini-batch and $\alpha$ is the step size. A more compact design of the GAN, equivalent to Figure 2.1, is shown in Figure 2.4a.

2.4.2 Wasserstein GAN with gradient penalty (WGAN-GP)

The authors of the WGAN-GP model [45] claimed that the previously developed Wasserstein GAN (WGAN) [51] facilitates stable training but generates low-quality samples or fails to converge in some settings owing to the use of the weight-clipping technique. To overcome these issues, they proposed an alternative to weight clipping called gradient penalty, which penalizes the norm of the gradient of the discriminator (critic) with respect to its input. To demonstrate this, the authors trained WGAN critics with weight clipping and WGAN critics with gradient penalty to optimality on several toy distributions. They showed that the gradient penalty in WGANs did not exhibit the undesired behavior of weight clipping. With weight clipping, WGAN critics failed to capture higher moments of the data distribution, and gradient norms exploded or vanished during training. The results are shown in Figure 2.2.

(a) Value surfaces of WGAN critics trained to optimality on toy datasets using (top) weight clipping and (bottom) gradient penalty. Critics trained with weight clipping fail to capture higher moments of the data distribution. The ‘generator’ is held fixed at the real data plus Gaussian noise.
(b) (left) Gradient norms of deep WGAN critics during training on toy datasets either explode or vanish when using weight clipping, but not when using a gradient penalty. (right) Weight clipping (top) pushes weights towards two values (the extremes of the clipping range), unlike gradient penalty (bottom).
Fig. 2.2 Gradient penalty in WGANs does not exhibit undesired behavior like weight clipping.

2.4.3 Boundary-Seeking GAN (BGAN)

In a GAN-based approach, a generator is trained to match a target distribution that converges toward the true distribution of the data as the discriminator is optimized. Under this interpretation, the learning objective of a generator is to minimize the difference between the discriminator’s log-probabilities for the sample being positive and negative. In boundary-seeking GAN (BGAN) [44], this objective is interpreted as training a generator to create samples that lie on the decision boundary of the current discriminator at each update. Hence, a GAN trained using this algorithm is called a BGAN.

(a) Early stage of learning (b) Late stage of learning
Fig. 2.3 Qualitative comparison between the conventional GAN and the proposed BGAN in 1-D examples.

The BGAN authors qualitatively analyzed the difference between the conventional GAN and the proposed BGAN, as shown in Figure 2.3. They considered a one-dimensional variable and drew 20 samples for each of the real and generated sets from two Gaussian distributions, considering two cases: (a) an early stage of learning and (b) a late stage of learning. They used -2 and 2 in the first case, and -0.1 and 0.1 in the second case, as the centers of the two Gaussians. The variance was set to 0.3 for both distributions. As shown in Figure 2.3a, the authors plotted both real and generated samples on the x-axis (y = 0). The solid red curve corresponds to the discriminator $D(x)$, and its log-gradient ($\partial \log D(x) / \partial x$) is drawn with a dashed red curve. Maximizing $\log D(x)$, as conventionally done with GANs, pushes the generator beyond the real samples (orange circles). On the other hand, the proposed criterion of BGAN has its minimum at the decision boundary of the discriminator (blue curve). Minimizing this criterion has the effect of pushing the generated samples, or correspondingly the generator, toward the real samples, but never beyond the region occupied by them. The issue is much more apparent in Figure 2.3b, where the real and generated samples are extremely close to each other. The proposed BGAN encourages the generator to stay close to the center of the real samples, while the conventional objective pushes the generated samples beyond the real samples.

2.5 Medical GAN (medGAN)

As mentioned, the original GAN can only learn the distribution of continuous values, and the authors of the medGAN framework ameliorated this limitation by leveraging the power of autoencoders [28]. The autoencoder is pretrained before GAN training. The general idea of an autoencoder is to map an input $x$ to an output $x'$ (called the reconstruction) through an internal representation or hidden layer $h$. An autoencoder comprises two components: an encoder $h = Enc(x)$ and a decoder $x' = Dec(h)$ [52]. This autoencoder mechanism is widely used to learn the salient features of training samples in various modern neural network applications [53, 54].

(a) Original GAN architecture (b) medGAN architecture
Fig. 2.4 Original GAN and the baseline medGAN architecture.

In the medGAN framework (Figure 2.4b), an autoencoder is used to capture the salient features of the discrete variables and to decode the continuous output of G. Hence, the objective of the autoencoder is to minimize the reconstruction error:

\[ \frac{1}{m} \sum_{i=1}^{m} \lVert x_i - x'_i \rVert_2^2 \tag{2.4} \]

\[ -\frac{1}{m} \sum_{i=1}^{m} \left[ x_i \log x'_i + (1 - x_i) \log(1 - x'_i) \right] \tag{2.5} \]

where $m$ is the size of the mini-batch (discussed later). medGAN used the mean squared loss (Equation 2.4) for count variables and the cross-entropy loss (Equation 2.5) for binary variables. For count variables, it used rectified linear units (ReLU) as the activation function in both Enc and Dec. For binary variables, it used the tanh activation for Enc and the sigmoid activation for Dec. As shown in Figure 2.4b, the continuous output of the generator $G(z)$ is passed through the decoder Dec, which selects the appropriate distribution from $G(z)$ and yields the discrete output $x_z = Dec(G(z))$. The discriminator D can then determine whether this synthetic discrete sample $x_z$ is fake or real. medGAN is trained in a similar fashion to the original GAN. The pre-trained decoder parameters $\theta_{dec}$ are fine-tuned while optimizing G. medGAN used ReLU for all of G’s activation functions, except for the output layer with binary data, where it used the tanh function. For D, it used ReLU for all activation functions except for the output layer with binary data, where it used the sigmoid function.

Another performance-enhancing technique used in the medGAN framework is mini-batch averaging. Occasionally, in a GAN, G may produce the same synthetic output for different random priors $z$ rather than diverse outputs, arguably because GAN training often ends up solving the max-min problem instead of the intended min-max problem [50]. In the medGAN framework, mini-batch averaging mitigates this “mode collapse” problem and significantly improves model performance in generating discrete synthetic data.
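As a rough illustration of Equations 2.4 and 2.5, the two reconstruction losses can be computed for a small mini-batch as in the following Python sketch. It is a minimal example and not the medGAN implementation: the function names `mse_reconstruction_loss` and `cross_entropy_reconstruction_loss`, the toy vectors, and the clipping constant `eps` (added only to keep the logarithm finite) are assumptions made for illustration.

```python
import math

def mse_reconstruction_loss(batch, recon):
    """Mean over the mini-batch of the squared L2 distance (Equation 2.4)."""
    m = len(batch)
    total = 0.0
    for x, x_rec in zip(batch, recon):
        total += sum((a - b) ** 2 for a, b in zip(x, x_rec))
    return total / m

def cross_entropy_reconstruction_loss(batch, recon, eps=1e-12):
    """Mean over the mini-batch of the binary cross entropy (Equation 2.5).

    eps is an illustrative safeguard that keeps log() away from zero.
    """
    m = len(batch)
    total = 0.0
    for x, x_rec in zip(batch, recon):
        for a, b in zip(x, x_rec):
            b = min(max(b, eps), 1.0 - eps)
            total -= a * math.log(b) + (1 - a) * math.log(1 - b)
    return total / m

# Toy count data: two patients, three ICD codes (values are made up).
counts = [[2, 0, 1], [0, 3, 0]]
counts_rec = [[1.5, 0.0, 1.0], [0.0, 2.0, 0.5]]
count_loss = mse_reconstruction_loss(counts, counts_rec)

# Toy binary data with sigmoid-like reconstructions in (0, 1).
binary = [[1, 0, 1], [0, 1, 0]]
binary_rec = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]]
binary_loss = cross_entropy_reconstruction_loss(binary, binary_rec)

print(count_loss, binary_loss)
```

Both functions average over the mini-batch size m, as in the equations, rather than over the number of features; a framework implementation might normalize differently.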

Algorithm 1 describes the overall optimization process of medGAN [28]. Both the encoder Enc and the decoder Dec of the autoencoder are single-layer feedforward neural networks, where the original input x is compressed to a 128-dimensional vector. The generator G is implemented as a feedforward neural network with two hidden layers, each having 128 dimensions. The discriminator D is also a feedforward neural network with two hidden layers, where the first layer has 256 dimensions and the second layer has 128 dimensions.

2.6 The Proposed Models

In this section, we discuss the concept and design of our proposed models, medical Wasserstein GAN with gradient penalty (medWGAN) and medical boundary-seeking GAN (medBGAN). Figure 2.5 shows the design components of the proposed generative models along with the baseline medGAN model.

(a) medGAN (b) medWGAN (c) medBGAN
Fig. 2.5 Comparative design components of the three generative models.

In the proposed medWGAN, we employ an improved generative network, Wasserstein GAN with gradient penalty (WGAN-GP), instead of the general GAN (Figure 2.5b). The remainder of the structure is the same as that of medGAN shown in Figure 2.5a. WGAN-GP has been reported to perform better in training speed and sample quality than many GAN architectures, including the standard WGAN [45]. It enhances training stability and helps models converge better. Hence, in this investigation, we hypothesize that applying medWGAN to generate synthetic EHR data would yield superior performance to that achieved by the original medGAN.

Our other proposed model is medBGAN, which we obtained by replacing the traditional GAN in the medGAN framework with boundary-seeking GAN (BGAN) (Figure 2.5c). The BGAN algorithm works on both discrete and continuous variables and has been shown to perform qualitatively better than conventional GANs [44]. Similar to medWGAN, medBGAN is expected to exhibit high performance in generating synthetic EHR data.

In section 2.5, we described the design concept, internal architecture, and implementation details of medGAN. As in medGAN, the discriminator D and the generator G in our models are implemented as feedforward neural networks. The generative process of synthetic data in each of the proposed models is the same as in medGAN, shown in Figure 2.4b and Algorithm 1. We also kept the same hyper-parameter values during GAN training.
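To make the difference between the three models concrete, the following Python sketch evaluates the generator and discriminator (critic) losses of a standard GAN, WGAN-GP, and BGAN on a handful of made-up discriminator outputs. It is a hedged illustration rather than the code used in this study. The function names, the penalty coefficient `lam = 10.0`, and the toy numbers are assumptions. The gradient norms are supplied by hand, whereas a real implementation would obtain them by automatic differentiation at points interpolated between real and synthetic samples. The BGAN loss shown is the squared log-odds form that matches the description in section 2.4.3.

```python
import math

def mean(values):
    return sum(values) / len(values)

def gan_losses(d_real, d_fake):
    """Standard GAN losses (negated forms of Equations 2.2 and 2.3).

    d_real, d_fake: discriminator probabilities in (0, 1).
    """
    d_loss = (-mean([math.log(r) for r in d_real])
              - mean([math.log(1.0 - f) for f in d_fake]))
    g_loss = -mean([math.log(f) for f in d_fake])
    return d_loss, g_loss

def wgan_gp_losses(c_real, c_fake, grad_norms, lam=10.0):
    """WGAN-GP losses from unbounded critic scores.

    grad_norms: norms of the critic's gradient with respect to its input,
    assumed to be computed elsewhere (e.g., by autodiff).
    lam: penalty coefficient (illustrative value).
    """
    penalty = mean([(g - 1.0) ** 2 for g in grad_norms])
    c_loss = mean(c_fake) - mean(c_real) + lam * penalty
    g_loss = -mean(c_fake)
    return c_loss, g_loss

def bgan_generator_loss(d_fake):
    """Boundary-seeking generator loss: squared log-odds of D on fake samples.

    The loss is zero when D outputs 0.5, i.e., on the decision boundary.
    """
    return 0.5 * mean([(math.log(f) - math.log(1.0 - f)) ** 2 for f in d_fake])

# Made-up outputs for a mini-batch of three real and three synthetic samples.
d_real, d_fake = [0.9, 0.8, 0.7], [0.2, 0.4, 0.5]
print("GAN:", gan_losses(d_real, d_fake))
print("WGAN-GP:", wgan_gp_losses([1.2, 0.8, 1.0], [-0.5, 0.1, 0.4], [1.0, 1.3, 0.7]))
print("BGAN generator:", bgan_generator_loss(d_fake))
print("BGAN at the boundary:", bgan_generator_loss([0.5, 0.5, 0.5]))
```

Only the loss computation differs between the three variants in this sketch; the autoencoder, generator, and discriminator networks are assumed to stay as in medGAN.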


Chapter 3
Synthesizing Electronic Health Records

3.1 Data Collection, Processing and Analysis

This section discusses the data collection procedure, data processing (conversion to aggregated count and binary data) and various statistics of the datasets used in this study.

3.1.1 Data collection

The datasets used in this study were obtained from two sources. The first source was the Medical Information Mart for Intensive Care (MIMIC-III) database [55], a freely available public database comprising de-identified EHRs associated with approximately 60K patient admissions to the critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. MIMIC-III contains various types of health-related data, of which we used patients’ diagnoses data (DIAGNOSES_ICD) and procedures data (PROCEDURES_ICD), coded using the International Statistical Classification of Diseases and Related Health Problems (ICD) system [56]. In this study, we investigated two different MIMIC-III datasets: one consists of diagnoses data and the other (extended MIMIC-III) consists of both diagnoses and procedures data. The second source was the Taiwan National Health Insurance Research Database (NHIRD) [57], which contains data on both patients and medical facilities under the National Health Insurance program. Access to the NHIRD is limited, but permission is provided for its use in research work in Taiwan. We used the Longitudinal Health Insurance Database 2005 (LHID2005, a subset of the NHIRD) for the years between 1996 and 2011 and extracted inpatient expenditures by admission (DD) from it. As with MIMIC-III, we separated out the patients’ diagnoses data coded using the ICD system. Note that although our datasets consist of patients’ diagnoses and procedures data, they include rich information on various diseases, injuries, congenital anomalies, symptoms, signs, abnormal conditions, supplementary factors influencing health status, operations, medical services, etc. [58, 59].

Like medGAN, in this research we concentrated on generating aggregated count data (how many times a patient was associated with a specific ICD code of a disease or procedure) and binary data (absence or presence of specific ICD codes). The use of aggregated EHR data is common in many studies for machine learning tasks [60–63]. The following two sections describe the conversion of longitudinal EHR data to aggregated count and binary data.

3.1.2 Convert to aggregated data (count)

For a fair comparison with medGAN, we reduced the ICD codes to three-digit codes for each dataset. Note that in the longitudinal EHR datasets, each row corresponds to a patient’s admission record of diagnoses data (MIMIC-III and NHIRD) or of diagnoses and procedures data (extended MIMIC-III), represented by ICD codes. A patient is likely to visit a hospital more than once, and so may have multiple records in the EHR data. We aggregated each patient’s longitudinal record into a single fixed-size vector of ICD codes.
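The aggregation just described, together with the binary conversion covered in section 3.1.3, can be sketched in Python as follows. This is a simplified illustration, not the preprocessing script used in the study. The function names, the toy patient IDs, and the toy codes are hypothetical, and treating the first three characters of a code string as its three-digit category is an assumption that ignores any special handling particular ICD code families may need.

```python
from collections import Counter

def to_three_digit(icd_code):
    """Truncate an ICD code string to its first three characters."""
    return icd_code[:3]

def aggregate_counts(admissions):
    """Aggregate longitudinal records into one count vector per patient.

    admissions: iterable of (patient_id, icd_code) pairs, one per coded
    diagnosis or procedure in an admission record.
    Returns (codes, matrix) where codes is the sorted list of three-digit
    codes (columns) and matrix maps patient_id to a list of counts.
    """
    per_patient = {}
    for patient_id, icd_code in admissions:
        per_patient.setdefault(patient_id, Counter())[to_three_digit(icd_code)] += 1
    codes = sorted({code for counts in per_patient.values() for code in counts})
    matrix = {pid: [counts[code] for code in codes]
              for pid, counts in per_patient.items()}
    return codes, matrix

def binarize(matrix):
    """Map each count c to 1 if c > 0 and to 0 otherwise."""
    return {pid: [1 if c > 0 else 0 for c in row] for pid, row in matrix.items()}

# Hypothetical, anonymized toy records spanning several admissions.
admissions = [
    ("AAAAAA", "4019"), ("AAAAAA", "4011"), ("AAAAAA", "25000"),
    ("BBBBBB", "4280"),
    ("CCCCCC", "25000"), ("CCCCCC", "25002"), ("CCCCCC", "4280"),
]
codes, count_matrix = aggregate_counts(admissions)
binary_matrix = binarize(count_matrix)
print(codes)                    # ['250', '401', '428']
print(count_matrix["AAAAAA"])   # [1, 2, 0]
print(binary_matrix["AAAAAA"])  # [1, 1, 0]
```

Every patient ends up with a vector of the same length, one entry per three-digit code observed in the dataset, which is what allows the records to be stacked into a single matrix.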
Thus, we represented each dataset as a matrix in which a row corresponds to a patient’s record and a column to a specific ICD code (e.g., a diagnosis code or a procedure code). Since ICD codes are aggregated by patient, they are all count variables. The count variables indicate the number of times a patient was associated with a specific ICD code. Table 3.1 shows a portion of a sample count dataset. All values in Table 3.1 are anonymized.

Table 3.1 A portion of sample count dataset

Patient ID   ICD_817   ICD_819   ICD_363
AAAAAA       2         4         5
BBBBBB       0         0         0
CCCCCC       3         2         0
...          ...       ...       ...
XXXXXX       1         0         4

3.1.3 Convert to binary data

Note that all the features in our three datasets (MIMIC-III, extended MIMIC-III and NHIRD) are count variables. Since we would like to analyze both count and binary discrete variables, we prepared a binary version of each count dataset by converting the aggregated count variables (say $c_i$) to binary variables (say $b_i$) using the following equation:

\[ b_i = \begin{cases} 1, & \text{if } c_i > 0 \\ 0, & \text{otherwise} \end{cases} \tag{3.1} \]

Table 3.2 shows a portion of a sample binary dataset derived from the corresponding count dataset in Table 3.1. The binary variables indicate whether a patient was associated with a specific ICD code.

Table 3.2 A portion of sample binary dataset

Patient ID   ICD_817   ICD_819   ICD_363
AAAAAA       1         1         1
BBBBBB       0         0         0
CCCCCC       1         1         0
...          ...       ...       ...
XXXXXX       1         0         1

3.1.4 Statistics of datasets

Some basic statistics of the three datasets derived from the two data sources are presented in Table 3.3. Observe that the NHIRD dataset is larger than the MIMIC-III datasets in terms of the number of patients/records. There are 942 ICD codes in the MIMIC-III diagnoses dataset, 1,651 ICD codes (diagnoses codes: 940 and procedures codes: 711) in the extended MIMIC-III dataset and 1,015 ICD codes in the NHIRD diagnoses dataset. However, as can be seen from Figure 3.1, the NHIRD dataset is sparser than the MIMIC-III datasets. In Figure 3.1a, we plot the empirical cumulative distribution function (ECDF) of the number of unique ICD codes associated with each patient in each dataset. In NHIRD, 70% of patients have five or fewer unique ICD codes, whereas in MIMIC-III and extended MIMIC-III the same percentage of patients have up to 13 and 18 unique ICD codes, respectively. In Figure 3.1b, we compute the proportion of patients associated with each ICD code and then plot the ECDF of these proportions. In NHIRD, 90% of the ICD codes (913 among 1,015) are associated with 1.31% of patients or less, whereas in MIMIC-III 90% of the ICD codes (845 among 942) are associated with up to 2.95% of patients, and in extended MIMIC-III 90% of the ICD codes (1,487 among 1,651) are associated with up to 2.17% of patients. Note that, as shown in Table 3.3, in the text, tables and figures that follow, the MIMIC-III dataset denotes only diagnoses data whereas the extended MIMIC-III dataset denotes both diagnoses and procedures data.

Table 3.3 Basic statistics of datasets

Statistics                           MIMIC-III          Extended MIMIC-III                        NHIRD, Taiwan
                                     (diagnoses data)   (diagnoses + procedures data)             (diagnoses data)
# of patients / records              46,517             42,214                                    498,909
# of unique ICD codes / dimensions   942                1,651 (diagnoses: 940; procedures: 711)   1,015
Avg. # of codes per patient          13.99              20.17                                     8.42
Max. # of codes for a patient        540                610                                       687
Min. # of codes for a patient        1                  2                                         1

Fig. 3.1 ECDFs of ICD codes and patients for MIMIC-III, extended MIMIC-III and NHIRD datasets.

Table 3.4 and Table 3.5 list the top 10 most frequent ICD codes along with their meaning, frequency of occurrence, number of unique patients and percentage of patients associated with each code in the MIMIC-III and NHIRD diagnoses datasets. The detailed description of each ICD code can be looked up on the following website: http://icd9.chrisendres.com/. Table 3.6 shows the top 10 patients’ data of the MIMIC-III and NHIRD datasets, including the frequency (i.e., the total number of ICD codes), the number of unique ICD codes and the percentage of unique ICD codes for each patient.

3.2 Experiments

This section mainly discusses the experimental setup and the process of training the generative models in this study. It also describes the methods for evaluating synthetic data, including statistical and machine learning methods.

Table 3.4 Top frequent ICD codes of MIMIC-III

Top ICD codes  Meaning                                                 Frequency  No. of patients associated with  Percent of patients associated with
ICD_401        Essential hypertension                                  21,329     18,031                           38.76 %
ICD_427        Cardiac dysrhythmias                                    20,998     14,022                           30.14 %
ICD_428        Heart failure                                           20,676     10,154                           21.83 %
ICD_276        Disorders of fluid, electrolyte, and acid-base balance  20,440     12,645                           27.18 %
ICD_250        Diabetes mellitus                                       16,454     10,318                           22.18 %
ICD_414        Other forms of chronic ischemic heart disease           15,759     11,926                           25.64 %
ICD_272        Disorders of lipoid metabolism                          14,768     12,268                           26.37 %
ICD_518        Other diseases of lung                                  14,608     11,363                           24.43 %
ICD_285        Other and unspecified anemias                           12,910     10,631                           22.85 %
ICD_584        Acute renal failure                                     11,467     9,536                            20.50 %

Table 3.5 Top frequent ICD codes of NHIRD, Taiwan

Top ICD codes  Meaning                                                 Frequency  No. of patients associated with  Percent of patients associated with
ICD_250        Diabetes mellitus                                       170,162    44,284                           8.88 %
ICD_401        Essential hypertension                                  144,662    66,258                           13.28 %
ICD_599        Other disorders of urethra and urinary tract            89,524     47,394                           9.50 %
ICD_295        Schizophrenic disorders                                 84,584     4,622                            0.93 %
ICD_486        Pneumonia, organism unspecified                         68,484     41,982                           8.41 %
ICD_650        Normal delivery                                         67,437     47,154                           9.45 %
ICD_276        Disorders of fluid, electrolyte, and acid-base balance  66,082     42,940                           8.61 %
ICD_414        Other forms of chronic ischemic heart disease           61,985     28,228                           5.66 %
ICD_V27        Outcome of delivery                                     60,200     43,896                           8.80 %
ICD_571        Chronic liver disease and cirrhosis                     59,547     24,796                           4.97 %
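The percentage column in Tables 3.4 and 3.5 is consistent with dividing the number of patients associated with a code by the total number of patients in the corresponding dataset (Table 3.3). A minimal Python sketch of that arithmetic follows; the function name `percent_of_patients` and the dictionary `TOTAL_PATIENTS` are hypothetical names introduced for illustration and are not taken from the thesis code.

```python
# Total patients per dataset, as reported in Table 3.3.
TOTAL_PATIENTS = {"MIMIC-III": 46_517, "NHIRD": 498_909}


def percent_of_patients(n_patients_with_code, n_patients_total):
    """Share of patients who have a given ICD code, in percent (2 decimals)."""
    return round(100.0 * n_patients_with_code / n_patients_total, 2)


# Patient counts taken from the first rows of Tables 3.4 and 3.5.
mimic_401 = percent_of_patients(18_031, TOTAL_PATIENTS["MIMIC-III"])
nhird_250 = percent_of_patients(44_284, TOTAL_PATIENTS["NHIRD"])
print(mimic_401, nhird_250)
```

The same check can be repeated for any other row by substituting its patient count.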

Table 3.6 Top patients' data of MIMIC-III and NHIRD datasets

                 MIMIC-III                                                       NHIRD, Taiwan
SN. of top       Frequency (No. of   No. of unique   Percent of unique          Frequency (No. of   No. of unique   Percent of unique
patients         total ICD codes)    ICD codes       ICD codes                  total ICD codes)    ICD codes       ICD codes
1                540                 88              9.34 %                     687                 18              1.77 %
2                362                 85              9.02 %                     605                 5               0.49 %
3                361                 44              4.67 %                     527                 23              2.27 %
4                360                 70              7.43 %                     505                 16              1.58 %
5                359                 61              6.48 %                     501                 5               0.49 %
6                332                 74              7.86 %                     490                 14              1.38 %
7                326                 79              8.39 %                     487                 15              1.48 %
8                323                 42              4.46 %                     485                 20              1.97 %
9                316                 77              8.17 %                     469                 7               0.69 %
10               293                 64              6.79 %                     466                 8               0.79 %

3.2.1 Experimental setup

We obtained the source code of medGAN from its GitHub repository [64], trained medGAN and applied it to generate synthetic data without changing its scripts. For our medWGAN and medBGAN, we changed a few lines of code to implement WGAN-GP and BGAN. The source code to reproduce the results is publicly available at: https://github.com/baowaly/SynthEHR.

We split each of the MIMIC-III, extended MIMIC-III and NHIRD datasets into two parts, namely training and testing datasets, at a 4:1 ratio. We used the training dataset to train the models and generate the same number of synthetic EHRs. We reserved the testing dataset to test the predictive models. Most of the parameter settings of medGAN were retained in our models. Some of the common settings are listed in Table 3.7.
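The 4:1 split described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used in the study: the function name `split_records`, the random shuffling, the fixed seed and the rounding-down of the training size are all assumptions. With rounding down, a dataset of 46,517 records yields 37,213 training records, which matches the MIMIC-III training-sample count reported in Table 3.7.

```python
import random


def split_records(records, train_parts=4, test_parts=1, seed=0):
    """Shuffle the records and split them at a train_parts:test_parts ratio.

    The shuffle and the seed are assumptions; the text only states the ratio.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point surprises; the size rounds down.
    n_train = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for 46,517 patient records (the MIMIC-III size in Table 3.3).
records = list(range(46_517))
train, test = split_records(records)
print(len(train), len(test))
```

Passing `train_parts=9` reuses the same helper for any other ratio.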

Table 3.7 Experimental settings

# of training samples of MIMIC-III             37,213
# of training samples of extended MIMIC-III    33,771
# of training samples of NHIRD, Taiwan         399,127
# of epochs to pre-train the autoencoder       100
# of epochs to train the model                 1,000
Batch size                                     1,000
Generator size                                 (128, 128, 128)
Discriminator size                             (256, 128, 1)

3.2.2 Training the models

We further split the training data into training and validation subsets at a 9:1 ratio. We pre-train the autoencoder for 100 epochs using the training subset, and for every epoch we report the training and validation loss, which is defined as binary cross-entropy for binary variables and mean squared error for count variables. From the training curve, we observe that 100 epochs are sufficient and that there is no overfitting.

After pre-training the autoencoder, we copy the decoder part and cascade it as the last layer of the generator G, and train the GAN networks for 1,000 epochs using the 90% training subset. For every epoch, we use the remaining 10% validation subset to check the performance (accuracy and AUC) of the discriminator D as a binary classifier. More importantly, we use the generator G to randomly generate synthetic data every 10 epochs during the training process, and perform sanity checks on these temporarily generated data, such as dimension-wise averages and the number of nonzero dimensions. As training progresses, we observe that the quality of the temporarily generated synthetic data steadily improves, with all check items becoming stable after 700–800 epochs in all cases.

We examined different numbers of discriminator and generator training cycles per training epoch, which we define as the discriminator-to-generator ratio. Based on the correlation coefficients (CCs) between the dimension-wise averages of the training data and the final synthetic data, we set this ratio to 2:1 for medGAN and medWGAN, and 5:1 for medBGAN.

Generation of synthetic binary data: We trained the models and generated synthetic data with sizes equal to the largest multiples of the batch size (1,000) not exceeding the number of training samples (Table 3.7), i.e., 37,000, 33,000 and 399,000 samples from MIMIC-III, extended MIMIC-III and NHIRD, respectively. The raw generated data values were continuous in the range 0–1. We converted them to binary (0 or 1) through rounding.

Generation of synthetic count data: Similar to the binary samples, for count variables, we used the same number of training samples to train the models and generate synthetic data. However, the raw generated data values were continuous nonnegative numbers, so we rounded them to the nearest integer values.

System information and computation time: Our computing server was equipped with two Intel Xeon E5-2667 CPUs (each with 8 physical cores), 512 GB of RAM, eight Nvidia GeForce GTX 1080 Ti GPUs and CUDA 8.0, although we used a single GPU at a time for training the models. We implemented our methods with TensorFlow 1.4. The average running time required to train the models and generate the synthetic data was 1.88 hours for MIMIC-III, 2.29 hours for extended MIMIC-III and 20.12 hours for the NHIRD dataset.

3.2.3 Methods for evaluating synthetic data

After generating the synthetic EHRs, we needed to evaluate them and compare them with the real EHRs. For this purpose, we employed evaluation methods from two different perspectives, as follows.
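Both perspectives compare a real data matrix with a synthetic one, dimension by dimension. As a rough illustration of the two statistical checks named in this section (the dimension-wise average and the two-sample K–S statistic), the Python sketch below uses toy binary records. The function names and the toy data are assumptions made for illustration, not the thesis implementation, and the sketch computes only the K–S statistic, not its p-value.

```python
from bisect import bisect_right


def dimension_wise_average(records):
    """Column sum divided by the total number of records, per dimension."""
    n_records = len(records)
    n_dims = len(records[0])
    return [sum(row[j] for row in records) / n_records for j in range(n_dims)]


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest absolute gap between the two ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Toy binary records: rows are patients, columns are ICD codes (hypothetical).
real = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [1, 0, 1]]
synthetic = [[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]

real_avg = dimension_wise_average(real)
synthetic_avg = dimension_wise_average(synthetic)
# K-S statistic for the first dimension only.
ks_first = ks_statistic([r[0] for r in real], [s[0] for s in synthetic])
print(real_avg, synthetic_avg, ks_first)
```

For binary columns, the K–S statistic reduces to the absolute difference between the two column averages. When a p-value is also needed, a library routine such as SciPy's `scipy.stats.ks_2samp` returns both the statistic and the p-value.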

Statistical methods

As a basic sanity check of whether our models learned the distribution of each dimension acceptably, we calculated the dimension-wise average and performed the dimension-wise Kolmogorov–Smirnov test (K–S test).

• Dimension-wise average: This refers to the column average of each dimension (disease or procedure code) in the dataset. The dimension-wise average is calculated using the following formula.

\[
\text{Dimension-wise average} = \frac{\text{Column sum}}{\text{Total number of records}} \tag{3.2}
\]

• Dimension-wise K–S test: We performed the K–S test on two data samples (synthetic data and real data) to examine whether the two data samples originate from the same distribution. In the K–S test, the statistic is calculated by finding the maximum absolute value of the differences between the two samples' cumulative distribution functions [65]. The null hypothesis is that both samples originate from a population with the same distribution. In our experiment, we rejected the null hypothesis when the p-value was low (typically ≤ 0.05). More details of the K–S test are discussed in the results section.

Machine learning methods

We applied association rule mining and dimension-wise prediction to test how well inter-dimensional relationships are preserved in the synthetic data.

• Association rule mining: Association rule mining, such as Apriori, is widely used on EHR data to identify associations and interpretable patterns among clinical concepts (medications, laboratory results, and problem diagnoses) [66–68]. We employed this rule-based machine learning method to discover strong associations or relations among variables in both real and synthetic datasets. We checked whether the