
Therefore, as a general method, the proposed models can be used to generate any realistic synthetic data with discrete features, beyond the medical or crime domain.

This research introduces an approach, method, and system that can produce synthetic data that are free from any legal, privacy, and security issues. Since the synthetic data generated by our method can carry the attributes of real data perfectly, people can use them for academic, research, or business purposes. Applying the proposed method can thus help mitigate the difficulty of obtaining real data in such cases. We hope this study will play a significant role in advancing research and industry development.

5.3 Funding

This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

5.4 Competing Interests

The authors have no competing interests to declare.

5.5 Contributors

Mrinal Kanti Baowaly primarily contributed to the conception and design of the work; the acquisition, analysis, and interpretation of the data; the implementation of the work; and the production, evaluation, and analysis of the results. He also drafted the work. Prof. Sheng-Wei “Kuan-Ta” Chen and Prof. Chao-Lin Liu both contributed significantly by supervising the whole research and advising on the draft. All authors critically revised the work for important intellectual content and approved the final version of the dissertation. All of them agree to be accountable for all aspects of the work and will help ensure that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

National Chengchi University (國立政治大學)



Appendix A

How to install the models to generate synthetic data

Ubuntu/Linux

Getting the distribution:

1. The source code can be downloaded from https://github.com/baowaly/SynthEHR.

2. This distribution has been tested on the Ubuntu/Linux operating system only.

Source code explanation:

Our models were implemented using Python and TensorFlow.

model.py defines the MEDGAN, MEDWGAN, and MEDBGAN classes, which are imported in train.py to build the neural network for GAN training and synthetic data generation.


System information:

Our computing server was equipped with two Intel Xeon E5-2667 processors (each with 8 physical cores), 512 GB of RAM, eight Nvidia GeForce GTX 1080 Ti GPUs, and CUDA 8.0, although we used only a single GPU at a time to train the models. We implemented our methods with TensorFlow 1.4.
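The remark about using a single GPU at a time can be illustrated with a short sketch. One common way to restrict a process to one device on a multi-GPU machine is to set the CUDA environment variable CUDA_VISIBLE_DEVICES before TensorFlow is imported. The helper name use_single_gpu and the device index 0 are assumptions made for illustration; this is not necessarily how the experiments themselves were configured.

```python
import os


def use_single_gpu(device_index=0):
    """Expose only one GPU to CUDA-based libraries started by this process.

    Hypothetical helper: call it before importing TensorFlow, because the
    variable is read when CUDA is first initialised.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


if __name__ == "__main__":
    visible = use_single_gpu(0)  # index 0 is an arbitrary choice
    print("GPUs visible to this process:", visible)
```

Setting the variable in the shell before launching the training script would have the same effect.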

Installation:

1. Install ‘Python 3+’

2. Install ‘TensorFlow 1.4’

3. Download/clone the source code ‘model.py’ and ‘train.py’ from the GitHub repository.
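After the installation steps, a quick check of the prerequisites may be useful. The following is a minimal sketch, not part of the distributed source code: the function name check_environment and its parameters are hypothetical, and it only tests whether Python 3 is running and whether a package named tensorflow can be located, without verifying that the version is 1.4.

```python
import importlib.util
import sys


def check_environment(required_module="tensorflow", min_python=(3, 0)):
    """Return a dict describing whether the prerequisites appear to be met."""
    return {
        "python_ok": sys.version_info >= min_python,
        # find_spec only locates the package; it does not import it.
        "module_found": importlib.util.find_spec(required_module) is not None,
    }


if __name__ == "__main__":
    status = check_environment()
    print("Python 3 or later:", status["python_ok"])
    print("TensorFlow found: ", status["module_found"])
```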

Steps to generate synthetic data:

1. Prepare the training data: Request access to MIMIC-III data from https://mimic.physionet.org/gettingstarted/access/

2. Download the MIMIC-III data, aggregate the medical codes (e.g., diagnosis codes, medication codes, or procedure codes) for each patient, and save them as a NumPy data file (.npy file)

3. Train the GAN and generate synthetic data

• Usage:

$ python train.py --model [model or output directory name (must contain keyword ‘medGAN’, ‘medWGAN’, or ‘medBGAN’)] --data_file [path to the training data (npy format)] --data_type [binary/count] --n_pretrain_epoch 100 --n_epoch 1000

• During training, a progress bar will be shown for each epoch. In addition, a directory named after the model (‘medGAN’ by default) will be created with two subfolders, ‘model’ and ‘output’; the former contains the model checkpoints and the latter contains the synthetic data (called ‘generated.npy’ by default)

• To specify the output folder name, add the parameter ‘--model [model_name]’, where [model_name] is ‘medGAN’ with any prefix or postfix, such as ‘medGAN_n_epoch_500’

• To run an improved GAN, add the parameter ‘--model medWGAN’ or ‘--model medBGAN’. Again, ‘medWGAN’ and ‘medBGAN’ can also have any prefix or postfix

• For more parameters, please refer to the source code in ‘train.py’.
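The data preparation in step 2 can be sketched as follows. This is an illustrative outline under stated assumptions rather than the preprocessing actually used in this work: the function names aggregate_codes and save_as_npy, the (patient_id, code) input format, and the toy records are all hypothetical. The sketch builds one row per patient and one column per distinct code, holding either occurrence counts or 0/1 indicators, mirroring the binary/count choice of the data_type parameter.

```python
from collections import Counter


def aggregate_codes(records, data_type="binary"):
    """Build a patient-by-code matrix from (patient_id, code) pairs.

    Returns (matrix, patient_ids, codes); rows follow patient_ids and
    columns follow codes, both in sorted order.
    """
    if data_type not in ("binary", "count"):
        raise ValueError("data_type must be 'binary' or 'count'")

    # Count how often each code occurs for each patient.
    counts = {}
    for patient_id, code in records:
        counts.setdefault(patient_id, Counter())[code] += 1

    patient_ids = sorted(counts)
    codes = sorted({code for per_patient in counts.values() for code in per_patient})

    matrix = []
    for patient_id in patient_ids:
        # A Counter returns 0 for codes the patient never received.
        row = [counts[patient_id][code] for code in codes]
        if data_type == "binary":
            row = [1 if value > 0 else 0 for value in row]
        matrix.append(row)
    return matrix, patient_ids, codes


def save_as_npy(matrix, path):
    """Write the matrix to a .npy file (requires NumPy)."""
    import numpy as np  # imported here so aggregate_codes works without NumPy

    np.save(path, np.asarray(matrix))


if __name__ == "__main__":
    # Made-up records for illustration; real input would come from MIMIC-III.
    toy_records = [("p1", "401.9"), ("p1", "401.9"), ("p2", "250.00"), ("p2", "401.9")]
    matrix, patient_ids, codes = aggregate_codes(toy_records, data_type="count")
    print(codes)
    print(matrix)
```

Calling save_as_npy(matrix, "training_data.npy") would then write a file whose path can be passed to the data_file parameter; the file name here is only an example.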
