Realistic data synthesis using enhanced generative adversarial networks
Baowaly, Mrinal Kanti
Mrinal Kanti Baowaly
Electronic health records
Synthetic data generation
Generative adversarial networks
Wasserstein GANs with Gradient Penalty
|Issue Date: ||2019-06-03 13:08:37 (UTC+8)|
Real data are often unavailable or too costly to obtain, in both time and money, frequently because of privacy and confidentiality concerns. In such situations, synthetic data are a good alternative. The primary objective of this study is to generate realistic synthetic electronic health records (EHRs) that can be used freely to advance research in healthcare and related fields. We propose two synthetic data generation models, medical Wasserstein GAN with gradient penalty (medWGAN) and medical boundary-seeking GAN (medBGAN), and compare their performance with that of an existing method, medical GAN (medGAN). The proposed models are based on two enhanced variants of generative adversarial networks (GANs): Wasserstein GAN with gradient penalty (WGAN-GP) and boundary-seeking GAN (BGAN). We perform data synthesis on three aggregated EHR datasets with discrete features (e.g., binary and count) in the medical domain: MIMIC-III, extended MIMIC-III, and the National Health Insurance Research Database (NHIRD), Taiwan. We first train the models and then use the trained models to generate synthetic EHR data. We then analyze and compare the models’ performance using statistical methods (dimension-wise average and the Kolmogorov–Smirnov test) and two machine learning tasks (association rule mining and prediction). Our comprehensive analysis shows that the proposed models generate more realistic synthetic EHR data than medGAN.
Our models can be applied to generate realistic synthetic data even beyond the medical domain. To demonstrate their generality, we also investigate an aggregated crime dataset from the Los Angeles Police Department, and the results confirm that the models work in a wide range of applications. We show that the proposed models are suitable for producing high-quality synthetic data with discrete features that are statistically sound and good enough for machine learning tasks. We believe the proposed models will be useful in industry and research for generating realistic synthetic data. This study will help to remove barriers such as limited access to confidential data and thus accelerate the development of medical informatics, healthcare, and related fields.
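The two statistical checks named in the abstract can be illustrated with a minimal sketch. It assumes each record is a row of discrete feature values, and it compares real and synthetic data feature by feature. The toy matrices and the helper names `dimension_wise_average` and `ks_statistic` are hypothetical and are not taken from the thesis; the exact evaluation procedure used there may differ.

```python
from bisect import bisect_right


def dimension_wise_average(rows):
    """Mean of each feature (column) across all records."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of samples a and b."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    for v in set(sa) | set(sb):
        cdf_a = bisect_right(sa, v) / len(sa)
        cdf_b = bisect_right(sb, v) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical toy data: 4 records, 3 features (two binary, one count).
real = [[1, 0, 2], [0, 0, 1], [1, 1, 0], [1, 0, 3]]
synthetic = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [1, 0, 2]]

real_avg = dimension_wise_average(real)
synth_avg = dimension_wise_average(synthetic)

# One KS statistic per feature, comparing real and synthetic columns.
ks_per_feature = [
    ks_statistic(r, s) for r, s in zip(zip(*real), zip(*synthetic))
]

print(real_avg, synth_avg, ks_per_feature)
```

Under these assumptions, per-feature averages that lie close together and small KS statistics would suggest that the synthetic data follow the real marginal distributions. Neither check captures dependencies between features, which is presumably why the study also evaluates association rule mining and prediction.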
|Reference: || Mrinal Kanti Baowaly, Chia-Ching Lin, Chao-Lin Liu, and Kuan-Ta Chen. Synthesizing Electronic Health Records Using Improved Generative Adversarial Networks. Journal of the American Medical Informatics Association, 26(3):228–241, December 2018.|
 Mrinal Kanti Baowaly, Chao-Lin Liu, and Kuan-Ta Chen. Realistic Data Synthesis Using Enhanced Generative Adversarial Networks. In 2019 IEEE International Conference on Artificial Intelligence and Knowledge Engineering (IEEE AIKE 2019). IEEE, June 2019.
 Donald B Rubin. Statistical disclosure limitation. Journal of Official Statistics, 9(2):461–468, 1993.
 Office for Civil Rights. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. U.S. Department of Health and Human Services, November 2013. [online] https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html, Accessed 12 Mar 2017.
 Khaled El Emam, Elizabeth Jonker, Luk Arbuckle, and Bradley Malin. A systematic review of re-identification attacks on health data. PLoS ONE, 6(12):e28071, 2011.
 Khaled El Emam, Sam Rodgers, and Bradley Malin. Anonymising and sharing individual patient data. BMJ, 350:h1139, 2015.
 Ross Anderson. Under threat: patient confidentiality and NHS computing. Drugs and Alcohol Today, 6(4):13–17, 2006.
 Paul Ohm. Broken promises of privacy: Responding to the surprising failure of anonymization (August 13, 2009). UCLA Law Review, 57:1701, 2010.
 Melissa Gymrek, Amy L. McGuire, David Golan, Eran Halperin, and Yaniv Erlich. Identifying Personal Genomes by Surname Inference. Science, 339(6117):321–324, 2013.
 Jason Walonoski, Mark Kramer, Joseph Nichols, et al. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association, 25(3):230–238, 2018.
 John M. Abowd and Julia Lane. New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers. In Josep Domingo-Ferrer and Vicenç Torra, editors, Privacy in Statistical Databases, pages 282–289, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.
 Roderick JA Little. Statistical Analysis of Masked Data. Journal of Official Statistics, 9:407–407, 1993.
 Jim Gray, Prakash Sundaresan, Susanne Englert, Ken Baclawski, and Peter J. Weinberger. Quickly Generating Billion-record Synthetic Databases. SIGMOD Rec., 23(2):243–252, May 1994.
 Stephen E Fienberg and Russell J Steele. Disclosure Limitation Using Perturbation and Related Methods for Categorical Data. Journal of Official Statistics, 14(4):485, 1998.
 Stephen E Fienberg. A radical proposal for the provision of micro-data samples and the preservation of confidentiality. Department of statistics, 1994.
 SE Fienberg. Taking uncertainty and error in censuses and surveys seriously. In Proceedings of Statistics Canada Symposium 95: From Data to Information-Methods and Systems, 1996.
 Stephen E Fienberg, Russell J Steele, and Udi E Makov. Statistical notions of data disclosure avoidance and their relationship to traditional statistical methodology: data swapping and log-linear models. In Proceedings of Bureau of the Census 1996 Annual Research Conference, pages 87–105, 1996.
 Trivellore E Raghunathan, Jerome P Reiter, and Donald B Rubin. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1):1, 2003.
 Yaling Pei and Osmar Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, TR06-15, 2006.
 Kenneth Houkjær, Kristian Torp, and Rico Wind. Simple and realistic data generation. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06, pages 1243–1246. VLDB Endowment, 2006.
 Peter Christen and Agus Pudjijono. Accurate synthetic generation of realistic personal information. In Advances in Knowledge Discovery and Data Mining, pages 507–514, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.
 M. Bozkurt and M. Harman. Automatically generating realistic test input from web services. In Proceedings of 2011 IEEE 6th International Symposium on Service Oriented System (SOSE), pages 13–24, Dec 2011.
 Joseph S. Lombardo and Linda J. Moniz. A Method for Generation and Distribution of Synthetic Medical Record Data for Evaluation of Disease-Monitoring Systems. Johns Hopkins APL Technical Digest, 27(4), 2008.
 Anna L Buczak, Steven Babin, and Linda Moniz. Data-driven approach for creating synthetic electronic medical records. BMC medical informatics and decision making, 10(1):59, 2010.
 S. McLachlan, K. Dube, and T. Gallagher. Using the CareMap with Health Incidents Statistics for Generating the Realistic Synthetic Electronic Healthcare Record. In 2016 IEEE International Conference on Healthcare Informatics (ICHI), pages 439–448, October 2016.
 Y. Park, J. Ghosh, and M. Shankar. Perturbed Gibbs Samplers for Generating Large-Scale Privacy-Safe Synthetic Health Data. In 2013 IEEE International Conference on Healthcare Informatics, pages 493–498, September 2013.
 S. McLachlan. Realism in synthetic data generation. Massey University, Palmerston North, New Zealand, February 2017. [online] http://hdl.handle.net/10179/11569, Accessed 5 Oct 2017.
 Edward Choi, Siddharth Biswal, Bradley Malin, et al. Generating Multi-label Discrete Electronic Health Records using Generative Adversarial Networks. CoRR, abs/1703.06490, 2017.
 Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
 Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems 29, pages 2234–2242. Curran Associates, Inc., 2016.
 Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. CoRR, abs/1511.06434, 2015.
 Yanghua Jin, Jiakai Zhang, Minjun Li, et al. Towards the Automatic Anime Characters Creation with Generative Adversarial Networks. CoRR, abs/1708.05509, 2017.
 Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, et al. High-resolution image synthesis and semantic manipulation with conditional gans. arXiv preprint arXiv:1711.11585, 2017.
 Scott Reed, Zeynep Akata, Xinchen Yan, et al. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
 Han Zhang, Tao Xu, Hongsheng Li, et al. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint, 2017.
 Hao Dong, Paarth Neekhara, Chao Wu, and Yike Guo. Unsupervised image-to-image translation with generative adversarial networks. arXiv preprint arXiv:1701.02676, 2017.
 Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
 Xun Huang, Ming-Yu Liu, Serge J. Belongie, and Jan Kautz. Multimodal Unsupervised Image-to-Image Translation. CoRR, abs/1804.04732, 2018.
 Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating Videos with Scene Dynamics. In Advances in Neural Information Processing Systems 29, pages 613–621. Curran Associates, Inc., October 2016.
 Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993, 2017.
 Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation using 1D and 2D Conditions. CoRR, abs/1703.10847, 2017.
 Matt J Kusner and José Miguel Hernández-Lobato. Gans for sequences of discrete elements with the gumbel-softmax distribution. arXiv preprint arXiv:1611.04051, 2016.
 Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In AAAI, pages 2852–2858, March 2017.
 R Devon Hjelm, A. P. Jacob, T. Che, et al. Boundary-Seeking Generative Adversarial Networks. ArXiv e-prints, 2017.
 Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, et al. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30, pages 5767–5777. Curran Associates, Inc., 2017.
 appliedAI. Synthetic Data: An Introduction & 10 Tools. [online] https://blog.appliedai.com/synthetic-data/, Accessed 31 July 2018.
 E. L. Barse, H. Kvarnstrom, and E. Jonsson. Synthesizing test data for fraud detection systems. In 19th Annual Computer Security Applications Conference, 2003. Proceedings., pages 384–394, Dec 2003.
 Margaret Rouse and Nicole Laskowski. Synthetic data. [online] https://searchcio.techtarget.com/definition/synthetic-data, Accessed 11 May 2018.
 Yann LeCun. What are some recent and potentially upcoming breakthroughs in deep learning?, July 2016. [online] https://www.quora.com/What-are-some-recent-and-potentially-upcoming-breakthroughs-in-deep-learning, Accessed 3 November 2017.
 Ian J. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. CoRR, abs/1701.00160, April 2017.
 Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. CoRR, abs/1701.07875, December 2017.
 Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, Massachusetts, United States, 2016. http://www.deeplearningbook.org.
 Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 1096–1103, New York, NY, USA, 2008. ACM.
 G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.
 Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, et al. MIMIC-III, a freely accessible critical care database. Scientific Data, May 2016. [online] https://doi.org/10.1038/sdata.2016.35, Accessed 5 October 2016.
 International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). National Center for Health Statistics (NCHS) and the Centers for Medicare & Medicaid Services (CMS). [online] https://www.cdc.gov/nchs/icd/icd9cm.htm, Accessed 30 June 2017.
 National Health Insurance Research Database, Taiwan. National Health Insurance Administration, Ministry of Health and Welfare, Taiwan. [online] http://nhird.nhri.org.tw/en/, Accessed 10 January 2016.
 Diseases and Injuries Tabular Index. National Center for Health Statistics (NCHS) and the Centers for Medicare & Medicaid Services (CMS). [online] http://icd9.chrisendres.com/index.php?action=contents, Accessed 10 July 2017.
 Procedures Index. National Center for Health Statistics (NCHS) and the Centers for Medicare & Medicaid Services (CMS). [online] http://icd9.chrisendres.com/index.php?action=procslist, Accessed 10 July 2017.
 Blanca E. Himes, Yi Dai, Isaac S. Kohane, et al. Prediction of Chronic Obstructive Pulmonary Disease (COPD) in Asthma Patients Using Electronic Medical Records. Journal of the American Medical Informatics Association, 16(3):371–379, 2009.
 Jionglin Wu, Jason Roy, and Walter F. Stewart. Prediction Modeling Using EHR Data: Challenges, Strategies, and a Comparison of Machine Learning Approaches. Medical Care, 48(6):S106–S113, 2010.
 Sandy H Huang, Paea LePendu, Srinivasan V Iyer, et al. Toward personalizing treatment for depression: predicting diagnosis and severity. Journal of the American Medical Informatics Association, 21(6):1069–1075, 2014.
 Pedro L Teixeira, Wei-Qi Wei, Robert M Cronin, et al. Evaluating electronic health record data sources and algorithmic approaches to identify hypertensive individuals. Journal of the American Medical Informatics Association, 24(1):162–171, 2017.
 medGAN Source Code. GitHub repository. [online] https://github.com/mp2893/medgan, Accessed 15 November 2017.
 Wikipedia contributors. Kolmogorov–Smirnov test — Wikipedia, the free encyclopedia. [online] https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test, Accessed 20 November 2017.
 Pranjul Yadav, Michael Steinbach, Vipin Kumar, and Gyorgy Simon. Mining Electronic Health Records (EHRs): A Survey. ACM Computing Surveys (CSUR), 50(6):85:1–85:40, January 2018.
 Adam Wright, Elizabeth S. Chen, and Francine L. Maloney. An automated technique for identifying associations between medications, laboratory results and problems. Journal of Biomedical Informatics, 43(6):891–901, 2010.
 Shin AM, Lee IH, Lee GH, et al. Diagnostic Analysis of Patients with Essential Hypertension Using Association Rule Mining. Healthcare Informatics Research, 16(2):77–81, June 2010.
 Jimeng Sun, Candace D McNaughton, Ping Zhang, et al. Predicting changes in hypertension control using electronic health records from a chronic disease management program. Journal of the American Medical Informatics Association, 21(2):337–344, 2014.
 Los Angeles’ Crime Data, Los Angeles Police Department, USA. [online] https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq, Accessed 15 January 2018.
|Source URI: ||http://thesis.lib.nccu.edu.tw/record/#G0104761507|
|Data Type: ||thesis|
|Appears in Collections:||[International Doctoral Degree Program in Social Networks and Human-Centered Computing (TIGP)] Theses|