
Table 3.11 Association rule mining results

Dataset              Rules in    Generative   Rules in          Matched   Matched rules
                     real data   model        synthetic data    rules     (% of real data)
MIMIC-III            72          medGAN       180               61        84.72%
                                 medWGAN      64                52        72.22%
                                 medBGAN      153               67        93.05%
Extended MIMIC-III   154         medGAN       274               134       87.01%
                                 medWGAN      201               142       92.20%
                                 medBGAN      229               150       97.40%
NHIRD, Taiwan        63          medGAN       1,350             56        88.88%
                                 medWGAN      62                50        79.36%
                                 medBGAN      520               60        95.23%
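The last column of Table 3.11 is the share of real-data rules that also appear in the corresponding synthetic data (the MIMIC-III real-data rule count of 72 is implied by these percentages):

```latex
\text{matched rules (\%)} =
  \frac{\#\{\text{rules found in both the real and the synthetic data}\}}
       {\#\{\text{rules found in the real data}\}} \times 100,
\qquad \text{e.g. } \frac{67}{72}\times 100 \approx 93.05\% \text{ for medBGAN on MIMIC-III.}
```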

NHIRD, Taiwan: For the NHIRD data, medBGAN also yields the highest number of matched rules, 60 (95.23% of the rules in the real data). However, medWGAN fails to outperform medGAN.

The association rule mining results clearly show that medBGAN is able to reproduce most of the rules seen in the real data (93.05% for MIMIC-III, 97.40% for extended MIMIC-III and 95.23% for NHIRD). Compared with the baseline model, medBGAN achieves an improvement of up to 10.39 percentage points (from 87.01% to 97.40% on the extended MIMIC-III dataset). In contrast, medWGAN fails to achieve the expected results (72.22% for MIMIC-III, 92.20% for extended MIMIC-III and 79.36% for NHIRD); it outperforms medGAN only on the extended MIMIC-III dataset.
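These overlap percentages could be reproduced along the following lines, assuming the records are available as one-hot encoded pandas DataFrames (real_df and synth_df are hypothetical names) and using mlxtend-style Apriori mining; the support and confidence thresholds shown are illustrative, not the settings used in this study.

```python
# Sketch: fraction of real-data association rules reproduced in the synthetic data.
# real_df / synth_df are hypothetical 0/1 patient-by-code DataFrames.
from mlxtend.frequent_patterns import apriori, association_rules

def mine_rules(df, min_support=0.05, min_confidence=0.8):
    """Mine association rules and return them as a set of (antecedent, consequent) pairs."""
    itemsets = apriori(df.astype(bool), min_support=min_support, use_colnames=True)
    rules = association_rules(itemsets, metric="confidence", min_threshold=min_confidence)
    return set(zip(rules["antecedents"], rules["consequents"]))

real_rules = mine_rules(real_df)
synth_rules = mine_rules(synth_df)
matched = real_rules & synth_rules
print(f"synthetic rules: {len(synth_rules)}, matched: {len(matched)} "
      f"({100 * len(matched) / len(real_rules):.2f}% of the rules in the real data)")
```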

3.3.5 Dimension-wise prediction performance

This section examines how well the synthetic data created by the generative models perform in machine learning prediction tasks compared with the real data. Here, we show the dimension-wise prediction performance on both binary and count variables for the MIMIC-III, extended MIMIC-III and NHIRD synthetic data, using each of the three aforementioned predictive models.


Prediction performance for binary data

Figure 3.4 shows the dimension-wise prediction results of the three generative models obtained from the logistic regression model trained on MIMIC-III, extended MIMIC-III and NHIRD synthetic binary data and the corresponding real data. In the scatterplots of Figure 3.4, each dot represents one ICD code. The x-axis represents the F1-scores of the logistic regression model trained on the real binary data, and the y-axis represents the F1-scores of the logistic regression model trained on the synthetic binary data. The prediction performances, calculated as the correlation coefficients (CCs) between the synthetic and real data prediction results (F1-scores), are also shown in the scatterplots. Regarding the prediction performances for MIMIC-III binary data in Figure 3.4a, extended MIMIC-III binary data in Figure 3.4b and NHIRD binary data in Figure 3.4c, both medWGAN and medBGAN outperform medGAN. Notably, medBGAN shows the highest performance of all the generative models.
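To make the dimension-wise prediction procedure concrete, the following sketch trains one logistic regression per ICD-code column, once on real and once on synthetic data, evaluates F1-scores on the same held-out real test set, and correlates the two F1-score vectors. The variable names (real_train, synth_train, real_test) and the exact train/test protocol and hyperparameters are assumptions, not the settings used in this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def dimension_wise_f1(train_X, test_X):
    """One F1-score per dimension: predict column j from the remaining columns."""
    scores = []
    for j in range(train_X.shape[1]):
        X_tr, y_tr = np.delete(train_X, j, axis=1), train_X[:, j]
        X_te, y_te = np.delete(test_X, j, axis=1), test_X[:, j]
        # Columns whose training labels contain only one class would need to be
        # skipped in practice; omitted here to keep the sketch short.
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(f1_score(y_te, clf.predict(X_te), zero_division=0))
    return np.array(scores)

# F1-scores of a model trained on real data vs. one trained on synthetic data,
# both evaluated on the same held-out real test set (real_train, synth_train and
# real_test are hypothetical 0/1 patient-by-code matrices).
f1_real = dimension_wise_f1(real_train, real_test)
f1_synth = dimension_wise_f1(synth_train, real_test)
cc = np.corrcoef(f1_real, f1_synth)[0, 1]  # correlation coefficient (CC) shown in the scatterplot
```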

As with the logistic regression results for binary data, Figures 3.5 and 3.6 show the prediction results of the other two machine learning classifiers, random forests and SVM, respectively. Regarding the prediction performances of random forests, shown in Figures 3.5a, 3.5b and 3.5c, both medWGAN and medBGAN outperform medGAN for MIMIC-III and extended MIMIC-III binary data, but only medBGAN shows better results than the baseline method for NHIRD binary data. On the other hand, for the prediction performances of SVM, shown in Figures 3.6a, 3.6b and 3.6c, the proposed models slightly outperform medGAN or show comparable results to it for MIMIC-III and extended MIMIC-III binary data, but provide superior results compared with the baseline method for NHIRD binary data.
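The random forests and SVM results can be obtained with the same procedure by swapping the per-dimension classifier in the sketch above; a minimal sketch with placeholder hyperparameters (not the settings used in this study):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Drop-in alternatives for the per-dimension classifier in dimension_wise_f1;
# hyperparameters are illustrative placeholders only.
classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forests": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(kernel="rbf"),
}
```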


Fig. 3.4 Scatterplots of dimension-wise prediction results (F1-scores) of logistic regression model trained on real binary data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.

Fig. 3.5 Scatterplots of dimension-wise prediction results (F1-scores) of random forests model trained on real binary data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.

Fig. 3.6 Scatterplots of dimension-wise prediction results (F1-scores) of SVM model trained on real binary data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.

Prediction performance for count data

Figure 3.7 shows the dimension-wise prediction results of the three generative models obtained from the logistic regression model trained on MIMIC-III, extended MIMIC-III and NHIRD synthetic count data and the corresponding real data. In the scatterplots of Figure 3.7, each dot represents one ICD code. The x-axis represents the F1-scores of the logistic regression model trained on the real count data, and the y-axis represents the F1-scores of the logistic regression model trained on the synthetic count data. The prediction performances, calculated as the correlation coefficients (CCs) between the synthetic and real data prediction results (F1-scores), are also shown in the scatterplots. Regarding the dimension-wise prediction performances for MIMIC-III count data in Figure 3.7a and for extended MIMIC-III count data in Figure 3.7b, both medWGAN and medBGAN outperform medGAN; medWGAN performs best for MIMIC-III and medBGAN performs best for extended MIMIC-III, although the two are very close in both cases. By contrast, for NHIRD count data in Figure 3.7c, medWGAN has a slightly higher performance level than the other models, but we observe no significant differences among the three generative models.

As with the logistic regression results for count data, Figures 3.8 and 3.9 show the prediction results of the other two machine learning classifiers, random forests and SVM, respectively. Regarding the prediction performances of random forests, shown in Figures 3.8a, 3.8b and 3.8c, both medWGAN and medBGAN outperform medGAN for MIMIC-III and extended MIMIC-III count data but show comparable results to the baseline method for NHIRD count data. On the other hand, for the prediction performances of SVM, shown in Figures 3.9a, 3.9b and 3.9c, the proposed models slightly outperform medGAN or show comparable results to it for MIMIC-III and NHIRD count data, but provide superior results compared with the baseline method for extended MIMIC-III count data.

Fig. 3.7 Scatterplots of dimension-wise prediction results (F1-scores) of logistic regression model trained on real count data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.

Fig. 3.8 Scatterplots of dimension-wise prediction results (F1-scores) of random forests model trained on real count data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.

Fig. 3.9 Scatterplots of dimension-wise prediction results (F1-scores) of SVM model trained on real count data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.

Summary of the prediction results

The prediction performances of the three generative models obtained from the three predictive classifiers are also shown in Table 3.12. Table 3.13 summarizes the best generative model for each prediction task on the various synthetic datasets. In the logistic regression results, we observe that medBGAN performs better than the other two generative models for all binary datasets, and medWGAN does better for the MIMIC-III and NHIRD count datasets. In the random forests results, medBGAN performs better than the other two generative models except on the extended MIMIC-III and NHIRD count datasets.

In the SVM predictions, medBGAN always outperforms the other generative models, although in some cases the results are very close. From Tables 3.12 and 3.13, we can say that our models (medBGAN and medWGAN) outperform the baseline model medGAN in each of the three predictive modeling tasks. In summary, we achieved maximum improvements of 6.41% in logistic regression, 10.70% in random forests, and 8.19% in SVM compared with the baseline model's performance.
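For reference, a rough sketch of how such improvement figures could be derived from the CCs in Table 3.12; since the exact definition of "improvement" is not reproduced here, both an absolute and a relative difference are shown, and the CC values below are hypothetical.

```python
def improvement(cc_proposed: float, cc_baseline: float) -> dict:
    """Compare a proposed model's correlation coefficient (CC) with the medGAN baseline."""
    return {
        "absolute_difference": cc_proposed - cc_baseline,
        "relative_improvement_pct": 100.0 * (cc_proposed - cc_baseline) / cc_baseline,
    }

# Hypothetical CC values, for illustration only (not taken from Table 3.12):
print(improvement(cc_proposed=0.92, cc_baseline=0.85))
```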


Table 3.12 Prediction performances of the three generative models: correlation coefficients (CCs) between the synthetic and real data prediction results, for logistic regression, random forest and SVM, on each dataset and data type.

Table 3.13 Summary of prediction performances

Dataset     Data type   Best generative model of prediction
                        Logistic regression   Random forest   SVM
MIMIC-III   Binary      medBGAN               medBGAN         medBGAN
            Count       medWGAN               medBGAN         medBGAN

3.4 Discussion

This section first presents a summary of the evaluation results analyzed in the previous section. It then discusses the comparative results among the proposed models and the datasets, and describes the privacy issues of this study.

3.4.1 Summary of the results

A summary of our evaluation results is presented in Table 3.14, which indicates the best model for each evaluation criterion on each synthetic dataset. As observed in Section 3.3 (and in Table 3.14), the results of the evaluation methods, e.g., the dimension-wise average and the K–S test, show that the synthetic EHR data generated by the proposed models are statistically sound. In most cases, we obtained better performance than the baseline medGAN model, and in a few cases we had comparable results. For the predictive modeling tasks, our models (medBGAN and medWGAN) showed excellent results and outperformed medGAN in all cases. For association rule mining, medBGAN did the best, but medWGAN failed to outperform the baseline method. To summarize, in every evaluation our model medBGAN clearly outperforms the baseline model medGAN. Although in some cases medWGAN did not work as well, overall it performs better than medGAN.

3.4.2 MIMIC-III versus extended MIMIC-III

An important purpose of using two different MIMIC-III datasets in this study was to investigate whether our proposed models can be applied to a dataset containing several EHR data types simultaneously. For this reason, in addition to the MIMIC-III diagnoses dataset, we employed the extended MIMIC-III dataset, which includes both diagnosis and procedure EHR data. Table 3.14 shows that the evaluation results of the extended MIMIC-III dataset are
