
The description and the evaluation procedure of each method have been discussed earlier in section 3.2.3.

4.3 Evaluation Results on Synthetic Crime Data

This section presents the evaluation results on the synthetic crime data generated by the proposed models. As mentioned in previous sections, we used both the binary and count versions of the dataset. We applied three different generative models, namely medGAN, medWGAN, and medBGAN, to generate synthetic data. The following subsections compare the performance of the three models in producing synthetic crime data.

4.3.1 Dimension-wise average for binary data

Figure 4.2 shows the dimension-wise average of the synthetic binary data produced by the three different generative models. Each scatterplot displays the performance of one generative model. In the scatterplots of Figure 4.2, each dot represents one crime code. The dimension-wise average for binary data can be referred to as the dimension-wise probability. The x-axis represents the Bernoulli success probability of each crime in the real data, and the y-axis represents the success probability of each crime in the synthetic data. The diagonal line indicates the ideal case, in which the performance on synthetic data is identical to that on real data. To measure the performance of each generative model numerically, we used the correlation coefficient (CC) between the real and synthetic data. Table 4.7 lists the comparative performances of the generative models for the crime binary data. The proposed medWGAN and medBGAN yield slightly superior performance to the baseline model medGAN, but the performances of all three models are very close to the highest mark (100%). Among the three models, medBGAN performs best.

Fig. 4.2 Scatterplots of dimension-wise average results on real binary data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.

Table 4.7 Dimension-wise average for crime data
(Columns: Dataset | Generative model | Correlation)
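To make the computation concrete, the following is a minimal sketch of the dimension-wise average comparison, assuming the real and synthetic crime matrices are available as NumPy arrays with one column per crime code; the function and variable names, and the use of SciPy's Pearson correlation, are illustrative assumptions rather than the thesis code. The same routine applies to the count version in section 4.3.2, where the column mean is an average count rather than a probability.

import numpy as np
from scipy.stats import pearsonr

def dimension_wise_average_cc(real: np.ndarray, synthetic: np.ndarray):
    """Per-dimension averages and their correlation coefficient (CC).

    For binary data the column mean is the Bernoulli success probability of
    each crime code; for count data it is the average count per crime code.
    """
    real_avg = real.mean(axis=0)         # one value per crime code (x-axis)
    syn_avg = synthetic.mean(axis=0)     # one value per crime code (y-axis)
    cc = pearsonr(real_avg, syn_avg)[0]  # closeness to the diagonal line
    return real_avg, syn_avg, cc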

4.3.2 Dimension-wise average for count data

Figure 4.3 shows the dimension-wise average of the synthetic count data produced by the three different generative models. Each scatterplot displays the performance of one generative model. In the scatterplots of Figure 4.3, each dot represents one crime code. The x-axis represents the average count of each crime in the real data, and the y-axis represents the average count of each crime in the synthetic data. The performances (CCs) of the dimension-wise average for count data are also listed in Table 4.7. According to Figure 4.3 and Table 4.7, both medWGAN and medBGAN show a small improvement over medGAN for crime count data.

Fig. 4.3 Scatterplots of dimension-wise average count results on real count data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.

4.3.3 K–S test results

We applied the dimension-wise K–S test to examine whether a specific sample (say xi) of the synthetic crime data and the corresponding sample (say yi) of the real data, identified by the same dimension name, originate from a population with the same distribution; the outcome for each dimension is recorded as 1 (same distribution) or 0 (different distribution).

Then, we calculated the total percentage of similarity between each synthetic dataset and the corresponding real dataset. The derived results are summarized in Table 4.8, which shows the percentage of similarity between the synthetic crime data generated by the three generative models and their real data counterparts. We observe that the proposed medBGAN outperforms medGAN for both binary and count data, whereas medWGAN outperforms medGAN (and also medBGAN) only for count data. Overall, we achieved substantial improvements in the K–S test results on crime data, the largest being 8.70%.

Table 4.8 K–S test results for crime data
(Columns: Dataset | Generative model | K–S test similarity)
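As a concrete illustration, the following sketch shows one way to compute the dimension-wise K–S similarity with SciPy's two-sample test; the 5% significance level and the array names are assumptions made for illustration, since the exact procedure is defined in section 3.2.3.

import numpy as np
from scipy.stats import ks_2samp

def ks_similarity(real: np.ndarray, synthetic: np.ndarray, alpha: float = 0.05) -> float:
    """Percentage of dimensions whose real and synthetic samples are judged
    to come from the same distribution (outcome 1) rather than not (0)."""
    outcomes = []
    for j in range(real.shape[1]):
        _, p_value = ks_2samp(real[:, j], synthetic[:, j])
        outcomes.append(1 if p_value > alpha else 0)  # fail to reject H0 -> same distribution
    return 100.0 * float(np.mean(outcomes))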

4.3.4 Association rule mining results

As described in sections 3.2.3 and 3.3.4, we calculated the association rule mining evaluation results for the crime data and listed them in Table 4.9. We found 170 rules in the real dataset using the Apriori algorithm with minimum support = 0.01 and minimum confidence = 0.40, and we maintained the same parameter settings for the corresponding synthetic datasets.

The number of rules found in each synthetic dataset, the number of reproduced rules, and the percentage of reproduced rules are summarized in Table 4.9. From the association rule mining results, it is clear that medBGAN is able to reproduce most of the rules found in the real data: 167 rules (98.23%). Compared with the baseline model, medBGAN achieves a 6.47% improvement. In contrast, medWGAN reproduces only 86 rules (50.58%) and fails to outperform medGAN.

Table 4.9 Association rule mining results for crime data
Generative model | No. of rules found | No. of reproduced rules | % of reproduced rules
medGAN           | 500                | 156                     | 91.76 %
medWGAN          | 98                 | 86                      | 50.58 %
medBGAN          | 411                | 167                     | 98.23 %
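The rule comparison can be sketched as follows using the mlxtend implementation of Apriori; the DataFrame arguments and the choice of mlxtend (rather than whatever toolkit the thesis actually used) are assumptions, but the support and confidence thresholds match those stated above.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

def mine_rules(df: pd.DataFrame) -> set:
    """Mine rules with minimum support 0.01 and minimum confidence 0.40."""
    itemsets = apriori(df.astype(bool), min_support=0.01, use_colnames=True)
    rules = association_rules(itemsets, metric="confidence", min_threshold=0.40)
    # Represent each rule as a hashable (antecedents, consequents) pair.
    return {(a, c) for a, c in zip(rules["antecedents"], rules["consequents"])}

def reproduced_rule_ratio(real_df: pd.DataFrame, syn_df: pd.DataFrame) -> float:
    """Percentage of real-data rules that reappear in the synthetic data."""
    real_rules, syn_rules = mine_rules(real_df), mine_rules(syn_df)
    return 100.0 * len(real_rules & syn_rules) / len(real_rules)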

4.3.5 Dimension-wise prediction performance

This section examines how well the synthetic data created by the generative models perform compared with the real data in machine learning prediction tasks. Here, we show the dimension-wise prediction performance of both binary and count variables for the synthetic crime data using each of the three predictive models mentioned earlier (section 3.2.3).

Figure 4.4 shows the dimension-wise prediction results of the three generative models compared with the corresponding real data. In the scatterplots of Figure 4.4a for binary data and Figure 4.4b for count data, each dot represents one crime code. The x-axis represents the F1-scores of the logistic regression model trained on the real data, and the y-axis represents the F1-scores of the logistic regression model trained on the synthetic data. The prediction performances, calculated as the correlation coefficients (CCs) between the synthetic and real data prediction results (F1-scores), are also shown in the scatterplots. Analogous to the logistic regression results, Figures 4.5 and 4.6 show the prediction results of the other two machine learning classifiers, random forests and SVM, respectively.

Fig. 4.4 Scatterplots of dimension-wise prediction results (F1-scores) of logistic regression model trained on real data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.
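For reference, the following is a hedged sketch of the dimension-wise prediction procedure for the binary data: each dimension in turn serves as the label and the remaining dimensions as features, a classifier is trained on real (or synthetic) data and tested on held-out real data, and the per-dimension F1-scores are then correlated. The argument names, hyper-parameters, and the assumption that every label column contains both classes are illustrative; the exact setup is described in section 3.2.3.

import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def dimension_wise_f1(train: np.ndarray, test: np.ndarray) -> np.ndarray:
    """F1-score per dimension, using that dimension as the label and the rest as features."""
    scores = []
    for j in range(train.shape[1]):
        X_tr, y_tr = np.delete(train, j, axis=1), train[:, j]
        X_te, y_te = np.delete(test, j, axis=1), test[:, j]
        # Assumes each label column contains both classes; constant columns would need skipping.
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(f1_score(y_te, clf.predict(X_te), zero_division=0))
    return np.array(scores)

def prediction_cc(real_train: np.ndarray, real_test: np.ndarray, synthetic: np.ndarray) -> float:
    """CC between per-dimension F1-scores of real-trained and synthetic-trained models."""
    f1_real = dimension_wise_f1(real_train, real_test)  # x-axis of Figure 4.4
    f1_syn = dimension_wise_f1(synthetic, real_test)    # y-axis of Figure 4.4
    return pearsonr(f1_real, f1_syn)[0]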

In the logistic regression prediction results, we observe that medBGAN and medWGAN show slightly better results than medGAN for binary data (Figure 4.4a). For count data, both of our models outperform medGAN (Figure 4.4b); note that medWGAN shows performance comparable to medBGAN in this case. In the random forests prediction results, medBGAN shows better results than the other two generative models for binary data (Figure 4.5a), whereas for count data both of the proposed models outperform medGAN (Figure 4.5b). In the SVM prediction results, only medBGAN outperforms medGAN for binary data (Figure 4.6a), but for count data both of the proposed models produce superior results compared with medGAN (Figure 4.6b).

Fig. 4.5 Scatterplots of dimension-wise prediction results (F1-scores) of random forests model trained on real data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.

The prediction performances (i.e., the correlation coefficients) of the three generative models, obtained from the results (F1-scores) of the three predictive classifiers, are also shown in Table 4.10. Table 4.11 summarizes the best generative models for the prediction tasks on the various synthetic data. From Tables 4.10 and 4.11, we can say that our models (medBGAN and medWGAN) outperform the baseline model medGAN on the predictive modeling tasks. In summary, compared with the baseline model's performance, we achieved maximum improvements of 4.98% in logistic regression, 3.39% in random forests, and 2.19% in SVM.


Fig. 4.6 Scatterplots of dimension-wise prediction results (F1-scores) of SVM model trained on real data (x-axis) versus synthetic counterpart (y-axis) produced by the three generative models.

Table 4.10 Prediction performances for crime data
(Columns: Dataset | Generative model | Correlation coefficients (CCs) between synthetic and real data prediction results, for logistic regression, random forest, and SVM)

Table 4.11 Summary of the prediction performances for crime data
Dataset             | Best generative model of prediction
                    | Logistic regression | Random forest      | SVM
Crime Data (Binary) | medWGAN / medBGAN   | medBGAN            | medBGAN
Crime Data (Count)  | medWGAN / medBGAN   | medWGAN / medBGAN  | medWGAN / medBGAN

4.4 Discussion

This section first presents a summary of the evaluation results analyzed in the previous section and then discusses the comparative results among the proposed models.

4.4.1 Summary of the results

A summary of our evaluation results is presented in Table 4.12. The table indicates the best model for each evaluation criterion on the synthetic datasets. As observed in section 4.3 (and in Table 4.12), the results of the evaluation methods, e.g., the dimension-wise average and the K–S test, show that the synthetic crime data generated by the proposed models are statistically sound. In most cases, we obtained better performance than the baseline medGAN model, and in a few cases, we obtained results comparable to it for both binary and count data. For the predictive modeling tasks, our models (medBGAN and medWGAN) showed excellent results and outperformed medGAN in all cases. For association rule mining, medBGAN performed best, but medWGAN failed to outperform the baseline method.

To summarize, in every evaluation, our model medBGAN clearly outperforms the baseline model medGAN. On the other hand, in some cases on the crime binary data, medWGAN did not work as well as medGAN, but overall it performs better than the baseline model.

4.4.2 medWGAN versus medBGAN

We observe that for synthetic count data, medWGAN outperforms medBGAN only in the K–S test and shows little improvement over, or comparable performance to, medBGAN in dimension-wise prediction. In all the remaining evaluations, medBGAN yields the best performance. Nevertheless, to strengthen this assertion, we analyze the models' performance here from a different perspective, namely the total number of all-zero dimensions in the synthetic data.

Table 4.12 Results summary for crime data
(Columns: Dataset | Evaluation criteria | Best generative model)

While generating the synthetic data, we found that some crimes rarely occurred in our real dataset; that is, some dimensions (columns) contained very few nonzero values. For these dimensions, the models might have generated synthetic data with some all-zero dimensions. Table 4.13 lists these statistics for the count datasets. The table shows that for the synthetic datasets with count variables, medWGAN generates more all-zero dimensions (44.92%) than medBGAN does (15.94%). Therefore, overall, we conclude that the proposed medBGAN outperforms medWGAN as well as the baseline medGAN.

Table 4.13 All-zero dimensions in crime data
Crime Dataset (count variables) | # of dimensions with all-zeros
Original (Real) data            | 0 (0.0 %)
medGAN Synthetic data           | 29 (21.01 %)
medWGAN Synthetic data          | 62 (44.92 %)
medBGAN Synthetic data          | 22 (15.94 %)
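The all-zero-dimension statistic in Table 4.13 amounts to counting columns whose sum is zero; a minimal sketch (with an assumed array argument holding the count-valued crime matrix) is shown below.

import numpy as np

def all_zero_dimensions(data: np.ndarray):
    """Number and percentage of columns (crime codes) that contain only zeros."""
    n_zero = int(np.sum(data.sum(axis=0) == 0))
    return n_zero, 100.0 * n_zero / data.shape[1]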


Chapter 5

Concluding Remarks

5.1 Limitations and Future Works

This research mainly focused on synthesizing discrete EHR data, using patients' diagnosis and procedure data as a case study; however, the approach can be applied to any type of discrete EHR data, because we did not use any diagnosis-specific or procedure-specific knowledge during GAN training. Future work will be to collect other types of EHR data in the healthcare domain and to apply our method to generate realistic synthetic data beyond those generated in this study.

As noted in the evaluation results on synthetic data, the proposed medWGAN did not perform as well as the baseline model, specifically in association rule mining. We observed that when the dataset contains many dimensions with a low average count (e.g., the NHIRD dataset), medWGAN cannot capture the data distribution accurately. We will further investigate this issue and try to enhance the model's capability.

When the dataset is larger (e.g., the NHIRD dataset), our proposed method takes longer to train the model, although it can still complete the computation within a realistic timeframe. In the future, we will work to reduce this computation time. Another future effort is to develop a publicly available, web-based automated tool that facilitates the creation of synthetic data more easily and simply.
