
AI-based Synthetic Data Generation

• Synthetic data are either free or inexpensive in terms of both time and money: once the synthetic environment is ready, producing synthetic data is more cost-effective and faster than collecting real data.

• Where real data do not exist, typically when training and testing new systems, synthetic data are the only solution.

• Where real data are insufficient to guarantee system performance, synthetic data can be generated to cover every relevant situation.

• Synthetic data aim to preserve the multivariate relationships between variables rather than specific statistics alone.

• Synthetic data can have perfectly accurate labels, including labeling that may be very expensive or impossible to obtain by hand.

2.4 AI-based Synthetic Data Generation

A variety of synthetic data generation (SDG) methods and enterprise-level tools have been developed across a wide range of domains. In Section 1.2 (Related Works) of the Introduction, we reviewed earlier and recent research on SDG. The following subsections describe the AI-based SDG methods – generative adversarial networks (GANs), Wasserstein GAN with gradient penalty (WGAN-GP), and boundary-seeking GAN (BGAN) – and the baseline method, medical GAN (medGAN), which are the foundational frameworks of our proposed models.

2.4.1 Generative adversarial networks (GANs)

The generative adversarial network (GAN) framework by Ian J. Goodfellow et al. was introduced at the NIPS 2014 conference [29]. Yann LeCun, Director of AI Research at Facebook, remarked that GANs "and the variations that are now being proposed is the most interesting idea in the last 10 years in ML, in my opinion."

The conceptual idea of a GAN architecture is shown in Figure 2.1. The main idea of GANs, as described by the authors, is to train two neural networks: a generator G, which generates synthetic (fake) samples from random noise, and a discriminator D, which classifies whether a given sample originates from the original data (real) or from the generator (fake). The training goal of G is to fool D into believing that the generated samples are real; D, in turn, is trained on both real and fake samples so that it can identify the samples from G as fake. Through the competition between these two networks (hence "adversarial"), a GAN framework can produce realistic synthetic samples. This framework resembles a two-player minimax game [29, 50].

Fig. 2.1 The conceptual idea of GAN architecture.
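As a minimal illustration of these two roles, the generator can be any function mapping noise to samples and the discriminator any function mapping samples to a probability of being real. The following numpy sketch is hypothetical (it is not the networks used in this work): a shift-by-μ "generator" and a logistic "discriminator" on 1-D data.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, mu=0.0):
    """Toy 'generator': shifts input noise z by an offset mu."""
    return z + mu

def discriminator(x, a=1.0, b=0.0):
    """Toy 'discriminator': logistic model giving P(x is real)."""
    return 1.0 / (1.0 + np.exp(-(a * x + b)))

real = rng.normal(3.0, 1.0, size=1000)     # real samples, centered at 3
fake = generator(rng.normal(size=1000))    # fake samples, centered at 0

# Even this untrained discriminator scores the real samples higher,
# because its decision function happens to increase with x.
print(discriminator(real).mean(), discriminator(fake).mean())
```

A trained discriminator would adapt a and b to separate the two sets; a trained generator would adapt μ to close the gap.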

A commonly used analogy is that the generator (G) is akin to a forger (criminal) trying to produce counterfeit money and that the discriminator (D) is akin to the police attempting to detect the counterfeit money. The objective of the criminal is to counterfeit money such that the police cannot discriminate the counterfeit money from real money. By contrast, the police want to detect the counterfeit money as best as possible. Formally, the minimax game

between G and D with the value function V(D, G) is as follows:

\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (2.1)
\]

where p_data is the data distribution and p_z is a simple noise distribution (e.g., a uniform distribution or a spherical Gaussian). Initially, G accepts a random prior z ∼ p_z and generates synthetic samples for D to judge. G is then fine-tuned (its parameters are updated) using the error signal from D through back-propagation. D and G iteratively optimize their respective parameters θ_d and θ_g as follows:

\[
\theta_d \leftarrow \theta_d + \alpha \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \Big[ \log D\big(x^{(i)}\big) + \log\Big(1 - D\big(G(z^{(i)})\big)\Big) \Big]
\]
\[
\theta_g \leftarrow \theta_g - \alpha \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\Big(1 - D\big(G(z^{(i)})\big)\Big)
\]

where m is the mini-batch size and α the step size. An alternative, compact depiction of the GAN, equivalent to Figure 2.1, is shown in Figure 2.4a.
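The alternating updates above can be sketched on a one-dimensional toy problem. The following numpy example is illustrative only (hand-derived gradients, not the models of this thesis): the generator shifts standard Gaussian noise by an offset mu, the discriminator is logistic, D(x) = σ(ax + b), and both follow the mini-batch updates with step size lr.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

a, b = 0.0, 0.0   # discriminator parameters (theta_d)
mu = 0.0          # generator parameter (theta_g); real data sit at mean 3
lr, m = 0.05, 64  # step size alpha and mini-batch size m

for _ in range(2000):
    x = rng.normal(3.0, 1.0, size=m)   # real mini-batch
    g = rng.normal(size=m) + mu        # fake mini-batch G(z) = z + mu
    dx, dg = sigmoid(a * x + b), sigmoid(a * g + b)
    # gradient ascent on theta_d for log D(x) + log(1 - D(G(z)))
    a += lr * np.mean((1 - dx) * x - dg * g)
    b += lr * np.mean((1 - dx) - dg)
    # gradient descent on theta_g: d/dmu log(1 - D(G(z))) = -D(g) * a
    mu -= lr * np.mean(-dg * a)

print(mu)  # drifts away from 0, toward the real mean 3
```

As the generator's samples approach the real distribution, the discriminator's gradient signal weakens and the offset settles near the real mean, mirroring the equilibrium argument in [29].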

2.4.2 Wasserstein GAN with gradient penalty (WGAN-GP)

The authors of the WGAN-GP model [45] observed that the previously developed Wasserstein GAN (WGAN) [51] facilitates stable training but generates low-quality samples or fails to converge in some settings, owing to its weight-clipping technique. To overcome these issues, they proposed an alternative to weight clipping called gradient penalty, which penalizes the norm of the gradient of the discriminator (critic) with respect to its input. To demonstrate this, the authors trained WGAN critics with weight clipping and with gradient penalty to optimality on several toy distributions.

They showed that the gradient penalty in WGANs does not exhibit the undesired behavior of weight clipping: weight clipping fails to capture higher moments of the data distribution, and its gradient norms explode or vanish during training. The results are shown in Figure 2.2.

(a) Value surfaces of WGAN critics trained to optimality on toy datasets using (top) weight clipping and (bottom) gradient penalty. Critics trained with weight clipping fail to capture higher moments of the data distribution. The 'generator' is held fixed at the real data plus Gaussian noise.

(b) (left) Gradient norms of deep WGAN critics during training on toy datasets either explode or vanish when using weight clipping, but not when using a gradient penalty. (right) Weight clipping (top) pushes weights towards two values (the extremes of the clipping range), unlike gradient penalty (bottom).

Fig. 2.2 Gradient penalty in WGANs does not exhibit undesired behavior like weight clipping.

2.4.3 Boundary-Seeking GAN (BGAN)

In the GAN-based approach, a generator is trained to match a target distribution that converges toward the true data distribution as the discriminator is optimized. Under this interpretation, the learning objective of the generator is to minimize the difference between the discriminator's log-probabilities of a sample being positive (real) and negative (fake). In the boundary-seeking GAN (BGAN) [44], this objective is interpreted as training the generator to produce samples that lie on the decision boundary of the current discriminator at each update; hence, a GAN trained with this algorithm is called a BGAN.
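One common form of this boundary-seeking generator objective (for a Bernoulli discriminator, and stated here as a sketch rather than the exact loss of [44]) is ½(log D(G(z)) − log(1 − D(G(z))))², which is minimized when D(G(z)) = 1/2, i.e., exactly on the decision boundary:

```python
import numpy as np

def bgan_generator_loss(d_out):
    """Boundary-seeking loss: squared gap between the discriminator's
    log-probabilities of 'real' and 'fake' for a generated sample."""
    return 0.5 * (np.log(d_out) - np.log(1.0 - d_out)) ** 2

d_grid = np.linspace(0.01, 0.99, 99)   # possible discriminator outputs
losses = bgan_generator_loss(d_grid)
best = d_grid[np.argmin(losses)]
print(best)  # 0.5: the loss vanishes on the decision boundary
```

Unlike maximizing log D(G(z)), which keeps rewarding the generator for pushing D toward 1, this loss pulls generated samples only as far as the boundary D = 1/2.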

(a) Early stage of learning (b) Late stage of learning

Fig. 2.3 Qualitative comparison between the conventional GAN and the proposed BGAN in 1-D examples.

The BGAN authors qualitatively compared the conventional GAN and the proposed BGAN, as shown in Figure 2.3. They considered a one-dimensional variable and drew 20 samples each for the real and generated sets from two Gaussian distributions, considering two cases: (a) an early stage of learning and (b) a late stage of learning. The centers of the two Gaussians were –2 and 2 in the first case and –0.1 and 0.1 in the second; both variances were set to 0.3.

As shown in Figure 2.3a, the authors plotted both the real and generated samples on the x-axis (y = 0). The solid red curve corresponds to the discriminator D(x), and its log-gradient (∂ log D(x)/∂x) is drawn as a dashed red curve. Maximizing log D(x), as conventionally done with a GAN, clearly pushes the generator beyond the real samples (orange circles).

On the other hand, the proposed criterion of BGAN has its minimum at the decision boundary
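The toy setup behind Figure 2.3 can be reproduced directly. In this numpy sketch the seed, and which center is labeled "real" versus "generated", are arbitrary choices of ours, not taken from [44]:

```python
import numpy as np

rng = np.random.default_rng(42)
std = np.sqrt(0.3)                      # both variances are 0.3

def toy_samples(center_real, center_fake, n=20):
    """Draw n real and n generated 1-D samples from two Gaussians."""
    real = rng.normal(center_real, std, size=n)
    fake = rng.normal(center_fake, std, size=n)
    return real, fake

early = toy_samples(-2.0, 2.0)          # (a) early stage: well separated
late = toy_samples(-0.1, 0.1)           # (b) late stage: heavy overlap

print(early[0].mean(), early[1].mean())
```

In case (a) the two sets barely overlap, so the discriminator is confident and the two objectives behave very differently; in case (b) the sets overlap heavily, the setting where boundary-seeking keeps the generator near D = 1/2.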
