Difference-Seeking Generative Adversarial Network: Unseen Data Generation

Master's Thesis, Graduate Institute of Communication Engineering, College of Electrical Engineering and Computer Science, National Taiwan University

Yi-Lin Sung (宋易霖)

Advisor: Soo-Chang Pei (貝蘇章), Ph.D.

May 2019

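Acknowledgements

I am grateful to everyone I met during these two years of my master's studies. You showed me that even though my paper submissions kept being rejected, the journey was still a fulfilling one.

First, I would like to thank Professor Soo-Chang Pei. Thank you for being willing to accept a chemical engineering student into your lab. Your patient guidance helped me grow, and your tireless diligence was the best example I could have had. After graduation I will carry the same spirit with me and work hard to become someone who contributes to the technology industry, although the first step, of course, is still to get a paper accepted... I wish you a happy and active retirement!

I also want to thank Professor Hung-Yi Lee (李宏毅). Taking two of his courses opened the door to deep learning for me. Later, as his teaching assistant, I came to admire his insistence on rigor and his keen observation of experiments. I hope we can stay in touch as I continue down this path.

Much of the credit for this thesis goes to my senior labmate Sung-Hsien Hsieh (謝松憲). Beyond that, our weekly discussions during my second year were without doubt the main reason for my progress. He taught me how much mathematical foundations matter in research, which is why I took several math courses that were painful in hindsight but very rewarding. I also deeply admire his research ideas and his command of the details. We must get this work published at a top AI conference!

Of course, I also thank my classmates and juniors in the lab. Thank you for helping me through DSP; I dread to think what would have happened otherwise. Thank you also for putting up with someone as slow to warm up as I am. It was only near graduation that I felt I had become close to everyone, which is a bit of a pity. I do not know how often we will meet in the future, but if you ever need me and I am able to help, you will have my full support.

Finally, in the time-honored way, I thank my family and my girlfriend for always believing in me even though I am clearly not that strong. I will keep working toward my goals and give back to society and to you.

Written on July 15, 2019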

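Chinese Abstract

Unseen data broadly refers to data that do not fall within the distribution of the training data. Such data are important in several applications, such as semi-supervised learning, improving the robustness of networks, and novelty detection.

Unseen data are usually hard to obtain, but an algorithm that could generate them for use during training would substantially strengthen the model. How to generate such data is therefore a common research topic. Different applications often require different kinds of unseen data, and different methods currently exist for each application. In this thesis, we propose an algorithm, the Difference-Seeking Generative Adversarial Network, that can generate various kinds of unseen data. We observe that the distribution of unseen data is often the difference between two known distributions, and that data from these two known distributions are relatively easy to collect and can even be derived from the training data. We apply the Difference-Seeking Generative Adversarial Network to semi-supervised learning, to improving the robustness of deep networks, and to novelty detection, and the experimental results show that our method is effective. In addition, we provide a theoretical proof that guarantees the convergence of the algorithm.

Keywords: difference-set learning, generative adversarial network, semi-supervised learning, robust deep networks, novelty detection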

Abstract

Unseen data, which are not samples from the distribution of the training data and are difficult to collect, have proven important in many applications (e.g., novelty detection, semi-supervised learning, and adversarial training). In this paper, we introduce a general framework, called Difference-Seeking Generative Adversarial Network (DSGAN), to create various kinds of unseen data. The novelty is to consider the probability density of the unseen data distribution to be the difference between those of two distributions $p_{\bar{d}}$ and $p_d$, whose samples are relatively easy to collect. DSGAN can learn the target distribution $p_t$ (i.e., the unseen data distribution) using only samples from the two distributions $p_d$ and $p_{\bar{d}}$. In our scenario, $p_d$ is the distribution of seen data and $p_{\bar{d}}$ can be obtained from $p_d$ via simple operations, implying that we only need samples of $p_d$ during training. Three key applications, semi-supervised learning, increasing the robustness of neural networks, and novelty detection, are taken as case studies to illustrate that DSGAN can produce various kinds of unseen data. We also provide a theoretical analysis of the convergence of DSGAN.

Keywords: Difference-Seeking, Generative Adversarial Network, Semi-Supervised Learning, Robustness of Neural Network, Novelty Detection

List of Figures

1.1 Illustration of the differences between traditional GAN and DSGAN.

2.1 The workflow of GAN. (Source: https://medium.freecodecamp.org/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394)

3.1 Complement points (in green) between 2 circles (in orange).

3.2 Boundary points (in green) among 4 circles (in orange).

3.3 Illustration of generating unseen data on the boundary around the training data. First, the convolution of $p_d$ with a normal distribution makes the density on the boundary nonzero. Second, we seek $p_g$ such that Eq. (3.1) holds, where the support set of $p_g$ is approximated by the difference between the support sets of $p_{\bar{d}}$ and $p_d$.

3.4 Illustration of difference-set seeking in MNIST.

3.5 DSGAN learns the difference between two sets.

3.6 Influence of α on the synthetic dataset. In this example, $p_d$ is the orange rectangle (bounded by x ≤ 1, x ≥ −1, y ≥ −0.8, and y ≤ 0.8), and $p_{\bar{d}}$ is the rectangle obtained by shifting $p_d$ to the right by 1 unit (not shown in the figures). We can observe that $p_g$ (green points) moves farther away from $p_d$ as α increases. When α is 0.5, $p_g$ learns the exact difference between $p_{\bar{d}}$ and $p_d$. When α is 0.95, $p_g$ generates the rightmost points of $p_{\bar{d}}$. The contours show the output of the discriminator; the generator moves toward regions with higher scores. Note that the outputs of the discriminator are not restricted to [0, 1], because we use WGAN's structure in this experiment.

3.7 Extra 2D results for boundary sample generation. The orange points are data points, and the green points are generated points.

3.8 Difference-set generation for the CelebA dataset [1]. $p_{\bar{d}}$ consists of 20,000 images from the CelebA dataset, of which 1,000 show people with glasses and the others show people without glasses. $p_d$ consists of 19,000 images, all of people without glasses. In this case, our generator successfully learns to produce images of people with glasses. Note that α = 0.95.

5.1 Demonstration of an adversarial example. Adding a special noise to the panda image changes the prediction of the model to “gibbon”. Moreover, the noise in the adversarial example is imperceptible to humans.

6.1 Accuracy of the baseline and our models after attacks. The blue line indicates the first baseline model. The orange, green, and red lines denote the second baseline models with different ranges of uniform noise. The purple, brown, and pink lines indicate our methods. In the legend, the floating-point number (0.01, 0.03, or 0.05) also indicates the variance of the noise, and “w1” means that w in (6.3) is set to 1. “epsilon” means the $\ell_2$ (or $\ell_\infty$) norm of the difference between the original image (pixel values are normalized to the range [−0.5, 0.5]) and the corresponding adversarial example.

6.2 The setting is the same as in Fig. 6.1 except that w = 3.

6.3 The setting is the same as in Fig. 6.1 except that w = 10.

6.4 Comparison of the reconstruction results of VAE and our method. The seen class, shown in the bottom row of the images, is car. The other rows are images from unseen classes. Our method exhibits a larger gap in reconstruction error between seen and unseen data than VAE.

List of Tables

6.1 Semi-supervised learning results on MNIST with and without the sampling trick.

6.2 Comparison of semi-supervised learning between our DSGAN and state-of-the-art methods: CatGAN [2], Triple-GAN [3], FM [4], badGAN [5], and CT-GAN [6]. For a fair comparison, we only consider GAN-based methods. ∗ indicates the use of the same classifier architecture. † indicates a larger classifier architecture. ‡ indicates the use of data augmentation. The results for MNIST are reported as the number of errors, while the others are reported as error percentages.

6.3 Hyperparameters in semi-supervised learning.

6.4 Network architectures for semi-supervised learning on MNIST. (GN: Gaussian noise)

6.5 The architectures of the generator and discriminator for semi-supervised learning on SVHN and CIFAR-10. N was set to 128 and 192 for SVHN and CIFAR-10, respectively.

6.6 The architecture of the classifiers for semi-supervised learning on SVHN and CIFAR-10. (GN: Gaussian noise, lReLU(leak rate): LeakyReLU(leak rate))

6.7 The architecture of the classifier for robustness enhancement of deep networks on CIFAR-10. (lReLU(leak rate): LeakyReLU(leak rate))

6.8 Comparison between our method (VAE+DSGAN) and state-of-the-art methods: VAE [7], AND [8], DSVDD [9], and OCGAN [10]. The results for CIFAR-10 are reported in terms of AUC value. The number in the top row indicates the seen class, where 0: Plane, 1: Car, 2: Bird, 3: Cat, 4: Deer, 5: Dog, 6: Frog, 7: Horse, 8: Ship, 9: Truck.

6.9 The architectures of the generator and discriminator in DSGAN for novelty detection.

6.10 The architectures of VAE for novelty detection.

Chapter 1 Introduction

Unseen data are not samples from the distribution of the training data and are difficult to collect. It has been demonstrated that unseen samples are useful in several applications. [5] proposed a way to create complement data and showed theoretically that complement data, considered as unseen data, can improve semi-supervised learning. In novelty detection, [11] proposed a method to generate unseen data and used them to train an anomaly detector. Another related issue is adversarial training [12], where classifiers are trained to resist adversarial examples, which are unseen during the training phase. However, each of the aforementioned methods focuses on producing one specific kind of unseen data rather than general types of unseen data.

In this paper, we propose a general framework, called DSGAN, to generate a variety of unseen data. DSGAN is a generative approach. Traditionally, generative approaches, which are usually trained in an unsupervised manner, are developed to learn a data distribution from its samples and then to produce novel, high-dimensional samples, such as synthesized images, from the learned distribution [13]. The state-of-the-art approach is the Generative Adversarial Network (GAN) [14]. GAN produces sharp images based on a game-theoretic framework, but it can be tricky and unstable to train due to multiple interacting losses. Specifically, GAN consists of two functions, a generator and a discriminator, both represented as parameterized neural networks. The discriminator network is trained to classify whether an input belongs to the real data set or to the fake data set created by the generator. The generator learns to map a sample from a latent space to some distribution so as to increase the classification error of the discriminator.

Nevertheless, if we aim to learn a generator that creates unseen data, traditional GAN requires plenty of training samples from the unseen classes, which contradicts the definition of unseen data. This fact motivates us to present DSGAN, which can generate unseen data by taking seen data as training samples (see Fig. 1.1, which illustrates the difference between GAN and DSGAN). The key idea is to consider the distribution of unseen data as the difference between two distributions, both of which are relatively easy to obtain. For example, out-of-distribution examples for the MNIST dataset can be viewed as belonging to the difference between the universal set and the set of examples in MNIST. It should be noted that in traditional GAN the target distribution is equal to the training data distribution, whereas in DSGAN these two distributions are considered different.

In this paper, we make the following contributions:

(1) We propose DSGAN to generate any unseen data, provided that the density of the target (unseen data) distribution is the difference between those of two distributions, $p_{\bar{d}}$ and $p_d$. By contrast, traditional GAN cannot learn the difference between two distributions.

(2) We show that DSGAN is flexible enough to learn different target (unseen data) distributions in three key applications: semi-supervised learning, increasing the robustness of neural networks, and novelty detection. Specifically, for novelty detection, DSGAN can produce boundary points around seen data because this kind of unseen data is easily misclassified. DSGAN also generates boundary samples to increase the robustness of neural networks, but with the distance measured in the $\ell_\infty$ norm. For semi-supervised learning, the unseen data are linear combinations of labeled and unlabeled data, excluding the labeled and unlabeled data themselves¹.

(3) Our theoretical analysis shows that, with enough capacity of the generator and the discriminator, the generator can learn the target distribution $p_t$, whose support set is the difference between the support sets of $p_{\bar{d}}$ and $p_d$, under mild conditions.

¹ A linear combination of labeled and unlabeled data may itself belong to the set of seen data (labeled and unlabeled data), which contradicts the definition of unseen data. Thus, the samples generated by DSGAN should not include the seen data themselves.

Figure 1.1: Illustration of the differences between traditional GAN and DSGAN.


Chapter 2 Background

2.1 Deep Generative Model

A deep generative model learns the underlying data distribution $P_X$ from limited training data $X$ via a neural network. The learned data distribution can be applied to several scenarios, e.g., classification, representation learning, and compressed sensing. One can view learning a generative model as teaching a machine to understand the world. Several well-known families of deep generative models exist, including Variational Autoencoders (VAE) ([7]), Generative Adversarial Networks (GAN) ([14]), autoregressive models ([15]), and normalizing flow models ([16]). In this thesis, I focus on GANs.

2.2 Generative Adversarial Network

GAN has been one of the most popular generative frameworks in recent years and has been successfully applied to various applications ([17], [18], [19]). GAN sets up a minimax game between the generator and the discriminator. The generator tries to learn the data distribution in order to fool the discriminator, while the discriminator learns to distinguish whether its input comes from the data distribution or from the generator.

To learn the generator, one has to define a prior distribution $p_z$ on the input variable $z$; in most cases $p_z$ is $U(0, 1)$ or $N(0, 1)$. The generator is a differentiable mapping function $G$ with output $G(z; \theta_g)$. The goal of the generator is to find an optimal $\theta_g$ such that the distribution of $G(z; \theta_g)$ (that is, $p_g$) equals the data distribution $p_X$. The discriminator is defined as $D(x; \theta_d)$, which outputs a scalar value. $D(x)$ can be viewed as the probability that $x$ comes from $p_X$. In other words, the discriminator is a binary classifier that distinguishes whether its input comes from $p_X$ or $p_g$. The workflow of GAN is shown in Fig. 2.1.

One trains $D$ to maximize the probability of assigning 1 to real samples $x$ and 0 to generated samples $G(z)$ (denoted $x_g$). Meanwhile, $G$ is trained to maximize $D(x_g)$. This procedure can be treated as a minimax game between $G$ and $D$. More specifically, the objective function of GAN is

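$$F(G, D) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \tag{2.1}$$

where $p_d$ denotes the data distribution $p_X$. One can optimize (2.1) by alternately updating $G$ and $D$, that is,

$$\min_G \max_D F(G, D).$$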

Under mild assumptions, optimizing (2.1) is equivalent to minimizing the Jensen-Shannon divergence between $p_X$ and $p_g$. The global optimum is reached if and only if $p_g = p_X$. The detailed proof is in [14].

In (2.1), the generator should in principle be updated only after the discriminator has been trained to optimality. However, this training procedure is inefficient. Therefore, [14] suggests that the discriminator only needs a constant number (e.g., 5) of update steps per generator update step. [20] points out that using a higher learning rate for the discriminator achieves a similar result. With this technique, we can alternately train $G$ and $D$ for one iteration each, which shortens the training time.
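As a rough illustration of how (2.1) is evaluated in practice, the following sketch estimates $F(G, D)$ from discriminator outputs on a minibatch. The function name `gan_value` and the example scores are hypothetical and chosen only for illustration; they are not taken from the experiments in this thesis.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of F(G, D) in (2.1).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives F = log(0.5) + log(0.5) = -log(4).
undecided = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# A discriminator that separates the two sets well pushes F toward 0.
confident = gan_value([0.99, 0.98, 0.97], [0.01, 0.02, 0.03])
```

The discriminator tries to raise this value, while the generator tries to lower it by making $D(G(z))$ large.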

2.3 Wasserstein GAN

Figure 2.1: The workflow of GAN. (Source: https://medium.freecodecamp.org/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394)

It has been shown that training GAN suffers from several problems: an unstable training process, vanishing gradients, mode collapse in the generator, etc. Many works try to address these issues. [21] proposed convolutional architectures for both the generator and the discriminator to stabilize training. [22] defines an alternative loss function, and the resulting GAN not only generates high-quality images but also has a more stable training process. [23] mathematically analyzes the training dynamics of GAN and states that the instability of GAN comes from the objective function. Based on [24], [25] proposed Wasserstein GAN with a modified objective function,

$$W(G, D) = \mathbb{E}_{x \sim p_d}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]. \tag{2.2}$$

As in training GAN, we update $G$ and $D$ to minimize and maximize (2.2), respectively, that is,

$$\min_G \max_{\|D\|_L \le 1} W(G, D),$$

where $\|D\|_L \le 1$ denotes that $D$ is 1-Lipschitz continuous.

Training WGAN can be viewed as minimizing the Wasserstein distance (Earth-Mover distance) between $p_X$ and $p_g$. WGAN has been shown to be more stable and to cover more modes of the training data. For the reasons why the Wasserstein distance works better than the Jensen-Shannon divergence in training GAN, refer to [24] and [25].

For WGAN to succeed, one has to enforce the Lipschitz continuity of the discriminator. In [25], the authors use weight clipping to enforce the condition. However, this method may lead to some undesired behavior. [26] instead constrains the gradient norm of the discriminator's output with respect to its input, and the resulting GAN is called WGAN-GP (GP stands for gradient penalty).
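The two ingredients described above, the objective (2.2) and weight clipping, can be sketched as follows. This is a minimal illustration in plain Python; the function names and the clipping threshold `c = 0.01` are assumptions made for the example, not settings reported in this thesis.

```python
def wgan_value(critic_real, critic_fake):
    """Monte Carlo estimate of W(G, D) in (2.2) from unbounded critic scores."""
    mean_real = sum(critic_real) / len(critic_real)
    mean_fake = sum(critic_fake) / len(critic_fake)
    return mean_real - mean_fake

def clip_weights(weights, c=0.01):
    """Weight clipping: project every parameter onto the interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]

value = wgan_value([1.0, 3.0], [0.0, 1.0])    # 2.0 - 0.5
clipped = clip_weights([0.5, -0.2, 0.005])    # large weights are cut to +/- c
```

Unlike the outputs used in (2.1), the critic scores here are not probabilities, so no logarithm is applied.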


2.4 Semi-Supervised Learning with GANs

Please refer to Sec. 5.1.

2.5 Robustness Issues of Neural Networks

2.6 Novelty Detection by Reconstruction Method

2.7 Related Works

We introduce related works about generating unseen data.

[11] proposed a method to generate samples of unseen classes in an unsupervised manner via an adversarial learning strategy. However, it requires solving an optimization problem for each sample, which leads to a high computational cost. In contrast, DSGAN can create an unlimited number of diverse unseen samples. [27] presented a new GAN architecture that can learn the distribution of unseen data from a portion of the seen data together with unlabeled data. However, the unlabeled data must be a mixture of seen and unseen samples, whereas DSGAN does not require any unseen data. [5] aims to generate complementary samples (or out-of-distribution samples) but assumes that the in-distribution density can be estimated by a pretrained model such as PixelCNN++, which might be difficult and expensive to train. [28] uses a simple classifier to replace the role of PixelCNN++ in [5] so that training is much easier and more suitable. Nevertheless, their method only focuses on generating unseen data in the low-density areas surrounding the seen data, whereas DSGAN is more flexible and can generate different kinds of unseen data (e.g., the linear combinations of seen data described in Sec. 6.1). In addition, their method needs the label information of the data, while ours is fully unsupervised.

Related works on semi-supervised learning and on enhancing the robustness of neural networks are presented in Sec. 5.

Chapter 3 Proposed Method: DSGAN

3.1 Formulation

We denote the generator distribution by $p_g$ and the training data distribution by $p_d$, both in an $N$-dimensional space. Let $p_{\bar{d}}$ be a distribution chosen by the user. For example, $p_{\bar{d}}$ can be the convolution of $p_d$ with a normal distribution. Let $p_t$ be the target distribution in which the user is interested, which is expressed as

$$(1 - \alpha)\, p_t(x) + \alpha\, p_d(x) = p_{\bar{d}}(x), \tag{3.1}$$

where $\alpha \in [0, 1]$. Our method, DSGAN, aims to learn $p_g$ such that $p_g = p_t$. Note that if the support set of $p_d$ is contained in that of $p_{\bar{d}}$, then there exists at least one $\alpha$ such that the equality in (3.1) holds. However, even when the equality does not hold, DSGAN intuitively tries to learn $p_g$ such that

$$p_g(x) \sim \frac{p_{\bar{d}}(x) - \alpha\, p_d(x)}{1 - \alpha}$$

subject to the constraint $p_g(x) \ge 0$. In other words, the generator outputs samples located in high-density areas of $p_{\bar{d}} - \alpha p_d$. Furthermore, we show in Proposition 2 that DSGAN can learn $p_g$ whose support set is the difference between those of $p_{\bar{d}}$ and $p_d$.
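To make (3.1) concrete, the sketch below solves it for $p_t$ on a finite set of bins. The function `target_density` is hypothetical, and clipping negative values to zero followed by renormalization is an assumption of this sketch that mirrors the constraint $p_g(x) \ge 0$; it is not a procedure stated in the thesis.

```python
def target_density(p_dbar, p_d, alpha):
    """Solve (3.1) for p_t on a finite set of bins.

    Bins where p_dbar - alpha * p_d is negative are clipped to zero and the
    result is renormalized (an assumption of this sketch). Renormalizing also
    absorbs the 1 / (1 - alpha) factor.
    """
    raw = [max(b - alpha * d, 0.0) for b, d in zip(p_dbar, p_d)]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("p_dbar - alpha * p_d has no positive mass")
    return [r / total for r in raw]

# Two bins standing for digit "1" and digit "7".
p_d = [1.0, 0.0]       # seen data: only "1"
p_dbar = [0.5, 0.5]    # user-chosen distribution: "1" and "7" equally likely
alpha = 0.5
p_t = target_density(p_dbar, p_d, alpha)   # all mass moves to the "7" bin
```

In this toy case the equality in (3.1) holds exactly, so no clipping is needed and $p_t$ places all of its mass on the bin that $p_d$ does not cover.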

First, we formulate the generator and discriminator in GANs. The inputs $z$ of the generator are drawn from $p_z(z)$ in an $M$-dimensional space. The generator function $G(z; \theta_g): \mathbb{R}^M \to \mathbb{R}^N$ represents a mapping to the data space, where $G$ is a differentiable function with parameters $\theta_g$. The discriminator is defined as $D(x; \theta_d): \mathbb{R}^N \to [0, 1]$, which outputs a single scalar. $D(x)$ can be considered the probability that $x$ belongs to the class of real data.

Similar to traditional GAN, we train $D$ to distinguish the real data from the fake data sampled from $G$. Meanwhile, $G$ is trained to produce data that are as realistic as possible in order to mislead $D$. In DSGAN, however, the definitions of “real data” and “fake data” differ from those in traditional GAN. The samples from $p_{\bar{d}}$ are considered real, whereas those from the mixture distribution of $p_d$ and $p_g$ are considered fake. The objective function is defined as follows:

$$V(G, D) := \mathbb{E}_{x \sim p_{\bar{d}}(x)}[\log D(x)] + (1 - \alpha)\, \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] + \alpha\, \mathbb{E}_{x \sim p_d(x)}[\log(1 - D(x))]. \tag{3.2}$$

We optimize (3.2) through a minimax game between $G$ and $D$, that is,

$$\min_G \max_D V(G, D).$$

During training, an iterative approach like that of traditional GAN alternates between $k$ steps of training $D$ and one step of training $G$. In practice, minibatch stochastic gradient descent via backpropagation is used to update $\theta_d$ and $\theta_g$. In other words, for each of $p_g$, $p_d$, and $p_{\bar{d}}$, $m$ samples are required for computing gradients, where $m$ is the number of samples in a minibatch. The training procedure is given in Algorithm 1. DSGAN suffers from the same drawbacks as traditional GAN (e.g., mode collapse, overfitting, and an overly strong discriminator that makes the generator gradient vanish). Several works [4, 24, 29] focus on dealing with these problems, and their ideas can be readily incorporated into DSGAN.

[3] and [30] proposed objective functions similar to (3.2). Their goal is to learn the conditional distribution of the training data. In contrast, we aim to learn the target distribution $p_t$ in Eq. (3.1), not the training data distribution.

Algorithm 1 The training procedure of DSGAN using minibatch stochastic gradient descent. $k$ is the number of steps applied to the discriminator. $\alpha$ is the ratio between $p_g$ and $p_d$ in the mixture distribution. We used $k = 1$ and $\alpha = 0.8$ in experiments.

01. for number of training iterations do
02.   for $k$ steps do
03.     Sample a minibatch of $m$ noise samples $z^{(1)}, \ldots, z^{(m)}$ from $p_z(z)$.
04.     Sample a minibatch of $m$ samples $x_d^{(1)}, \ldots, x_d^{(m)}$ from $p_d(x)$.
05.     Sample a minibatch of $m$ samples $x_{\bar{d}}^{(1)}, \ldots, x_{\bar{d}}^{(m)}$ from $p_{\bar{d}}(x)$.
06.     Update the discriminator by ascending its stochastic gradient:

        $$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\big(x_{\bar{d}}^{(i)}\big) + (1 - \alpha) \log\big(1 - D\big(G\big(z^{(i)}\big)\big)\big) + \alpha \log\big(1 - D\big(x_d^{(i)}\big)\big) \right]$$

07.   end for
08.   Sample a minibatch of $m$ noise samples $z^{(1)}, \ldots, z^{(m)}$ from $p_z(z)$.
09.   Update the generator by descending its stochastic gradient:

        $$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\big(1 - D\big(G\big(z^{(i)}\big)\big)\big)$$

10. end for
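The quantities whose gradients are taken in steps 06 and 09 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `D` and `G` are hypothetical one-dimensional stand-ins (a logistic function and a shift), not the networks used in the experiments, and the gradient updates themselves are left to an automatic differentiation framework.

```python
import math

def discriminator_objective(D, G, x_dbar, x_d, z, alpha):
    """Minibatch estimate of V(G, D) in (3.2), which D ascends (step 06)."""
    m = len(z)
    total = 0.0
    for xb, xd, zi in zip(x_dbar, x_d, z):
        total += math.log(D(xb))                            # "real": sample of p_dbar
        total += (1.0 - alpha) * math.log(1.0 - D(G(zi)))   # "fake": generated sample
        total += alpha * math.log(1.0 - D(xd))              # "fake": sample of p_d
    return total / m

def generator_objective(D, G, z):
    """Minibatch estimate of the term that G descends (step 09)."""
    return sum(math.log(1.0 - D(G(zi))) for zi in z) / len(z)

# Hypothetical 1-D stand-ins: a logistic discriminator and a shift generator.
D = lambda x: 1.0 / (1.0 + math.exp(-x))
G = lambda z: z - 1.0

x_dbar = [2.0, 1.5, 2.5]    # minibatch from p_dbar
x_d = [-2.0, -1.5, -2.5]    # minibatch from p_d
z = [0.0, 0.5, -0.5]        # minibatch of noise
v = discriminator_objective(D, G, x_dbar, x_d, z, alpha=0.8)
g = generator_objective(D, G, z)
```

Note that samples of $p_d$ enter only the discriminator objective; the generator never sees them directly and is steered away from them through $D$.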

3.2 Case Study on Synthetic Data and MNIST

3.2.1 Case Study on Various Unseen Data Generation

To gain a more intuitive understanding of DSGAN, we conduct several case studies on 2D synthetic datasets and MNIST, using α = 0.8 in Eq. (3.1).

Figure 3.1: Complement points (in Green) between 2 circles (in Orange).

Figure 3.2: Boundary points (in Green) among 4 circles (in Orange).

Figure 3.3: Illustration of generating unseen data on the boundary around the training data. First, the convolution of $p_d$ with a normal distribution makes the density on the boundary nonzero. Second, we seek $p_g$ such that Eq. (3.1) holds, where the support set of $p_g$ is approximated by the difference between the support sets of $p_{\bar{d}}$ and $p_d$.

Complement samples generation. Fig. 3.1 illustrates that DSGAN is able to generate complement samples between two circles. Taking the density function of the two circles as $p_d$, we let the samples drawn from $p_{\bar{d}}$ be linear combinations of points from the two circles. Then, by applying DSGAN, we achieve our goal of generating complement samples. In fact, this kind of unseen data is used in semi-supervised learning.
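One way to draw such samples from $p_{\bar{d}}$ is sketched below. The text only says “linear combinations”; using a convex combination with a uniformly random weight is an assumption of this sketch, and the function name and the point sets are hypothetical.

```python
import random

def sample_combination(points_a, points_b, rng):
    """Draw one sample of p_dbar as a random convex combination of two data points."""
    a = rng.choice(points_a)
    b = rng.choice(points_b)
    lam = rng.random()  # mixing weight in [0, 1); uniform weighting is an assumption
    return tuple(lam * ai + (1.0 - lam) * bi for ai, bi in zip(a, b))

rng = random.Random(0)
cluster_a = [(-1.0, 0.0), (-1.0, 0.5)]   # a few points standing in for the first circle
cluster_b = [(1.0, 0.0), (1.0, 0.5)]     # a few points standing in for the second circle
samples = [sample_combination(cluster_a, cluster_b, rng) for _ in range(5)]
```

Every sample lies on a segment joining a point of the first set to a point of the second, so $p_{\bar{d}}$ covers the region between the two sets as well as points close to them.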

Boundary samples generation. Fig. 3.2 illustrates that DSGAN generates boundary points among four circles. This kind of unseen data is used in novelty detection. In this case, we set $p_d$ to “the density function of the four circles” and $p_{\bar{d}}$ to “the convolution of $p_d$ with a normal distribution.” The intuition behind this idea is also illustrated by a 1D example in Fig. 3.3.
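Sampling from the convolution of $p_d$ with a normal distribution does not require computing the convolution explicitly: the density of the sum of two independent random variables is the convolution of their densities, so it suffices to add Gaussian noise to a data point. A minimal sketch follows, in which the function name and the noise scale `sigma` are hypothetical choices for illustration.

```python
import random

def sample_p_dbar(data, sigma, rng):
    """Draw one sample from p_d convolved with N(0, sigma^2 I).

    A data point is picked at random and independent Gaussian noise is added
    to each coordinate.
    """
    x = rng.choice(data)
    return tuple(xi + rng.gauss(0.0, sigma) for xi in x)

rng = random.Random(0)
data = [(1.0, 2.0), (1.5, 2.5), (0.5, 1.0)]
noisy = [sample_p_dbar(data, sigma=0.1, rng=rng) for _ in range(4)]
```

A larger `sigma` spreads $p_{\bar{d}}$ farther from the data, which, according to the discussion of α and the support of $p_{\bar{d}}$, also lets the generated boundary points lie farther away.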

Difference-set generation. We also validate DSGAN on a high-dimensional dataset, MNIST. In this example, we define $p_d$ to be the distribution of digit “1” and $p_{\bar{d}}$ to be the distribution containing both digits “1” and “7”. Since the density $p_d(x)$ is high when $x$ is digit “1,” the generator is prone to output digit “7” with high probability. Difference-set generation is illustrated in Figs. 3.4 and 3.5.

From the above results, we can observe two properties of the generator distribution $p_g$: i) the higher the density $p_d(x)$, the lower the density $p_g(x)$; ii) $p_g$ prefers to output samples from high-density areas of $p_{\bar{d}}(x) - \alpha p_d(x)$.

In Chapter 4, we will show that optimizing the objective function is equivalent to minimizing the Jensen-Shannon divergence between the mixture distribution (of $p_d$ and $p_g$) and $p_{\bar{d}}$ if $G$ and $D$ are given enough capacity.


Figure 3.4: Illustration of differenceset seeking in MNIST.

Figure 3.5: DSGAN learns the difference between two sets.

3.3 Discussions about the objective function of DSGAN

There are two main issues in (3.1). The first is how α influences the learned distribution $p_g$. From the objective function, one can imagine that a larger α reduces the overlap between $p_d$ and $p_g$. However, will $p_g$ really be far away from $p_d$ if α is close to 1? Depending on the design of $p_{\bar{d}}$, the answer can be yes or no. Second, in some cases it is possible that $p_d$ is not fully contained in $p_{\bar{d}}$. In other words, $\frac{p_{\bar{d}}(x) - \alpha p_d(x)}{1 - \alpha}$ can be negative for some $x$ when α is large enough. In this section, we demonstrate that the negative part does not influence the learning of the generator.

We discuss both issues using Fig. 3.6.

The influence of α. In Fig. 3.6, the overlapping area between $p_{\bar{d}}$ and $p_d$ is 0.5 unit (taking the area of $p_d$ to be 1 unit), so α = 0.5 is the smallest choice that makes $p_g$ disjoint from $p_d$. When α = 0.8, the generated points are separated from the yellow points by a gap. In theory, the result for α = 0.8 should be the same as that for α = 0.5, since the discriminator should give the same score to the whole area that is outside $p_d$ but inside $p_{\bar{d}}$. However, due to the continuity of the discriminator (continuity is key to making GANs such as WGAN work), the score of the area right next to $p_d$ is lower than that of areas far from it when α is larger. For this reason, $p_g$ tends to keep away from $p_d$. At α = 0.95, we find that $p_g$ is still located inside $p_{\bar{d}}$. In $\frac{p_{\bar{d}}(x) - \alpha p_d(x)}{1 - \alpha}$, there must be some points where the expression is positive (when $p_{\bar{d}} \ne p_d$), because every density integrates to one, $\int_x p(x)\,dx = 1$. Therefore, $p_g$ generates such points. In this example, one can observe that $p_g$ is bounded by $p_{\bar{d}}$. When the support of $p_{\bar{d}}$ is much larger than that of $p_d$, an excessive α makes $p_g$ stay away from $p_d$. Conversely, one can use a smaller support of $p_{\bar{d}}$ to keep $p_g$ close to $p_d$.

Negative density. One can see that $p_d$ is not fully contained in $p_{\bar{d}}$ in this example: the rectangle bounded by x ≤ 0, x ≥ −1, y ≥ −0.8, and y ≤ 0.8 lies in $p_d$ but not in $p_{\bar{d}}$. However, in the case α = 0.5, we notice that the generator still generates the exact difference set even though some regions have negative density. This can be explained by an intrinsic property of the discriminator. By (3.2), the discriminator's output at points with negative density ($x'$) tends to 0. Looking at the first term in (3.2), since $x'$ is not in $p_{\bar{d}}$, no gradient pushes $D(x')$ up toward 1. $D(x') = 0$ is consistent with our goal that $p_g$ does not overlap with $p_d$. Therefore, although negative density exists, the objective is not affected.

3.4 Tricks for Stable Training

We provide a trick to stabilize the training procedure by reformulating the objective function. Specifically, $V(G, D)$ in (3.2) is reformulated as:

$$\begin{aligned}
V(G, D) &= \int_x \Big[ p_{\bar{d}}(x) \log D(x) + \big((1 - \alpha) p_g(x) + \alpha p_d(x)\big) \log(1 - D(x)) \Big]\, dx \\
&= \mathbb{E}_{x \sim p_{\bar{d}}(x)}[\log D(x)] + \mathbb{E}_{x \sim (1 - \alpha) p_g(x) + \alpha p_d(x)}[\log(1 - D(x))].
\end{aligned} \tag{3.3}$$

Instead of sampling a minibatch of $m$ samples from each of $p_z$ and $p_d$ as in Algorithm 1, $(1 - \alpha)m$ and $\alpha m$ samples are drawn from the two distributions, respectively. The computation cost of training is reduced because fewer samples are needed. Furthermore, although (3.3) is equivalent to (3.2) in theory, we find empirically that training with (3.3) achieves better performance than training with (3.2), as shown in Table 6.1. We conjecture that the equivalence between (3.3) and (3.2) relies on the linearity of expectation, whereas minibatch stochastic gradient descent in practical training may lead to different outcomes.
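The sampling scheme implied by (3.3) can be sketched as follows: the “fake” minibatch is a mixture of generated samples and samples of $p_d$ in the ratio $(1 - \alpha) : \alpha$. The function name is hypothetical, and rounding $\alpha m$ to the nearest integer is an assumption of this sketch.

```python
def mixed_fake_batch(generated, seen, m, alpha):
    """Build a fake minibatch of size m for (3.3).

    About alpha * m samples come from p_d (seen data) and the remaining
    (1 - alpha) * m come from the generator.
    """
    n_seen = round(alpha * m)
    n_generated = m - n_seen
    return generated[:n_generated] + seen[:n_seen]

# Tagged placeholders: "g" marks generated samples, "d" marks samples of p_d.
generated = [("g", i) for i in range(10)]
seen = [("d", i) for i in range(10)]
fake = mixed_fake_batch(generated, seen, m=10, alpha=0.8)
```

With $m = 10$ and $\alpha = 0.8$, the fake batch holds 2 generated samples and 8 samples of $p_d$, so the unweighted average of $\log(1 - D(x))$ over this batch estimates the second expectation in (3.3).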

3.5 Appendix: More Results for Case Study

Additional results for boundary samples generation and difference-set generation are presented in Fig. 3.7 and Fig. 3.8, respectively.

Figure 3.6: Influence of α on the synthetic dataset, with panels (a) α = 0.30, (b) α = 0.50, (c) α = 0.80, and (d) α = 0.95. In this example, $p_d$ is the orange rectangle (bounded by x ≤ 1, x ≥ −1, y ≥ −0.8, and y ≤ 0.8), and $p_{\bar{d}}$ is the rectangle obtained by shifting $p_d$ to the right by 1 unit (not shown in the figures). We can observe that $p_g$ (green points) moves farther away from $p_d$ as α increases. When α is 0.5, $p_g$ learns the exact difference between $p_{\bar{d}}$ and $p_d$. When α is 0.95, $p_g$ generates the rightmost points of $p_{\bar{d}}$. The contours show the output of the discriminator; the generator moves toward regions with higher scores. Note that the outputs of the discriminator are not restricted to [0, 1], because we use WGAN's structure in this experiment.

Figure 3.7: Extra 2D results for boundary sample generation, with panels (a) S shape with compact data points, α = 0.9; (b) S shape with scattered data points, α = 0.9; (c) 4 Gaussians, α = 0.9; and (d) 8 Gaussians, α = 0.8. The orange points are data points, and the green points are generated points.

Figure 3.8: Difference-set generation for the CelebA dataset ([1]). $p_{\bar{d}}$ consists of 20,000 images from the CelebA dataset, of which 1,000 show people with glasses and the others show people without glasses. $p_d$ consists of 19,000 images, all of people without glasses. In this case, our generator successfully learns to produce images of people with glasses. Note that α = 0.95.

Chapter 4 Theoretical Results

The subsequent proofs rely on two assumptions. First, in a nonparametric setting, we assume that both the generator and the discriminator have infinite capacity. Second, $p_g$ is defined as the distribution of the samples drawn from $G(z)$ with $z \sim p_z$. We first derive the optimal discriminator given $G$ and then show that minimizing $V(G, D)$ over $G$ given the optimal discriminator is equivalent to minimizing the Jensen-Shannon divergence between $(1 - \alpha)p_g + \alpha p_d$ and $p_{\bar{d}}$.

Proposition 1. For $G$ being fixed, the optimal discriminator $D$ is

$$D^*_G(x) = \frac{p_{\bar{d}}(x)}{p_{\bar{d}}(x) + (1 - \alpha)p_g(x) + \alpha p_d(x)}.$$

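Proof. Given any generator $G$, the training criterion for the discriminator $D$ is to maximize the quantity $V(G, D)$:

$$\begin{aligned}
V(G, D) &= \int_x p_{\bar{d}}(x) \log D(x)\, dx + (1 - \alpha) \int_z p_z(z) \log(1 - D(G(z)))\, dz + \alpha \int_x p_d(x) \log(1 - D(x))\, dx \\
&= \int_x p_{\bar{d}}(x) \log D(x)\, dx + (1 - \alpha) \int_x p_g(x) \log(1 - D(x))\, dx + \alpha \int_x p_d(x) \log(1 - D(x))\, dx \\
&= \int_x \Big[ p_{\bar{d}}(x) \log D(x) + \big((1 - \alpha)p_g(x) + \alpha p_d(x)\big) \log(1 - D(x)) \Big]\, dx.
\end{aligned}$$

For any $(a, b) \in \mathbb{R}^2 \setminus \{(0, 0)\}$, the function $y \mapsto a \log(y) + b \log(1 - y)$ achieves its maximum in $[0, 1]$ at $y = \frac{a}{a + b}$. Applying this pointwise with $a = p_{\bar{d}}(x)$ and $b = (1 - \alpha)p_g(x) + \alpha p_d(x)$ gives the stated $D^*_G(x)$. The discriminator only needs to be defined within $\mathrm{Supp}(p_{\bar{d}}) \cup \mathrm{Supp}(p_d) \cup \mathrm{Supp}(p_g)$.

This completes the proof.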
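The pointwise maximization used in the proof can be checked numerically. The sketch below evaluates the formula of Proposition 1 at a single point and compares it with a grid search over $y$; the density values 0.3, 0.5, and 0.1 are arbitrary numbers chosen for illustration, and the function names are hypothetical.

```python
import math

def optimal_discriminator(p_dbar, p_g, p_d, alpha):
    """D*_G(x) from Proposition 1, evaluated from density values at one point x."""
    return p_dbar / (p_dbar + (1.0 - alpha) * p_g + alpha * p_d)

def pointwise_objective(y, a, b):
    """The integrand a*log(y) + b*log(1 - y) that D(x) = y maximizes at each x."""
    return a * math.log(y) + b * math.log(1.0 - y)

alpha = 0.8
a = 0.3                                  # value of p_dbar(x) at the chosen point
b = (1.0 - alpha) * 0.5 + alpha * 0.1    # (1 - alpha) * p_g(x) + alpha * p_d(x)
y_star = optimal_discriminator(0.3, 0.5, 0.1, alpha)

# Brute-force search over a grid of candidate discriminator outputs.
grid = [i / 1000 for i in range(1, 1000)]
y_best = max(grid, key=lambda y: pointwise_objective(y, a, b))

# Where p_dbar vanishes but p_d does not, the optimal output is exactly 0.
d_outside = optimal_discriminator(0.0, 0.0, 1.0, alpha)
```

The last line matches the discussion of negative density in Sec. 3.3: at points that lie in $p_d$ but not in $p_{\bar{d}}$, the optimal discriminator outputs 0.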

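Moreover, $D$ can be considered to discriminate between samples from $p_{\bar{d}}$ and those from $(1 - \alpha)p_g(x) + \alpha p_d(x)$. By substituting the optimal discriminator into $V(G, D)$, we obtain

$$\begin{aligned}
C(G) &= \max_D V(G, D) \\
&= \mathbb{E}_{x \sim p_{\bar{d}}(x)}[\log D^*_G(x)] + (1 - \alpha)\, \mathbb{E}_{z \sim p_z(z)}[\log(1 - D^*_G(G(z)))] + \alpha\, \mathbb{E}_{x \sim p_d(x)}[\log(1 - D^*_G(x))] \\
&= \mathbb{E}_{x \sim p_{\bar{d}}(x)}[\log D^*_G(x)] + \mathbb{E}_{x \sim p^*(x)}[\log(1 - D^*_G(x))] \\
&= \mathbb{E}_{x \sim p_{\bar{d}}(x)}\left[\log \frac{p_{\bar{d}}(x)}{p_{\bar{d}}(x) + (1 - \alpha)p_g(x) + \alpha p_d(x)}\right] + \mathbb{E}_{x \sim p^*(x)}\left[\log \frac{(1 - \alpha)p_g(x) + \alpha p_d(x)}{p_{\bar{d}}(x) + (1 - \alpha)p_g(x) + \alpha p_d(x)}\right],
\end{aligned} \tag{4.1}$$

where $p^*(x) = (1 - \alpha)p_g(x) + \alpha p_d(x)$ and the third equality holds because of the linearity of expectation.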

Actually, the results so far show the optimal solution of D given G being fixed in (4.1).

Now, the next step is to find the optimal G with D^∗_Gbeing fixed.

Theorem 1. The global minimum of $C(G)$ is achieved if and only if $(1-\alpha)p_g(x) + \alpha p_d(x) = p_{\bar{d}}(x)$ for all $x$. At that point, $C(G)$ achieves the value $-\log 4$.


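Proof. We start from

$$\begin{aligned}
(4.1) &= -\log(4) + \mathbb{E}_{x \sim p_{\bar{d}}(x)}\left[\log \frac{2\,p_{\bar{d}}(x)}{p_{\bar{d}}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)}\right] + \mathbb{E}_{x \sim p^*(x)}\left[\log \frac{2\big((1-\alpha)p_g(x) + \alpha p_d(x)\big)}{p_{\bar{d}}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)}\right] \\
&= -\log(4) + \mathrm{KL}\left(p_{\bar{d}} \,\Big\|\, \frac{p_{\bar{d}} + (1-\alpha)p_g + \alpha p_d}{2}\right) + \mathrm{KL}\left((1-\alpha)p_g + \alpha p_d \,\Big\|\, \frac{p_{\bar{d}} + (1-\alpha)p_g + \alpha p_d}{2}\right) \\
&= -\log(4) + 2\,\mathrm{JSD}\big(p_{\bar{d}} \,\|\, (1-\alpha)p_g + \alpha p_d\big),
\end{aligned}$$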

where $p^*(x) = (1-\alpha)p_g(x) + \alpha p_d(x)$, KL is the Kullback-Leibler divergence, and JSD is the Jensen-Shannon divergence. The JSD attains its minimal value, 0, iff both distributions are the same, namely $p_{\bar{d}} = (1-\alpha)p_g + \alpha p_d$. Because $p_g(x)$ is always nonnegative, the two distributions can be the same only if $\alpha p_d(x) \le p_{\bar{d}}(x)$ for all $x$. This completes the proof.
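The identity $C(G) = -\log 4 + 2\,\mathrm{JSD}$ and the minimizer in Theorem 1 can be checked numerically on a small discrete example. The sketch below is only an illustration: the four-point distributions, $\alpha = 0.5$, and the helper names `kl`, `jsd`, and `C` are assumptions chosen so that $\alpha p_d \le p_{\bar{d}}$ holds everywhere.

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions, skipping points where p = 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence between discrete distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def C(p_dbar, p_d, p_g, alpha):
    """Discrete analogue of (4.1) with the optimal discriminator plugged in."""
    mix = (1 - alpha) * p_g + alpha * p_d
    total = p_dbar + mix
    out = 0.0
    for w in (p_dbar, mix):
        mask = w > 0
        out += np.sum(w[mask] * np.log(w[mask] / total[mask]))
    return out

alpha = 0.5
p_d = np.array([0.5, 0.5, 0.0, 0.0])
p_dbar = np.array([0.25, 0.25, 0.25, 0.25])       # satisfies alpha * p_d <= p_dbar

# Generator distribution predicted by Theorem 1: mass only where p_dbar exceeds alpha * p_d.
p_g_star = (p_dbar - alpha * p_d) / (1 - alpha)
c_star = C(p_dbar, p_d, p_g_star, alpha)

# Any other generator distribution, e.g. the uniform one, gives a larger value.
p_g_other = np.full(4, 0.25)
c_other = C(p_dbar, p_d, p_g_other, alpha)
mix_other = (1 - alpha) * p_g_other + alpha * p_d

print(c_star, -np.log(4))
print(c_other, -np.log(4) + 2 * jsd(p_dbar, mix_other))
```

In this toy case the optimal generator places all of its mass on the two points that $p_d$ does not cover, which is the difference-set behavior that DSGAN is designed to produce.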

Note that $(1-\alpha)p_g(x) + \alpha p_d(x) = p_{\bar{d}}(x)$ may not hold if $\alpha p_d(x) > p_{\bar{d}}(x)$. DSGAN nevertheless still works, based on two facts: i) given $D$, $V(G, D)$ is a convex function of $p_g$, and ii) because $\int_x p_g(x)\,dx = 1$, the set of all feasible solutions $p_g$ is convex. In other words, there always exists a global minimum of $V(G, D)$ given $D$, but its value may not be $-\log(4)$.

In the following, we show that, at the global minimum, the support of $p_g$ is contained in the difference between the supports of $p_{\bar{d}}$ and $p_d$, so that the desired $p_g$ can be generated by designing an appropriate $p_{\bar{d}}$.

Proposition 2. Suppose $\alpha p_d(x) \ge p_{\bar{d}}(x)$ for all $x \in \mathrm{Supp}(p_d)$ and all density functions $p_d(x)$, $p_{\bar{d}}(x)$ and $p_g(x)$ are continuous. If the global minimum of $C(G)$ is achieved, then

$$\mathrm{Supp}(p_g) \subseteq \mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d).$$


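Proof. Recall

$$\begin{aligned}
C(G) &= \int_x \left[ p_{\bar{d}}(x) \log \frac{p_{\bar{d}}(x)}{p_{\bar{d}}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)} + p^*(x) \log \frac{(1-\alpha)p_g(x) + \alpha p_d(x)}{p_{\bar{d}}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)} \right] dx \\
&= \int_x S(p_g; x)\,dx \\
&= \int_{x \in \mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d)} S(p_g; x)\,dx + \int_{x \in \mathrm{Supp}(p_d)} S(p_g; x)\,dx.
\end{aligned}$$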

Here $S(p_g; x)$ abbreviates the integrand. For any $x$, $S(p_g; x)$ is nonincreasing in $p_g(x)$, and $S(p_g; x) \le 0$ always holds. Specifically, $S(p_g; x)$ decreases as $p_g(x)$ increases if $p_{\bar{d}}(x) > 0$, whereas $S(p_g; x)$ attains its maximum value, zero, for any $p_g(x)$ if $p_{\bar{d}}(x) = 0$. Since DSGAN aims to minimize $C(G)$ under the constraint $\int_x p_g(x)\,dx = 1$, the solution attaining the global minimum must satisfy $p_g(x) = 0$ wherever $p_{\bar{d}}(x) = 0$; otherwise, there exists another solution with a smaller value of $C(G)$. Thus, $\mathrm{Supp}(p_g) \subseteq \mathrm{Supp}(p_{\bar{d}})$.

Furthermore,

$$T(p_g; x) = \frac{\partial S(p_g; x)}{\partial p_g(x)} = \log \frac{(1-\alpha)p_g(x) + \alpha p_d(x)}{p_{\bar{d}}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)},$$

which should be as small as possible in order to minimize $C(G)$, is increasing in $p_g(x)$ and converges to 0. We then show that $T(p_g; x)$ for $x \in \mathrm{Supp}(p_{\bar{d}}) \cap \mathrm{Supp}(p_d)$ is always larger than that for $x \in \mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d)$ for all $p_g$. Specifically,

1. When $x \in \mathrm{Supp}(p_{\bar{d}}) \cap \mathrm{Supp}(p_d)$, $T(p_g; x) \ge \log\frac{1}{2}$ always holds due to the assumption $\alpha p_d(x) \ge p_{\bar{d}}(x)$.

2. When $x \in \mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d)$, $T(p_g; x) < \log\frac{1}{2}$ for all $p_g(x)$ satisfying $(1-\alpha)p_g(x) \le p_{\bar{d}}(x)$.

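Thus, the minimizer prefers $p_g(x) > 0$ for $x \in \mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d)$ with $(1-\alpha)p_g(x) \le p_{\bar{d}}(x)$. It remains to check whether there exists a solution $p_g$ such that $(1-\alpha)p_g(x) \le p_{\bar{d}}(x)$ and $\int_{x \in \mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d)} p_g(x)\,dx = 1$, which implies $p_g(x) = 0$ for $x \in \mathrm{Supp}(p_{\bar{d}}) \cap \mathrm{Supp}(p_d)$.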


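Based on the following chain of implications,

$$\begin{aligned}
& \int_{x \in \mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d)} p_{\bar{d}}(x)\,dx + \int_{x \in \mathrm{Supp}(p_d)} p_{\bar{d}}(x)\,dx = 1 \\
\Rightarrow\ & \int_{x \in \mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d)} p_{\bar{d}}(x)\,dx \ge 1 - \int_{x \in \mathrm{Supp}(p_d)} \alpha p_d(x)\,dx \\
\Rightarrow\ & \int_{x \in \mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d)} p_{\bar{d}}(x)\,dx \ge 1 - \alpha \\
\Rightarrow\ & \int_{x \in \mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d)} p_{\bar{d}}(x)\,dx \ge \int_{x \in \mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d)} (1-\alpha)p_g(x)\,dx,
\end{aligned}$$

the last inequality implies that a feasible solution must exist. This completes the proof.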

In sum, the generator tends to output samples located in the high-density areas of $p_{\bar{d}} - \alpha p_d$.
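The support statement of Proposition 2 can be illustrated on a discrete toy problem by minimizing the discrete analogue of $C(G)$ directly over $p_g$. The sketch below is only an illustration under stated assumptions: the four-point distributions, $\alpha = 0.8$, and the exponentiated-gradient update (a generic method for optimizing over the probability simplex) are choices made here and are not Algorithm 1. The gradient used is $(1-\alpha)\log\frac{p^*(x)}{p_{\bar{d}}(x) + p^*(x)}$, which is proportional to $T(p_g; x)$.

```python
import numpy as np

alpha = 0.8
p_d = np.array([0.5, 0.5, 0.0, 0.0])      # toy p_d, supported on points 0 and 1
p_dbar = np.array([0.3, 0.3, 0.2, 0.2])   # toy p_dbar; alpha * p_d >= p_dbar on Supp(p_d)

def grad_C(p_g):
    """Gradient of the discrete C(G) with respect to p_g(x)."""
    mix = (1.0 - alpha) * p_g + alpha * p_d
    return (1.0 - alpha) * np.log(mix / (p_dbar + mix))

p_g = np.full(4, 0.25)                    # start from the uniform distribution
for _ in range(500):
    p_g = p_g * np.exp(-grad_C(p_g))      # exponentiated-gradient step (keeps p_g > 0)
    p_g /= p_g.sum()                      # renormalize onto the simplex

print(np.round(p_g, 4))
```

The mass of `p_g` on points 0 and 1, where $p_d$ is supported, shrinks toward zero, while the remaining mass settles on points 2 and 3, i.e., on $\mathrm{Supp}(p_{\bar{d}}) - \mathrm{Supp}(p_d)$.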

Another concern is the convergence of Algorithm 1.

Proposition 3. Suppose that, in Algorithm 1, the discriminator reaches its optimal value given $G$ and $p_g$ is updated by minimizing

$$\mathbb{E}_{x \sim p_{\bar{d}}(x)}\big[\log D^*_G(x)\big] + \mathbb{E}_{x \sim p^*(x)}\big[\log\big(1 - D^*_G(x)\big)\big].$$

If $G$ and $D$ have enough capacity, then $p_g$ converges to $\arg\min_{p_g} \mathrm{JSD}\big(p_{\bar{d}} \,\|\, (1-\alpha)p_g + \alpha p_d\big)$.

Proof. Consider $V(G, D) = U(p_g, D)$ as a function of $p_g$. By the proof idea of Proposition 2 in [14], if $f(x) = \sup_{\alpha \in A} f_\alpha(x)$ and $f_\alpha(x)$ is convex in $x$ for every $\alpha$, then $\partial f_\beta(x) \in \partial f$ if $\beta = \arg\sup_{\alpha \in A} f_\alpha(x)$. In other words, if $\sup_D V(G, D)$ is convex in $p_g$, the subderivatives of $\sup_D V(G, D)$ include the derivative of the function at the point where the maximum is attained, which implies convergence with sufficiently small updates of $p_g$. This completes the proof.


Chapter 5 Applications

DSGAN has been applied to three problems: semi-supervised learning, robustness enhancement of deep networks, and novelty detection. For semi-supervised learning, DSGAN acts as a “bad generator,” which creates complement samples in the feature space of real data. For robustness enhancement, DSGAN generates adversarial examples located in the low-density areas of the training data. For novelty detection, DSGAN generates samples (unseen data) that serve as boundary points around the training data.

5.1 Semi-Supervised Learning

Semi-supervised learning (SSL) trains a model using a small amount of labeled data together with a large amount of unlabeled data. Existing SSL methods based on generative models (e.g., VAE [31] and GAN [4]) obtain good empirical results. [5] shows theoretically that good semi-supervised learning requires a bad GAN with the objective function

$$\max_D\ \mathbb{E}_{x, y \sim L} \log P_D(y \mid x, y \le K) + \mathbb{E}_{x \sim p_d(x)} \log P_D(y \le K \mid x) + \mathbb{E}_{x \sim p_g(x)} \log P_D(K + 1 \mid x), \tag{5.1}$$

where $(x, y)$ denotes a data point and its corresponding label, $\{1, 2, \dots, K\}$ denotes the label space for classification, and $L = \{(x, y)\}$ is the labeled dataset. Moreover, in the semi-supervised setting, $p_d$ in (5.1) is the distribution of unlabeled data. Note that the discriminator $D$ in GAN also plays the role of the classifier. If the generator distribution exactly matches the real data distribution (i.e., $p_g = p_d$), then the classifier trained with the objective function (5.1) and the unlabeled data cannot outperform the one trained by supervised learning with the objective function

$$\max_D\ \mathbb{E}_{x, y \sim L} \log P_D(y \mid x, y \le K). \tag{5.2}$$

On the contrary, the generator should generate complement samples, which lie in the low-density areas of $p_d$. Under some mild assumptions, these complement samples help $D$ learn correct decision boundaries in the low-density areas, because the probabilities of the true classes are forced to be low in out-of-distribution areas.

The complement samples in [5] are complicated to produce. We will demonstrate in Sec. 6 that DSGAN generates complement samples easily.
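To make the three terms of objective (5.1) concrete, the sketch below evaluates a mini-batch estimate of it with NumPy, assuming the classifier outputs $K+1$ logits whose last column plays the role of the extra class $K+1$. The function name `bad_gan_objective`, the zero-based label encoding, and the toy logits are hypothetical choices for illustration; they are not taken from [5] or from the DSGAN implementation.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def bad_gan_objective(logits_lab, y_lab, logits_unl, logits_gen):
    """Mini-batch estimate of (5.1); each logits array has K + 1 columns."""
    # Labeled term: log P_D(y | x, y <= K), a softmax restricted to the K real classes.
    lp_real = log_softmax(logits_lab[:, :-1])
    term_lab = lp_real[np.arange(len(y_lab)), y_lab].mean()
    # Unlabeled term: log P_D(y <= K | x), the total probability of the K real classes.
    lp_unl = log_softmax(logits_unl)
    term_unl = np.log(np.exp(lp_unl[:, :-1]).sum(axis=1)).mean()
    # Generated term: log P_D(K + 1 | x), the probability of the extra class.
    term_gen = log_softmax(logits_gen)[:, -1].mean()
    return term_lab + term_unl + term_gen

# Toy example with K = 3 real classes (labels are 0-based here, an assumption).
y = np.array([0, 2])
good_lab = np.array([[10.0, 0.0, 0.0, -10.0], [0.0, 0.0, 10.0, -10.0]])
good_unl = np.array([[10.0, 0.0, 0.0, -10.0]])
good_gen = np.array([[0.0, 0.0, 0.0, 10.0]])

val_good = bad_gan_objective(good_lab, y, good_unl, good_gen)
val_flat = bad_gan_objective(np.zeros((2, 4)), y, np.zeros((1, 4)), np.zeros((1, 4)))
print(val_good, val_flat)
```

A classifier that assigns labeled and unlabeled data to real classes and generated data to class $K+1$ obtains a value close to zero, whereas uninformative (all-zero) logits give a clearly lower value.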

5.2 Robustness Enhancement of Deep Networks

Deep neural networks have had an impact on our daily life. Neural networks, however, are vulnerable to adversarial examples, as evidenced in recent studies [32, 33]. Thus, there has been significant interest in how to enhance the robustness of neural networks. Unfortunately, if the adversary has full access to the network (the white-box attack setting), a complete defense strategy has not yet been found.

In the literature, it is common to see a well-trained deep neural model reaching more than 90% accuracy on a classification task, and machines seem to beat humans in recognition nowadays. Nonetheless, machine vision is surprisingly fragile. For most inputs that are classified correctly, one can construct an adversarial example by adding a specific noise to the original input so that the prediction becomes totally wrong. In most cases, the original image and its adversarial example are indistinguishable to humans, as in Fig. 5.1.

One can create an adversarial example by solving (5.3), where $p$ is typically 2 or $\infty$. Intuitively, optimizing (5.3) finds a perturbation of the input that makes the model less likely to predict the correct label.

$$\max_{\|\delta\|_p \le \epsilon} \ell(x + \delta; y; C_\theta) \tag{5.3}$$

Figure 5.1: Demonstration of an adversarial example. Adding a special noise to the panda image changes the prediction of the model to “gibbon”. Moreover, the noise in the adversarial example is imperceptible to humans.

[34] surveyed the state-of-the-art defense strategies and showed that adversarial training [35] is more robust than other strategies. Given a trained classifier $C$ parameterized by $\theta$ and a loss function $\ell(x; y; C_\theta)$, adversarial training solves a min-max game, where the first step is to find adversarial examples within an $\epsilon$-ball that maximize the loss, and the second step is to train the model to minimize the loss.

Specifically, the objective in [35] is

$$\arg\min_\theta\ \mathbb{E}_{(x, y) \sim L}\left[\max_{\delta \in [-\epsilon, \epsilon]^N} \ell(x + \delta; y; C_\theta)\right]. \tag{5.4}$$

The authors used projected gradient descent (PGD) to find adversarial examples by maximizing the inner optimization problem.
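The inner maximization over $\delta \in [-\epsilon, \epsilon]^N$ can be sketched in a few lines. The example below is a minimal illustration and not the setup of [35]: it assumes a linear logistic classifier as a stand-in for $C_\theta$, uses the common signed-gradient step for the $\ell_\infty$ case, and the names `loss_and_grad` and `pgd_linf` as well as all numeric values are hypothetical.

```python
import numpy as np

def loss_and_grad(x, y, w, b):
    """Logistic loss of a linear classifier and its gradient w.r.t. the input x (y in {0, 1})."""
    z = x @ w + b
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_x = (p - y) * w
    return loss, grad_x

def pgd_linf(x, y, w, b, eps, step, n_steps):
    """Approximate the inner max over delta in [-eps, eps]^N by projected gradient ascent."""
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        _, g = loss_and_grad(x + delta, y, w, b)
        # Ascent step on the loss, followed by projection back onto the eps-box.
        delta = np.clip(delta + step * np.sign(g), -eps, eps)
    return delta

# Toy classifier and input (assumed values for illustration only).
w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.5, -0.3, 0.2])
y = 1

delta = pgd_linf(x, y, w, b, eps=0.1, step=0.025, n_steps=10)
loss_clean, _ = loss_and_grad(x, y, w, b)
loss_adv, _ = loss_and_grad(x + delta, y, w, b)
print(loss_clean, loss_adv)
```

In adversarial training, perturbed inputs such as `x + delta` would then be fed back into the outer minimization over $\theta$; that outer loop is omitted here.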

Adversarial training requires a pretrained classifier to run PGD, so the resulting adversarial examples may be effective only against specific classifiers. In contrast, DSGAN generates unseen data in the low-density areas, which include adversarial examples, because [36] pointed out that adversarial examples are frequently located in low-density areas. In other words, the data generated by DSGAN are more universal but less effective against specific classifiers. In Sec. 6.2, we enhance the robustness of deep networks in a semi-supervised manner, using the samples generated by DSGAN to fine-tune $C_\theta$. An $\epsilon$-ball in terms of $\ell_2$ or $\ell_\infty$ can be incorporated intuitively into the generation of adversarial examples.

5.3 Novelty Detection

Novelty detection determines whether a query example comes from a seen class. If the samples of the seen class are regarded as positive data, the difficulty is the absence of negative data in the training phase, so supervised learning cannot be applied. To overcome this problem, one classical method is the One-Class SVM (OCSVM) [37], which requires only positive data as training inputs. However, OCSVM often suffers from the curse of dimensionality due to its poor computational scalability.

Recently, novelty detection has made great progress with the advent of deep learning. [38][39] focus on learning a representative latent space for the seen class. At test time, the query image is projected onto the learned latent space, and the difference between the query image and its inverse image (reconstruction) is measured. In other words, all we need is to train an encoder for projection and a decoder for reconstruction.
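The project-then-reconstruct procedure can be sketched with a linear encoder and decoder. In the example below, a one-dimensional PCA subspace stands in for a trained deep model, which is an assumption made purely to keep the sketch self-contained. The synthetic seen class, the threshold `tau` set to the 99% quantile of training errors, and the helper names `enc`, `dec`, and `recon_error` are likewise illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical seen class: points close to a 1-D line embedded in 3-D.
direction = np.array([1.0, 2.0, -1.0]) / np.sqrt(6.0)
x_train = rng.normal(size=(200, 1)) * direction + 0.01 * rng.normal(size=(200, 3))

# Fit a linear encoder/decoder by PCA (top principal direction of the training data).
mean = x_train.mean(axis=0)
_, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
basis = vt[:1]                                   # shape (1, 3)

def enc(x):
    """Project onto the learned latent space."""
    return (x - mean) @ basis.T

def dec(z):
    """Map a latent code back to the input space (reconstruction)."""
    return z @ basis + mean

def recon_error(x):
    """Squared Euclidean distance between x and its reconstruction."""
    return np.sum((x - dec(enc(x))) ** 2, axis=-1)

# Assumed threshold: 99% quantile of the training reconstruction errors.
tau = np.quantile(recon_error(x_train), 0.99)

x_seen = 0.7 * direction                         # lies on the seen-class line
x_novel = np.array([1.0, -1.0, -1.0])            # orthogonal to the line
print(recon_error(x_seen) <= tau, recon_error(x_novel) <= tau)
```

The query on the line is reconstructed almost perfectly and is accepted as the seen class, whereas the off-line query has a large reconstruction error and is rejected.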

In this setting, an autoencoder (AE) is usually adopted to learn both the encoder and the decoder [38][10]. Let $\mathrm{Enc}(\cdot)$ be the encoder and $\mathrm{Dec}(\cdot)$ be the decoder. The loss function of the AE is defined as

$$\min_{\mathrm{Enc}, \mathrm{Dec}}\ \mathbb{E}_{x \sim p_{pos}(x)}\left[\|x - \mathrm{Dec}(\mathrm{Enc}(x))\|_2^2\right], \tag{5.5}$$

where $p_{pos}$ is the distribution of the seen class. After training, a query example $x_{test}$ is classified as the seen class if

$$\|x_{test} - \mathrm{Dec}(\mathrm{Enc}(x_{test}))\|_2^2 \le \tau, \tag{5.6}$$
