National Taiwan University
College of Electrical Engineering and Computer Science
Graduate Institute of Communication Engineering
Master Thesis
Difference-Seeking Generative Adversarial Network: Unseen Data Generation
宋易霖 Yi-Lin Sung
Advisor: 貝蘇章 Soo-Chang Pei, Ph.D.
May 2019
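Acknowledgements
I am grateful to everyone I met during these two years of my master's studies. They made me realize that, even though my paper submissions kept failing, this journey was still a fulfilling one.
First, I would like to thank Professor Soo-Chang Pei (貝蘇章). Thank you for being willing to take a student from the chemical engineering department into the lab. Your patient guidance helped me grow, and your tireless diligence was the best example to learn from. After graduation I will carry the same spirit and strive to become someone who contributes to the technology industry. Of course, the first step is still to get a paper accepted. I wish you a happy and active retirement!
I would also like to thank Professor Hung-Yi Lee (李宏毅). Taking two of his courses opened the door to deep learning for me. Later, as a teaching assistant, I came to admire his insistence on rigorous knowledge and his keen observation of experiments. I hope to keep exchanging ideas with him as I continue on this path.
The completion of this thesis owes a great deal to my senior labmate 謝松憲. Beyond that, the weekly discussions with him during my second year were certainly the main reason for my progress. He made me understand how important mathematical foundations are for research, and because of that I took several mathematics courses that were painful in retrospect but very rewarding. I also greatly admire his research ideas and his command of the details. We must get this work published at a top AI conference!
Of course, I also thank my classmates and juniors in the lab. Thank you for helping me through DSP; otherwise the consequences would have been unthinkable. Thank you also for putting up with someone as slow to warm up as I am. It is a pity that I only felt close to everyone right before graduating. I do not know how often we will meet in the future, but if you need me and I am able to help, I will give you my full support.
Finally, in the time-honored way, I thank my family and my girlfriend for always believing in me even though I am clearly not that strong. I will work hard toward my goals and give back to society and to you.
Written on July 15, 2019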
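Abstract (translated from the Chinese version)
Unseen data broadly refers to data that do not fall within the distribution of the training data, and they are important in applications such as semi-supervised learning, improving the robustness of networks, and novelty detection.
Unseen data are usually hard to obtain, but if an algorithm can generate them for use during training, the model can be strengthened substantially. How to generate such data is therefore a common research topic. Different applications often require different unseen data, and different methods currently exist for different applications. In this thesis, we propose an algorithm, the Difference-Seeking Generative Adversarial Network, that can generate various kinds of unseen data. We observe that the distribution of unseen data is often the difference of two known distributions whose samples are relatively easy to collect and can even be derived from the training data. We apply the network to semi-supervised learning, robustness enhancement of deep networks, and novelty detection, and the experimental results show that our method is effective. In addition, we provide a theoretical proof that guarantees the convergence of the algorithm.
Keywords: difference-set learning, generative adversarial network, semi-supervised learning, robust deep networks, novelty detection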
Abstract
Unseen data, which are not samples from the distribution of the training data and are difficult to collect, have proven important in many applications (e.g., novelty detection, semi-supervised learning, and adversarial training). In this paper, we introduce a general framework, called Difference-Seeking Generative Adversarial Network (DSGAN), to create various kinds of unseen data. The novelty is to consider the probability density of the unseen data distribution as the difference between the densities of two distributions, $p_{\bar{d}}$ and $p_d$, whose samples are relatively easy to collect. DSGAN can learn the target distribution $p_t$ (the unseen data distribution) using only samples from the two distributions $p_d$ and $p_{\bar{d}}$. In our scenario, $p_d$ is the distribution of seen data and $p_{\bar{d}}$ can be obtained from $p_d$ via simple operations, so only samples of $p_d$ are needed during training. Three key applications, semi-supervised learning, robustness enhancement of neural networks, and novelty detection, are taken as case studies to illustrate that DSGAN can produce various kinds of unseen data. We also provide a theoretical analysis of the convergence of DSGAN.
Keywords: Difference-Seeking, Generative Adversarial Network, Semi-Supervised Learning, Robustness of Neural Network, Novelty Detection
Contents
Acknowledgements (in Chinese)
Acknowledgements
Abstract (in Chinese)
Abstract
1 Introduction
2 Backgrounds
2.1 Deep Generative Model
2.2 Generative Adversarial Network
2.3 Wasserstein GAN
2.4 Semi-Supervised Learning with GANs
2.5 Robust Issue of Neural Networks
2.6 Novelty Detection by Reconstruction Method
2.7 Related Works
3 Proposed Method: DSGAN
3.1 Formulation
3.2 Case Study on Synthetic Data and MNIST
3.2.1 Case Study on Various Unseen Data Generation
3.3 Discussions about the objective function of DSGAN
3.4 Tricks for Stable Training
3.5 Appendix: More Results for Case Study
4 Theoretical Results
5 Applications
5.1 Semi-Supervised Learning
5.2 Robustness Enhancement of Deep Networks
5.3 Novelty Detection
6 Experiments
6.1 DSGAN in Semi-Supervised Learning
6.1.1 Datasets: MNIST, SVHN, and CIFAR-10
6.1.2 Main Results
6.1.3 Appendix: Experimental Details
6.2 DSGAN in Robustness Enhancement of Deep Networks
6.2.1 Experiments Settings
6.2.2 Main Results
6.2.3 Appendix: Experimental Details
6.3 DSGAN in Novelty Detection
6.3.1 Main Results
6.3.2 Experimental Details
7 Conclusions
Bibliography
List of Figures
1.1 Illustration of the differences between traditional GAN and DSGAN.
2.1 The workflow of GAN. (Source: https://medium.freecodecamp.org/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394)
3.1 Complement points (in green) between 2 circles (in orange).
3.2 Boundary points (in green) among 4 circles (in orange).
3.3 Illustration of generating unseen data on the boundary around the training data. First, convolving $p_d$ with a normal distribution makes the density on the boundary nonzero. Second, we seek $p_g$ such that Eq. (3.1) holds, where the support set of $p_g$ is approximated by the difference between the support sets of $p_{\bar{d}}$ and $p_d$.
3.4 Illustration of difference-set seeking in MNIST.
3.5 DSGAN learns the difference between two sets.
3.6 Influence of $\alpha$ on the synthetic dataset. In this example, $p_d$ is the orange rectangle (bounded by $x \le 1$, $x \ge -1$, $y \ge -0.8$, and $y \le 0.8$), and $p_{\bar{d}}$ is the same rectangle shifted right by 1 unit (not shown in the figures). The generated points of $p_g$ (green points) move farther away from $p_d$ as $\alpha$ increases. When $\alpha$ is 0.5, $p_g$ learns the perfect difference between $p_{\bar{d}}$ and $p_d$. When $\alpha$ is 0.95, $p_g$ generates the rightmost points of $p_{\bar{d}}$. The contours show the output of the discriminator; the generator moves toward regions with higher scores. Note that the outputs of the discriminator are not restricted to [0, 1], because we use the WGAN structure in this experiment.
3.7 Extra 2D results for boundary sample generation. The orange points are data points, and the green points are generated points.
3.8 Difference set generation for the CelebA dataset [1]. $p_{\bar{d}}$ consists of 20000 images from CelebA, of which 1000 show people with glasses and the others show people without glasses. $p_d$ contains only people without glasses, and its size is 19000. In this case, our generator successfully learned to produce images of people with glasses. Note that $\alpha = 0.95$.
5.1 Demonstration of an adversarial example. Adding a special noise to the panda image changes the prediction of the model to “gibbon”. Moreover, the noise on the adversarial example is imperceptible to humans.
6.1 Accuracy of the baseline and our models after attacks. The blue line indicates the first baseline model. The orange, green, and red lines denote the second baseline models with different ranges of uniform noise. The purple, brown, and pink lines indicate our methods. In the legend, the float number (0.01, 0.03, and 0.05) also indicates the variance of the noise, and “w1” means that $w$ in (6.3) is set to 1. “epsilon” is the $\ell_2$ (or $\ell_\infty$) norm of the difference between the original image (pixel values normalized to the range [−0.5, 0.5]) and the corresponding adversarial example.
6.2 The setting is the same as in Fig. 6.1 except that $w = 3$.
6.3 The setting is the same as in Fig. 6.1 except that $w = 10$.
6.4 Comparison of the reconstruction results of VAE and our method. The seen class, shown at the bottom of the images, is car. The other rows are images from unseen classes. Our method exhibits a larger gap in reconstruction error between seen and unseen data than VAE.
List of Tables
6.1 Semi-supervised learning results on MNIST with and without the sampling trick.
6.2 Comparison of semi-supervised learning between our DSGAN and state-of-the-art methods: CatGAN [2], TripleGAN [3], FM [4], badGAN [5], and CTGAN [6]. For a fair comparison, we only consider GAN-based methods. ∗ indicates the use of the same classifier architecture. † indicates a larger classifier architecture. ‡ indicates the use of data augmentation. The results for MNIST are reported as the number of errors, while the others are reported as the percentage of errors.
6.3 Hyperparameters in semi-supervised learning.
6.4 Network architectures for semi-supervised learning on MNIST. (GN: Gaussian noise)
6.5 The architectures of the generator and discriminator for semi-supervised learning on SVHN and CIFAR-10. N was set to 128 and 192 for SVHN and CIFAR-10, respectively.
6.6 The architecture of the classifiers for semi-supervised learning on SVHN and CIFAR-10. (GN: Gaussian noise, lReLU(leak rate): LeakyReLU(leak rate))
6.7 The architecture of the classifier for robustness enhancement of deep networks on CIFAR-10. (lReLU(leak rate): LeakyReLU(leak rate))
6.8 Comparison between our method (VAE+DSGAN) and state-of-the-art methods: VAE [7], AND [8], DSVDD [9], and OCGAN [10]. The results for CIFAR-10 are reported as AUC values. The number in the top row indicates the seen class, where 0: Plane, 1: Car, 2: Bird, 3: Cat, 4: Deer, 5: Dog, 6: Frog, 7: Horse, 8: Ship, 9: Truck.
6.9 The architectures of the generator and discriminator in DSGAN for novelty detection.
6.10 The architectures of VAE for novelty detection.
Chapter 1 Introduction
Unseen data are not samples from the distribution of the training data and are difficult to collect. It has been demonstrated that unseen samples are useful in several applications. [5] proposed a way to create complement data and showed theoretically that complement data, regarded as unseen data, can improve semi-supervised learning. In novelty detection, [11] proposed a method to generate unseen data and used them to train an anomaly detector. Another related issue is adversarial training [12], where classifiers are trained to resist adversarial examples, which are unseen during the training phase. However, each of the aforementioned methods focuses on producing one specific kind of unseen data rather than generating general types of unseen data.
In this paper, we propose a general framework, called DSGAN, to generate a variety of unseen data. DSGAN is a generative approach. Traditionally, generative approaches, which are usually trained in an unsupervised manner, are developed to learn a data distribution from its samples and then produce novel, high-dimensional samples, such as synthesized images, from the learned distribution [13]. The state-of-the-art approach is the Generative Adversarial Network (GAN) [14]. GAN produces sharp images based on a game-theoretic framework, but can be tricky and unstable to train due to multiple interacting losses. Specifically, GAN consists of two functions, a generator and a discriminator, both represented as parameterized neural networks. The discriminator is trained to classify whether its input belongs to the real data set or to the fake data set created by the generator. The generator learns to map a sample from a latent space to some distribution so as to increase the classification errors of the discriminator.
Nevertheless, if we aim to learn a generator that creates unseen data, traditional GAN requires plenty of training samples from the unseen classes, which contradicts the definition of unseen data. This fact motivates us to present DSGAN, which can generate unseen data by taking seen data as training samples (see Fig. 1.1, which illustrates the difference between GAN and DSGAN). The key idea is to consider the distribution of unseen data as the difference between two distributions, both of which are relatively easy to obtain. For example, out-of-distribution examples for the MNIST dataset can be viewed as belonging to the difference between the universal set and the set of examples in MNIST. Note that in traditional GAN the target distribution is equal to the training data distribution, whereas in DSGAN the target distribution and the training data distribution are different.
In this paper, we make the following contributions:
(1) We propose DSGAN to generate any unseen data, provided that the density of the target (unseen data) distribution is the difference between the densities of two distributions, $p_{\bar{d}}$ and $p_d$. By contrast, traditional GAN fails to learn the difference between two distributions.
(2) We show that DSGAN has the flexibility to learn different target (unseen data) distributions in three key applications: semi-supervised learning, robustness enhancement of neural networks, and novelty detection. Specifically, for novelty detection, DSGAN produces boundary points around the seen data, because this kind of unseen data is easily misclassified. DSGAN also generates boundary samples to increase the robustness of neural networks, but with the distance measured in the $\ell_\infty$ norm. For semi-supervised learning, the unseen data are linear combinations of labeled and unlabeled data, excluding the labeled and unlabeled data themselves.¹
(3) Our theoretical analysis shows that, given enough capacity of the generator and the discriminator, the generator can learn the target distribution $p_t$, whose support set is the difference between the support sets of $p_{\bar{d}}$ and $p_d$, under mild conditions.
¹ A linear combination of labeled and unlabeled data may itself belong to the set of seen data (labeled and unlabeled data), which contradicts the definition of unseen data. Thus, the samples generated by DSGAN should not include the seen data themselves.
Figure 1.1: Illustration of the differences between traditional GAN and DSGAN.
Chapter 2 Backgrounds
2.1 Deep Generative Model
A deep generative model learns the underlying data distribution $P_X$ from limited training data $X$ via a neural network. The learned distribution can be applied to several scenarios, e.g., classification, representation learning, and compressed sensing. Learning a generative model can be viewed as teaching a machine to understand the world. Well-known algorithms for deep generative models include Variational Autoencoders (VAE) [7], Generative Adversarial Networks (GAN) [14], autoregressive models [15], and normalizing flow models [16]. In this thesis, I focus on GANs.
2.2 Generative Adversarial Network
GAN has been one of the most popular generative frameworks in recent years and has been successfully applied to various applications [17], [18], [19]. GAN sets up a min-max game between a generator and a discriminator. The generator tries to learn the data distribution in order to fool the discriminator, while the discriminator learns to distinguish whether its input comes from the data distribution or from the generator.
To learn the generator, one has to define a prior distribution $p_z$ over the input variable $z$; in most cases $p_z$ is $U(0, 1)$ or $N(0, 1)$. The generator is a differentiable mapping $G(z; \theta_g)$ into the data space. Its goal is to find an optimal $\theta_g$ such that the distribution of $G(z; \theta_g)$ (that is, $p_g$) equals the data distribution $p_X$. The discriminator is defined as $D(x; \theta_d)$ and outputs a scalar value. $D(x)$ can be viewed as the probability that $x$ comes from $p_X$. In other words, the discriminator is a binary classifier that distinguishes whether its input is from $p_X$ or from $p_g$. The workflow of GAN is shown in Fig. 2.1.
$D$ is trained to maximize the probability of assigning 1 to real samples $x$ and 0 to generated samples $G(z)$ (denoted $x_g$). Meanwhile, $G$ is trained to maximize $D(x_g)$. This procedure can be treated as a min-max game between $G$ and $D$. More specifically, the objective function of GAN is
$$F(G, D) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \tag{2.1}$$
where $p_d$ denotes the data distribution $p_X$. One optimizes (2.1) by updating $G$ and $D$, that is,
$$\min_G \max_D F(G, D).$$
Under mild assumptions, optimizing (2.1) is equivalent to minimizing the Jensen-Shannon divergence between $p_X$ and $p_g$. The global optimum is reached if and only if $p_g = p_X$. The detailed proof is in [14].
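For reference, the standard argument behind this statement, as given in [14], can be summarized in two steps: the optimal discriminator for a fixed generator, and the value of the objective at that discriminator. The notation $D^*_G$ is borrowed from Chapter 4 of this thesis.

```latex
% Optimal discriminator for a fixed G, and the resulting value of (2.1):
D^*_G(x) = \frac{p_X(x)}{p_X(x) + p_g(x)},
\qquad
F(G, D^*_G) = -\log 4 + 2\,\mathrm{JSD}\left(p_X \,\|\, p_g\right).
% Since JSD >= 0 with equality iff p_X = p_g, the minimum over G is -log 4.
```

The same two-step structure is reused for DSGAN in Chapter 4, with the fake side replaced by a mixture distribution.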
In principle, (2.1) requires the discriminator to be trained to optimality before each generator update. However, this training procedure is inefficient. Therefore, [14] states that the discriminator only needs a constant number of update steps (e.g., 5) per generator update step. [20] points out that using a higher learning rate for the discriminator achieves a similar result. With this technique, we can alternately train $G$ and $D$ for one iteration each, which shortens the training time.
2.3 Wasserstein GAN
Figure 2.1: The workflow of GAN. (Source: https://medium.freecodecamp.org/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394)
Training GAN is known to suffer from several problems, such as an unstable training process, vanishing gradients, and mode collapse in the generator. Many works try to address these issues. [21] proposed convolutional architectures for both the generator and the discriminator to stabilize training. [22] defined an alternative loss function, and the resulting GAN not only generates high-quality images but also trains more stably. [23] mathematically analyzed the training dynamics of GAN and stated that its instability comes from the objective function. Based on [24], [25] proposed Wasserstein GAN (WGAN) with a modified objective function,
$$W(G, D) = \mathbb{E}_{x \sim p_d}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]. \tag{2.2}$$
As in training GAN, we update $G$ and $D$ to minimize and maximize (2.2), respectively, that is,
$$\min_G \max_{\|D\|_L \le 1} W(G, D),$$
where $\|D\|_L \le 1$ denotes that $D$ is 1-Lipschitz continuous.
Training WGAN can be viewed as minimizing the Wasserstein distance (Earth-Mover distance) between $p_X$ and $p_g$. WGAN has been shown to be more stable and to cover more modes of the training data. For the reasons why the Wasserstein distance works better than the Jensen-Shannon divergence in training GAN, refer to [24] and [25].
For WGAN to succeed, one has to enforce the Lipschitz continuity of the discriminator. In [25], the authors use weight clipping to enforce this condition. However, this method may lead to undesired behavior. [26] instead constrains the norm of the gradient of the discriminator's output with respect to its input; the resulting GAN is called WGAN-GP (GP stands for gradient penalty).
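The two ways of enforcing the Lipschitz condition can be illustrated with a minimal sketch. It is not the implementation of [25] or [26]: the function names `clip_weights` and `gradient_penalty`, the scalar-input critic, the finite-difference gradient, and all numeric values are assumptions chosen so that the example runs with the Python standard library only.

```python
import random


def clip_weights(weights, c):
    """Weight clipping: force every parameter into [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]


def gradient_penalty(critic, real, fake, lam, rng, eps=1e-6):
    """Penalty lam * mean((|dD/dx| - 1)^2) for a scalar-input critic.

    The input gradient is estimated by central finite differences at random
    interpolates between paired real and fake samples (an illustrative
    stand-in for automatic differentiation).
    """
    total = 0.0
    for x_real, x_fake in zip(real, fake):
        t = rng.random()
        x_hat = t * x_real + (1.0 - t) * x_fake
        grad = (critic(x_hat + eps) - critic(x_hat - eps)) / (2.0 * eps)
        total += (abs(grad) - 1.0) ** 2
    return lam * total / len(real)


rng = random.Random(0)
real = [1.0, 2.0, 3.0]    # hypothetical real samples
fake = [0.0, -1.0, 0.5]   # hypothetical generated samples

# A critic with slope 3 violates the 1-Lipschitz condition and is penalized;
# a critic with slope 1 is not.
print(gradient_penalty(lambda x: 3.0 * x, real, fake, lam=10.0, rng=rng))
print(gradient_penalty(lambda x: x, real, fake, lam=10.0, rng=rng))
print(clip_weights([-0.5, 0.003, 2.0], c=0.01))
```

Clipping acts on the parameters and only indirectly bounds the slope of the critic, whereas the penalty acts directly on the input gradient, which is the quantity the 1-Lipschitz condition constrains.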
2.4 Semi-Supervised Learning with GANs
Please refer to Sec. 5.1.
2.5 Robust Issue of Neural Networks
Please refer to Sec. 5.2.
2.6 Novelty Detection by Reconstruction Method
Please refer to Sec. 5.3.
2.7 Related Works
We introduce related works on generating unseen data.
[11] proposed a method to generate samples of unseen classes in an unsupervised manner via an adversarial learning strategy. However, it requires solving an optimization problem for each sample, which leads to a high computational cost. In contrast, DSGAN can create an unlimited number of diverse unseen samples. [27] presented a new GAN architecture that can learn the distribution of unseen data from a part of the seen data together with unlabeled data. However, the unlabeled data must be a mixture of seen and unseen samples, whereas DSGAN does not require any unseen data. [5] aims to generate complementary samples (or out-of-distribution samples) but assumes that the in-distribution density can be estimated by a pretrained model such as PixelCNN++, which might be difficult and expensive to train. [28] uses a simple classifier to replace the role of PixelCNN++ in [5], which makes training much easier and more practical. Nevertheless, their method only focuses on generating unseen data in the low-density areas surrounding the seen data, whereas DSGAN is flexible enough to generate different kinds of unseen data (e.g., the linear combinations of seen data described in Sec. 6.1). In addition, their method needs the label information of the data, while ours is fully unsupervised.
Related works on semi-supervised learning and on enhancing the robustness of neural networks are presented in Chapter 5.
Chapter 3
Proposed Method: DSGAN
3.1 Formulation
We denote the generator distribution by $p_g$ and the training data distribution by $p_d$, both in an $N$-dimensional space. Let $p_{\bar{d}}$ be a distribution chosen by the user; for example, $p_{\bar{d}}$ can be the convolution of $p_d$ with a normal distribution. Let $p_t$ be the target distribution that the user is interested in, expressed as
$$(1 - \alpha)\,p_t(x) + \alpha\,p_d(x) = p_{\bar{d}}(x), \tag{3.1}$$
where $\alpha \in [0, 1]$. Our method, DSGAN, aims to learn $p_g$ such that $p_g = p_t$. Note that if the support set of $p_d$ is contained in that of $p_{\bar{d}}$, then there exists at least one $\alpha$ such that the equality in (3.1) holds. Even when the equality does not hold, DSGAN intuitively tries to learn $p_g$ such that
$$p_g(x) \sim \frac{p_{\bar{d}}(x) - \alpha\,p_d(x)}{1 - \alpha}$$
subject to the constraint $p_g(x) \ge 0$. In other words, the generator outputs samples located in the high-density areas of $p_{\bar{d}} - \alpha p_d$. Furthermore, we show in Proposition 2 that DSGAN can learn $p_g$ whose support set is the difference between those of $p_{\bar{d}}$ and $p_d$.
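To make Eq. (3.1) concrete, the following sketch solves it for $p_t$ on a small discrete support. It is only an illustration under assumptions that are not part of the thesis: the function name `target_distribution`, the toy probabilities, and the rule of clipping negative values to zero and renormalizing, which is one simple way to respect $p_g(x) \ge 0$ when the equality cannot hold.

```python
def target_distribution(p_dbar, p_d, alpha):
    """Solve (1 - alpha) * p_t + alpha * p_d = p_dbar for p_t on a discrete support.

    Entries where alpha * p_d exceeds p_dbar would be negative; they are
    clipped to zero and the result is renormalized. The clipping rule is an
    assumption made for illustration. alpha = 1 is excluded to avoid a
    division by zero.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    raw = [(qb - alpha * q) / (1.0 - alpha) for qb, q in zip(p_dbar, p_d)]
    clipped = [max(r, 0.0) for r in raw]
    total = sum(clipped)
    if total == 0.0:
        raise ValueError("no positive mass left after clipping")
    return [c / total for c in clipped]


# Toy support {"1", "7"}: p_d holds only digit "1", p_dbar mixes both digits.
p_d = [1.0, 0.0]
p_dbar = [0.8, 0.2]
p_t = target_distribution(p_dbar, p_d, alpha=0.8)
print(p_t)
```

With these toy numbers the mass of the digit "1" is cancelled by $\alpha p_d$, so the target distribution concentrates on the digit "7".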
First, we formulate the generator and discriminator as in GANs. The inputs $z$ of the generator are drawn from $p_z(z)$ in an $M$-dimensional space. The generator function $G(z; \theta_g): \mathbb{R}^M \to \mathbb{R}^N$ represents a mapping to the data space, where $G$ is a differentiable function with parameters $\theta_g$. The discriminator is defined as $D(x; \theta_d): \mathbb{R}^N \to [0, 1]$ and outputs a single scalar. $D(x)$ can be considered as the probability that $x$ belongs to the class of real data.
Similar to traditional GAN, we train $D$ to distinguish the real data from the fake data sampled from $G$. Meanwhile, $G$ is trained to produce data that are as realistic as possible in order to mislead $D$. In DSGAN, however, the definitions of “real data” and “fake data” differ from those in traditional GAN. The samples from $p_{\bar{d}}$ are considered real, whereas those from the mixture of $p_d$ and $p_g$ are considered fake. The objective function is defined as follows:
$$V(G, D) := \mathbb{E}_{x \sim p_{\bar{d}}(x)}[\log D(x)] + (1 - \alpha)\,\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] + \alpha\,\mathbb{E}_{x \sim p_d(x)}[\log(1 - D(x))]. \tag{3.2}$$
We optimize (3.2) through a min-max game between $G$ and $D$, that is,
$$\min_G \max_D V(G, D).$$
During training, an iterative approach as in traditional GAN alternates between $k$ steps of training $D$ and one step of training $G$. In practice, minibatch stochastic gradient descent via backpropagation is used to update $\theta_d$ and $\theta_g$. In other words, for each of $p_g$, $p_d$, and $p_{\bar{d}}$, $m$ samples are required for computing gradients, where $m$ is the number of samples in a minibatch. The training procedure is given in Algorithm 1. DSGAN suffers from the same drawbacks as traditional GAN (e.g., mode collapse, overfitting, and an overly strong discriminator that makes the generator gradient vanish). Several works [4, 24, 29] address these problems, and their ideas can be readily combined with DSGAN.
[3] and [30] proposed objective functions similar to (3.2). Their goal is to learn the conditional distribution of the training data, whereas we aim to learn the target distribution $p_t$ in Eq. (3.1), not the training data distribution.
Algorithm 1 The training procedure of DSGAN using minibatch stochastic gradient descent. $k$ is the number of steps applied to the discriminator. $\alpha$ is the ratio between $p_g$ and $p_d$ in the mixture distribution. We used $k = 1$ and $\alpha = 0.8$ in experiments.
for number of training iterations do
    for $k$ steps do
        Sample a minibatch of $m$ noise samples $z^{(1)}, \ldots, z^{(m)}$ from $p_z(z)$.
        Sample a minibatch of $m$ samples $x_d^{(1)}, \ldots, x_d^{(m)}$ from $p_d(x)$.
        Sample a minibatch of $m$ samples $x_{\bar{d}}^{(1)}, \ldots, x_{\bar{d}}^{(m)}$ from $p_{\bar{d}}(x)$.
        Update the discriminator by ascending its stochastic gradient:
        $$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x_{\bar{d}}^{(i)}\right) + (1 - \alpha) \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) + \alpha \log\left(1 - D\left(x_d^{(i)}\right)\right) \right]$$
    end for
    Sample a minibatch of $m$ noise samples $z^{(1)}, \ldots, z^{(m)}$ from $p_z(z)$.
    Update the generator by descending its stochastic gradient:
    $$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right)$$
end for
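The two quantities whose gradients are taken in Algorithm 1 can be sketched as plain functions of the discriminator outputs on a minibatch. This is a hedged illustration rather than the code used in the experiments: the function names and the list-of-floats interface are assumptions, and a real implementation would compute the gradients with an automatic differentiation library.

```python
import math


def _mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def discriminator_objective(d_dbar, d_fake, d_data, alpha):
    """Minibatch estimate of V(G, D) in Eq. (3.2); the discriminator ascends it.

    d_dbar: D(x) on samples from p_dbar (treated as real)
    d_fake: D(G(z)) on generated samples (treated as fake, weight 1 - alpha)
    d_data: D(x) on samples from p_d (treated as fake, weight alpha)
    """
    real_term = _mean([math.log(d) for d in d_dbar])
    gen_term = _mean([math.log(1.0 - d) for d in d_fake])
    data_term = _mean([math.log(1.0 - d) for d in d_data])
    return real_term + (1.0 - alpha) * gen_term + alpha * data_term


def generator_objective(d_fake):
    """Minibatch estimate of the generator term; the generator descends it."""
    return _mean([math.log(1.0 - d) for d in d_fake])


# Hypothetical discriminator outputs for a minibatch of m = 4 samples each.
value = discriminator_objective(
    d_dbar=[0.9, 0.8, 0.7, 0.95],
    d_fake=[0.2, 0.1, 0.3, 0.25],
    d_data=[0.1, 0.05, 0.2, 0.15],
    alpha=0.8,
)
print(value, generator_objective([0.2, 0.1, 0.3, 0.25]))
```

When the discriminator outputs 0.5 everywhere, the estimate equals $-\log 4$ for any $\alpha$, which matches the optimal value derived in Chapter 4.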
3.2 Case Study on Synthetic Data and MNIST
3.2.1 Case Study on Various Unseen Data Generation
To gain a more intuitive understanding of DSGAN, we conduct several case studies on 2D synthetic datasets and MNIST. $\alpha = 0.8$ in Eq. (3.1) is used.
Figure 3.1: Complement points (in green) between 2 circles (in orange).
Figure 3.2: Boundary points (in green) among 4 circles (in orange).
Figure 3.3: Illustration of generating unseen data on the boundary around the training data. First, convolving $p_d$ with a normal distribution makes the density on the boundary nonzero. Second, we seek $p_g$ such that Eq. (3.1) holds, where the support set of $p_g$ is approximated by the difference between the support sets of $p_{\bar{d}}$ and $p_d$.
Complement samples generation. Fig. 3.1 illustrates that DSGAN is able to generate complement samples between 2 circles. Taking the density function of the 2 circles as $p_d$, we draw the samples of $p_{\bar{d}}$ as linear combinations of points from the 2 circles. Then, by applying DSGAN, we achieve our goal of generating complement samples. This kind of unseen data is used in semi-supervised learning.
Boundary samples generation. Fig. 3.2 illustrates that DSGAN generates boundary points among 4 circles. This kind of unseen data is used in novelty detection. In this case, we assign $p_d$ and $p_{\bar{d}}$ as “the density function of the 4 circles” and “the convolution of $p_d$ and the normal distribution,” respectively. The intuition behind our idea is also illustrated by a 1D example in Fig. 3.3.
Difference-set generation. We also validate DSGAN on a high-dimensional dataset, MNIST. In this example, we define $p_d$ to be the distribution of digit “1” and $p_{\bar{d}}$ to be the distribution containing both digits “1” and “7”. Since the density $p_d(x)$ is high when $x$ is digit “1,” the generator is prone to output digit “7” with high probability. Difference-set generation is illustrated in Fig. 3.4 and Fig. 3.5.
From the above results, we observe two properties of the generator distribution $p_g$: i) the higher the density $p_d(x)$, the lower the density $p_g(x)$; ii) $p_g$ prefers to output samples from the high-density areas of $p_{\bar{d}}(x) - \alpha p_d(x)$.
In Chapter 4, we will show that optimizing the objective function is equivalent to minimizing the Jensen-Shannon divergence between the mixture distribution (of $p_d$ and $p_g$) and $p_{\bar{d}}$ if $G$ and $D$ are given enough capacity.
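The two constructions of $p_{\bar{d}}$ used in the synthetic case studies can be sketched as sampling routines. Drawing from the convolution of $p_d$ with a normal distribution amounts to adding Gaussian noise to samples of $p_d$. For the complement case, the thesis only says “linear combinations”; the sketch assumes random convex combinations with a uniform mixing weight. The function names, the noise level, and the toy points are hypothetical.

```python
import random


def boundary_candidates(samples, sigma, rng):
    """Draw from p_dbar = p_d convolved with N(0, sigma^2 I) by adding noise."""
    return [[x + rng.gauss(0.0, sigma) for x in point] for point in samples]


def complement_candidates(samples_a, samples_b, rng):
    """Draw from p_dbar as random convex combinations of paired samples.

    The uniform mixing weight is an assumption made for illustration.
    """
    out = []
    for a, b in zip(samples_a, samples_b):
        lam = rng.random()
        out.append([lam * x + (1.0 - lam) * y for x, y in zip(a, b)])
    return out


rng = random.Random(0)
circle_a = [[-2.0, 0.0], [-1.5, 0.5]]   # toy points from the first circle
circle_b = [[2.0, 0.0], [1.5, -0.5]]    # toy points from the second circle
noisy = boundary_candidates(circle_a, sigma=0.1, rng=rng)
mixed = complement_candidates(circle_a, circle_b, rng=rng)
print(noisy)
print(mixed)
```

In both cases $p_{\bar{d}}$ is derived from samples of $p_d$ alone, in line with the claim that only seen data are needed during training.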
Figure 3.4: Illustration of difference-set seeking in MNIST.
Figure 3.5: DSGAN learns the difference between two sets.
3.3 Discussions about the objective function of DSGAN
There are two main issues regarding (3.1). The first is how $\alpha$ influences the learned distribution $p_g$. From the objective function, one can expect that a larger $\alpha$ reduces the overlap between $p_d$ and $p_g$. However, will $p_g$ really be far away from $p_d$ if $\alpha$ is close to 1? Depending on the design of $p_{\bar{d}}$, the answer can be yes or no. Second, in some cases $p_d$ is not fully contained in $p_{\bar{d}}$. In other words, $\frac{p_{\bar{d}}(x) - \alpha p_d(x)}{1 - \alpha}$ can be negative for some $x$ when $\alpha$ is large enough. In this section, we demonstrate that the negative part does not influence the learning of the generator.
We discuss both issues using Fig. 3.6.
The influence of $\alpha$. In Fig. 3.6, the overlapping area between $p_{\bar{d}}$ and $p_d$ is 0.5 unit (taking the area of $p_d$ to be 1 unit), so $\alpha = 0.5$ is the smallest choice that makes $p_g$ disjoint from $p_d$. With $\alpha = 0.8$, the generated points are separated from the points of $p_d$ by a gap. In theory, the result for $\alpha = 0.8$ should be the same as that for $\alpha = 0.5$, since the discriminator should give the same score to the whole area outside $p_d$ but inside $p_{\bar{d}}$. However, due to the continuity of the discriminator (continuity is key to making GANs such as WGAN work), the score of the area right next to $p_d$ is lower than that of areas far from it when $\alpha$ is larger. For this reason, $p_g$ tends to keep away from $p_d$. At $\alpha = 0.95$, $p_g$ is still located inside $p_{\bar{d}}$. In $\frac{p_{\bar{d}}(x) - \alpha p_d(x)}{1 - \alpha}$, there must be some points at which the expression is positive (when $p_{\bar{d}} \ne p_d$), because $\int_x p(x)\,dx = 1$. Therefore, $p_g$ generates such points. In this example, one can observe that $p_g$ is bounded by $p_{\bar{d}}$. When the support of $p_{\bar{d}}$ is much larger than that of $p_d$, an excessive $\alpha$ pushes $p_g$ away from $p_d$. Conversely, one can use a smaller support for $p_{\bar{d}}$ to keep $p_g$ close to $p_d$.
Negative density. In this example, $p_d$ is not fully contained in $p_{\bar{d}}$: the rectangle bounded by $x \le 0$, $x \ge -1$, $y \ge -0.8$, and $y \le 0.8$ is in $p_d$ but not in $p_{\bar{d}}$. However, in the case $\alpha = 0.5$, the generator still generates the perfect difference set even though there are regions with negative density. This can be explained by an intrinsic property of the discriminator. By (3.2), the discriminator's output at points with negative density (denoted $x'$) tends to 0. Looking at the first term of (3.2), since $x'$ is not in $p_{\bar{d}}$, no gradient pushes $D(x')$ up toward 1. $D(x') = 0$ meets our goal that $p_g$ does not overlap with $p_d$. Therefore, although negative density exists, the objective is not affected.
3.4 Tricks for Stable Training
We provide a trick to stabilize the training procedure by reformulating the objective function. Specifically, $V(G, D)$ in (3.2) is reformulated as
$$\begin{aligned} V(G, D) &= \int_x p_{\bar{d}}(x) \log D(x) + \left((1 - \alpha)\,p_g(x) + \alpha\,p_d(x)\right) \log(1 - D(x))\,dx \\ &= \mathbb{E}_{x \sim p_{\bar{d}}(x)}[\log D(x)] + \mathbb{E}_{x \sim (1 - \alpha)p_g(x) + \alpha p_d(x)}[\log(1 - D(x))]. \end{aligned} \tag{3.3}$$
Instead of sampling a minibatch of $m$ samples from each of $p_z$ and $p_d$ as in Algorithm 1, only $(1 - \alpha)m$ and $\alpha m$ samples are required from the two distributions, respectively. The computational cost of training is reduced because fewer samples are needed. Furthermore, although (3.3) is equivalent to (3.2) in theory, we find empirically that training with (3.3) achieves better performance than training with (3.2), as shown in Table 6.1. We conjecture that this is because the equivalence between (3.3) and (3.2) relies on the linearity of expectation, whereas minibatch stochastic gradient descent in practical training may lead to different outcomes.
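A minimal sketch of the sampling trick follows. It assumes that the fake minibatch is formed by pooling roughly $(1 - \alpha)m$ generated samples with $\alpha m$ samples of $p_d$ and averaging $\log(1 - D(x))$ over the pooled batch. The function names and the rounding rule are hypothetical and not taken from the thesis.

```python
import math


def mixture_batch_sizes(m, alpha):
    """Split a fake minibatch of size m into generated and p_d parts.

    Returns (n_generated, n_data) with n_data = round(alpha * m), so that the
    pooled batch approximates a draw from (1 - alpha) * p_g + alpha * p_d.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_data = round(alpha * m)
    return m - n_data, n_data


def pooled_fake_term(d_generated, d_data):
    """Mean of log(1 - D(x)) over the pooled fake batch (second term of Eq. (3.3))."""
    pooled = list(d_generated) + list(d_data)
    return sum(math.log(1.0 - d) for d in pooled) / len(pooled)


n_gen, n_data = mixture_batch_sizes(m=100, alpha=0.8)
# Hypothetical constant discriminator outputs, used only to compare both forms.
pooled = pooled_fake_term([0.2] * n_gen, [0.6] * n_data)
weighted = 0.2 * math.log(1.0 - 0.2) + 0.8 * math.log(1.0 - 0.6)
print(n_gen, n_data, pooled, weighted)
```

With exact proportions the pooled mean coincides with the weighted sum of (3.2) up to floating-point error; the difference observed in practice comes from how the minibatch is sampled.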
3.5 Appendix: More Results for Case Study
Additional results for boundary samples generation and difference set generation are presented in Fig. 3.7 and Fig. 3.8, respectively.
Figure 3.6: Influence of $\alpha$ on the synthetic dataset. Panels: (a) $\alpha = 0.30$, (b) $\alpha = 0.50$, (c) $\alpha = 0.80$, (d) $\alpha = 0.95$. In this example, $p_d$ is the orange rectangle (bounded by $x \le 1$, $x \ge -1$, $y \ge -0.8$, and $y \le 0.8$), and $p_{\bar{d}}$ is the same rectangle shifted right by 1 unit (not shown in the figures). The generated points of $p_g$ (green points) move farther away from $p_d$ as $\alpha$ increases. When $\alpha$ is 0.5, $p_g$ learns the perfect difference between $p_{\bar{d}}$ and $p_d$. When $\alpha$ is 0.95, $p_g$ generates the rightmost points of $p_{\bar{d}}$. The contours show the output of the discriminator; the generator moves toward regions with higher scores. Note that the outputs of the discriminator are not restricted to [0, 1], because we use the WGAN structure in this experiment.
Figure 3.7: Extra 2D results for boundary sample generation. Panels: (a) S shape with compact data points, $\alpha = 0.9$; (b) S shape with scattered data points, $\alpha = 0.9$; (c) 4 Gaussians, $\alpha = 0.9$; (d) 8 Gaussians, $\alpha = 0.8$. The orange points are data points, and the green points are generated points.
Figure 3.8: Difference set generation for the CelebA dataset [1]. $p_{\bar{d}}$ consists of 20000 images from CelebA, of which 1000 show people with glasses and the others show people without glasses. $p_d$ contains only people without glasses, and its size is 19000. In this case, our generator successfully learned to produce images of people with glasses. Note that $\alpha = 0.95$.
Chapter 4
Theoretical Results
Two assumptions are made for the subsequent proofs. First, in a nonparametric setting, we assume that both the generator and the discriminator have infinite capacity. Second, $p_g$ is defined as the distribution of the samples drawn from $G(z)$ under $z \sim p_z$. We first derive the optimal discriminator given $G$ and then show that minimizing $V(G, D)$ over $G$ given the optimal discriminator is equivalent to minimizing the Jensen-Shannon divergence between $(1 - \alpha)p_g + \alpha p_d$ and $p_{\bar{d}}$.
Proposition 1. For fixed $G$, the optimal discriminator $D$ is
$$D^*_G(x) = \frac{p_{\bar{d}}(x)}{p_{\bar{d}}(x) + (1 - \alpha)\,p_g(x) + \alpha\,p_d(x)}.$$
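Proof. Given any generator $G$, the training criterion for the discriminator $D$ is to maximize the quantity $V(G, D)$:
\[
\begin{aligned}
V(G, D) &= \int_x p_{\bar d}(x) \log D(x)\,dx + (1-\alpha)\int_z p_z(z) \log\bigl(1 - D(G(z))\bigr)\,dz + \alpha\int_x p_d(x) \log\bigl(1 - D(x)\bigr)\,dx \\
&= \int_x p_{\bar d}(x) \log D(x)\,dx + (1-\alpha)\int_x p_g(x) \log\bigl(1 - D(x)\bigr)\,dx + \alpha\int_x p_d(x) \log\bigl(1 - D(x)\bigr)\,dx \\
&= \int_x \Bigl[ p_{\bar d}(x) \log D(x) + \bigl((1-\alpha)p_g(x) + \alpha p_d(x)\bigr) \log\bigl(1 - D(x)\bigr) \Bigr]\,dx.
\end{aligned}
\]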
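For any $(a, b) \in \mathbb{R}^2 \setminus \{0, 0\}$, the function $a \log(y) + b \log(1 - y)$ achieves its maximum in $[0, 1]$ at $y = \frac{a}{a+b}$. Applying this pointwise with $a = p_{\bar d}(x)$ and $b = (1-\alpha)p_g(x) + \alpha p_d(x)$ gives $D_G^*(x)$. The discriminator only needs to be defined within $\mathrm{Supp}(p_{\bar d}) \cup \mathrm{Supp}(p_d) \cup \mathrm{Supp}(p_g)$. This completes the proof.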
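The pointwise argument in Proposition 1 can be checked numerically on a small discrete example. The sketch below is only an illustration: the four-point distributions, the value of `alpha`, and the function names are assumptions made here, and integrals are replaced by sums.

```python
import numpy as np

def optimal_discriminator(p_dbar, p_d, p_g, alpha):
    """Closed form from Proposition 1, evaluated on a discrete grid."""
    mix = (1.0 - alpha) * p_g + alpha * p_d
    return p_dbar / (p_dbar + mix)

def value(D, p_dbar, p_d, p_g, alpha):
    """V(G, D) with the integrals replaced by sums over the grid."""
    mix = (1.0 - alpha) * p_g + alpha * p_d
    return float(np.sum(p_dbar * np.log(D) + mix * np.log(1.0 - D)))

# Toy distributions on four points (illustrative values only).
p_dbar = np.array([0.1, 0.2, 0.3, 0.4])
p_d = np.array([0.4, 0.3, 0.2, 0.1])
p_g = np.array([0.25, 0.25, 0.25, 0.25])
alpha = 0.5

D_star = optimal_discriminator(p_dbar, p_d, p_g, alpha)
v_star = value(D_star, p_dbar, p_d, p_g, alpha)

# Randomly drawn discriminators with outputs in (0, 1) should score no higher.
rng = np.random.default_rng(0)
others = rng.uniform(0.01, 0.99, size=(1000, 4))
v_others = np.array([value(D, p_dbar, p_d, p_g, alpha) for D in others])
print(v_star, v_others.max())
```

Because the maximization is done independently at each point, no random candidate should exceed `v_star`.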
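Moreover, $D$ can be considered to discriminate between samples from $p_{\bar d}$ and those from $(1-\alpha)p_g(x) + \alpha p_d(x)$. Substituting the optimal discriminator into $V(G, D)$, we obtain
\[
\begin{aligned}
C(G) &= \max_D V(G, D) \\
&= \mathbb{E}_{x \sim p_{\bar d}(x)}\bigl[\log D_G^*(x)\bigr] + (1-\alpha)\,\mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D_G^*(G(z))\bigr)\bigr] + \alpha\,\mathbb{E}_{x \sim p_d(x)}\bigl[\log\bigl(1 - D_G^*(x)\bigr)\bigr] \\
&= \mathbb{E}_{x \sim p_{\bar d}(x)}\bigl[\log D_G^*(x)\bigr] + \mathbb{E}_{x \sim p^*(x)}\bigl[\log\bigl(1 - D_G^*(x)\bigr)\bigr] \\
&= \mathbb{E}_{x \sim p_{\bar d}(x)}\left[\log \frac{p_{\bar d}(x)}{p_{\bar d}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)}\right] + \mathbb{E}_{x \sim p^*(x)}\left[\log \frac{(1-\alpha)p_g(x) + \alpha p_d(x)}{p_{\bar d}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)}\right],
\end{aligned}
\tag{4.1}
\]
where $p^*(x) = (1-\alpha)p_g(x) + \alpha p_d(x)$ and the third equality holds because of the linearity of expectation.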
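The two substitutions used in (4.1) can be written out explicitly. The following worked step uses only the definitions above; the shorthand $f(x) = \log(1 - D_G^*(x))$ is introduced here for brevity.

```latex
\begin{align*}
1 - D_G^*(x)
  &= 1 - \frac{p_{\bar d}(x)}{p_{\bar d}(x) + p^*(x)}
   = \frac{p^*(x)}{p_{\bar d}(x) + p^*(x)}, \\
(1-\alpha)\,\mathbb{E}_{z \sim p_z}\bigl[f(G(z))\bigr] + \alpha\,\mathbb{E}_{x \sim p_d}\bigl[f(x)\bigr]
  &= \int_x \bigl((1-\alpha)p_g(x) + \alpha p_d(x)\bigr) f(x)\,dx
   = \mathbb{E}_{x \sim p^*}\bigl[f(x)\bigr].
\end{align*}
```

The second line is valid because $p^*$ is itself a probability density: it is a convex combination of $p_g$ and $p_d$ with weights $1-\alpha$ and $\alpha$.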
The results so far characterize the optimal $D$ for a fixed $G$, and (4.1) is the resulting objective.
The next step is to find the optimal $G$ with $D_G^*$ fixed.
Theorem 1. The global minimum of $C(G)$ is achieved if and only if $(1-\alpha)p_g(x) + \alpha p_d(x) = p_{\bar d}(x)$ for all $x$. At that point, $C(G)$ attains the value $-\log 4$.
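Proof. We start from
\[
\begin{aligned}
(4.1) &= -\log 4 + \mathbb{E}_{x \sim p_{\bar d}(x)}\left[\log \frac{2\,p_{\bar d}(x)}{p_{\bar d}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)}\right] + \mathbb{E}_{x \sim p^*(x)}\left[\log \frac{2\bigl((1-\alpha)p_g(x) + \alpha p_d(x)\bigr)}{p_{\bar d}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)}\right] \\
&= -\log 4 + \mathrm{KL}\left(p_{\bar d} \,\Big\|\, \frac{p_{\bar d} + (1-\alpha)p_g + \alpha p_d}{2}\right) + \mathrm{KL}\left((1-\alpha)p_g + \alpha p_d \,\Big\|\, \frac{p_{\bar d} + (1-\alpha)p_g + \alpha p_d}{2}\right) \\
&= -\log 4 + 2\,\mathrm{JSD}\bigl(p_{\bar d} \,\|\, (1-\alpha)p_g + \alpha p_d\bigr),
\end{aligned}
\]
where $p^*(x) = (1-\alpha)p_g(x) + \alpha p_d(x)$, KL is the Kullback-Leibler divergence, and JSD is the Jensen-Shannon divergence. The JSD attains its minimal value, 0, if and only if both distributions are the same, namely $p_{\bar d} = (1-\alpha)p_g + \alpha p_d$. Because $p_g(x)$ is always nonnegative, both distributions can be the same only if $\alpha p_d(x) \le p_{\bar d}(x)$ for all $x$. This completes the proof.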
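The identity in Theorem 1 can be verified on a discrete example. In this sketch, the distributions, `alpha`, and the helper names are assumptions chosen so that $\alpha p_d \le p_{\bar d}$ holds everywhere; natural logarithms are used throughout.

```python
import numpy as np

def plogp_over_q(p, q):
    """Sum of p * log(p / q) over entries with p > 0 (0 * log 0 is taken as 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    return 0.5 * plogp_over_q(p, m) + 0.5 * plogp_over_q(q, m)

def C(p_g, p_dbar, p_d, alpha):
    """C(G) from (4.1) with the optimal discriminator substituted in."""
    mix = (1.0 - alpha) * p_g + alpha * p_d
    total = p_dbar + mix
    return plogp_over_q(p_dbar, total) + plogp_over_q(mix, total)

alpha = 0.5
p_dbar = np.array([0.1, 0.2, 0.3, 0.4])
p_d = np.array([0.2, 0.4, 0.3, 0.1])      # alpha * p_d <= p_dbar everywhere

# Target generator distribution: approximately [0, 0, 0.3, 0.7].
p_g_star = (p_dbar - alpha * p_d) / (1.0 - alpha)
c_star = C(p_g_star, p_dbar, p_d, alpha)

# For an arbitrary generator distribution, C(G) = -log 4 + 2 * JSD.
rng = np.random.default_rng(0)
p_g_rand = rng.dirichlet(np.ones(4))
c_rand = C(p_g_rand, p_dbar, p_d, alpha)
mix_rand = (1.0 - alpha) * p_g_rand + alpha * p_d
identity_gap = abs(c_rand - (-np.log(4) + 2.0 * jsd(p_dbar, mix_rand)))
print(c_star, -np.log(4), c_rand, identity_gap)
```

Here `p_g_star` places no mass where $\alpha p_d = p_{\bar d}$, which is the discrete analogue of the generator avoiding the region already covered by $p_d$.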
Note that $(1-\alpha)p_g(x) + \alpha p_d(x) = p_{\bar d}(x)$ may not hold if $\alpha p_d(x) > p_{\bar d}(x)$ for some $x$. However, DSGAN still works because of two facts: i) given $D$, $V(G, D)$ is a convex function of $p_g$, and ii) due to $\int_x p_g(x)\,dx = 1$, the set of all feasible solutions of $p_g$ is convex. In other words, there always exists a global minimum of $V(G, D)$ given $D$, but it may not be $-\log 4$.
In the following, we show that, at the global minimum, the support of $p_g$ is contained in the difference between the supports of $p_{\bar d}$ and $p_d$, so that we can generate the desired $p_g$ by designing an appropriate $p_{\bar d}$.
Proposition 2. Suppose $\alpha p_d(x) \ge p_{\bar d}(x)$ for all $x \in \mathrm{Supp}(p_d)$ and all density functions $p_d(x)$, $p_{\bar d}(x)$, and $p_g(x)$ are continuous. If the global minimum of $C(G)$ is achieved, then
\[
\mathrm{Supp}(p_g) \subseteq \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d).
\]
Proof. Recall
\[
\begin{aligned}
C(G) &= \int_x \left[ p_{\bar d}(x) \log \frac{p_{\bar d}(x)}{p_{\bar d}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)} + p^*(x) \log \frac{(1-\alpha)p_g(x) + \alpha p_d(x)}{p_{\bar d}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)} \right] dx \\
&= \int_x S(p_g; x)\,dx \\
&= \int_{x \in \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)} S(p_g; x)\,dx + \int_{x \in \mathrm{Supp}(p_d)} S(p_g; x)\,dx,
\end{aligned}
\]
where $S(p_g; x)$ abbreviates the integrand. For any $x$, $S(p_g; x)$ is nonincreasing in $p_g(x)$ and $S(p_g; x) \le 0$ always holds. Specifically, $S(p_g; x)$ decreases as $p_g(x)$ increases if $p_{\bar d}(x) > 0$; $S(p_g; x)$ attains its maximum value, zero, for any $p_g(x)$ if $p_{\bar d}(x) = 0$. Since DSGAN aims to minimize $C(G)$ under the constraint $\int_x p_g(x)\,dx = 1$, the solution attaining the global minimum must satisfy $p_g(x) = 0$ if $p_{\bar d}(x) = 0$; otherwise, there exists another solution with a smaller value of $C(G)$. Thus, $\mathrm{Supp}(p_g) \subseteq \mathrm{Supp}(p_{\bar d})$.
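Furthermore, let
\[
T(p_g; x) = \frac{1}{1-\alpha}\,\frac{\partial S(p_g; x)}{\partial p_g(x)} = \log \frac{(1-\alpha)p_g(x) + \alpha p_d(x)}{p_{\bar d}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)},
\]
that is, the derivative of $S$ up to the positive constant factor $1-\alpha$. $T(p_g; x)$, which is expected to be as small as possible to minimize $C(G)$, is increasing in $p_g(x)$ and converges to 0. We now show that $T(p_g; x)$ for $x \in \mathrm{Supp}(p_{\bar d}) \cap \mathrm{Supp}(p_d)$ is always larger than that for $x \in \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)$ for all $p_g$. Specifically,
1. When $x \in \mathrm{Supp}(p_{\bar d}) \cap \mathrm{Supp}(p_d)$, $T(p_g; x) \ge \log\frac{1}{2}$ always holds due to the assumption $\alpha p_d(x) \ge p_{\bar d}(x)$.
2. When $x \in \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)$, $T(p_g; x) < \log\frac{1}{2}$ for all $p_g(x)$ satisfying $(1-\alpha)p_g(x) \le p_{\bar d}(x)$.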
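Thus, the minimizer prefers $p_g(x) > 0$ for $x \in \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)$ and $(1-\alpha)p_g(x) \le p_{\bar d}(x)$. We check whether there exists a solution $p_g$ such that $(1-\alpha)p_g(x) \le p_{\bar d}(x)$ and $\int_{x \in \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)} p_g(x)\,dx = 1$, implying $p_g(x) = 0$ for $x \in \mathrm{Supp}(p_{\bar d}) \cap \mathrm{Supp}(p_d)$.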
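Based on the following chain of implications,
\[
\begin{aligned}
& \int_{x \in \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)} p_{\bar d}(x)\,dx + \int_{x \in \mathrm{Supp}(p_d)} p_{\bar d}(x)\,dx = 1 \\
\Rightarrow\; & \int_{x \in \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)} p_{\bar d}(x)\,dx \ge 1 - \int_{x \in \mathrm{Supp}(p_d)} \alpha p_d(x)\,dx \\
\Rightarrow\; & \int_{x \in \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)} p_{\bar d}(x)\,dx \ge 1 - \alpha \\
\Rightarrow\; & \int_{x \in \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)} p_{\bar d}(x)\,dx \ge \int_{x \in \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)} (1-\alpha)p_g(x)\,dx,
\end{aligned}
\]
the last inequality implies that there must exist a feasible solution. This completes the proof.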
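Proposition 2 can be illustrated on a four-point example. This is a sketch under assumptions made here: the distributions and `alpha` are toy values satisfying $\alpha p_d \ge p_{\bar d}$ on $\mathrm{Supp}(p_d)$, and the exponentiated-gradient update is simply one convenient way to minimize $C$ over the probability simplex, not the training procedure of the thesis.

```python
import numpy as np

alpha = 0.8
p_dbar = np.array([0.1, 0.1, 0.4, 0.4])
p_d = np.array([0.5, 0.5, 0.0, 0.0])   # Supp(p_d) = first two points

def C(p_g):
    """C(G) with sums in place of integrals."""
    mix = (1.0 - alpha) * p_g + alpha * p_d
    total = p_dbar + mix
    return float(np.sum(p_dbar * np.log(p_dbar / total) + mix * np.log(mix / total)))

def T(p_g):
    """Log-ratio from the proof; the gradient of C w.r.t. p_g is (1 - alpha) * T."""
    mix = (1.0 - alpha) * p_g + alpha * p_d
    return np.log(mix / (p_dbar + mix))

# Exponentiated-gradient descent keeps p_g strictly positive and normalized.
p_g = np.full(4, 0.25)
for _ in range(500):
    p_g = p_g * np.exp(-(1.0 - alpha) * T(p_g))
    p_g = p_g / p_g.sum()

mass_on_supp_pd = p_g[:2].sum()
print(p_g, mass_on_supp_pd, C(p_g), -np.log(4))
```

The mass on $\mathrm{Supp}(p_d)$ shrinks toward zero, so the minimizer concentrates on $\mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)$. Since $\alpha p_d > p_{\bar d}$ on the first two points, the minimum of $C$ stays above $-\log 4$, in line with the remark following Theorem 1.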
In sum, the generator tends to output samples located in the high-density areas of $p_{\bar d} - \alpha p_d$.
Another concern is the convergence of Algorithm 1.
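Proposition 3. Suppose that, in Algorithm 1, the discriminator reaches its optimal value given $G$, and $p_g$ is updated by minimizing
\[
\mathbb{E}_{x \sim p_{\bar d}(x)}\bigl[\log D_G^*(x)\bigr] + \mathbb{E}_{x \sim p^*(x)}\bigl[\log\bigl(1 - D_G^*(x)\bigr)\bigr].
\]
If $G$ and $D$ have enough capacity, then $p_g$ converges to $\operatorname*{argmin}_{p_g} \mathrm{JSD}\bigl(p_{\bar d} \,\|\, (1-\alpha)p_g + \alpha p_d\bigr)$.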
Proof. Consider $V(G, D) = U(p_g, D)$ as a function of $p_g$. Following the proof idea of Proposition 2 in [14], if $f(x) = \sup_{\alpha \in A} f_\alpha(x)$ and $f_\alpha(x)$ is convex in $x$ for every $\alpha$, then $\partial f_\beta(x) \in \partial f$ if $\beta = \arg\sup_{\alpha \in A} f_\alpha(x)$. In other words, if $\sup_D V(G, D)$ is convex in $p_g$, the subderivatives of $\sup_D V(G, D)$ include the derivative of the function at the point where the maximum is attained, which implies convergence with sufficiently small updates of $p_g$. This completes the proof.
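The key step of this proof, that the derivative of $V$ at the optimal discriminator is a valid derivative of $\sup_D V$, can be checked numerically on a discrete example. The distributions, `alpha`, and the finite-difference step below are assumptions made for this sketch; $p_g$ is treated as a free positive vector when differentiating.

```python
import numpy as np

alpha = 0.5
p_dbar = np.array([0.1, 0.2, 0.3, 0.4])
p_d = np.array([0.4, 0.3, 0.2, 0.1])

def d_star(p_g):
    """Optimal discriminator from Proposition 1."""
    mix = (1.0 - alpha) * p_g + alpha * p_d
    return p_dbar / (p_dbar + mix)

def V(p_g, D):
    """V as a function of p_g and D, with sums in place of integrals."""
    mix = (1.0 - alpha) * p_g + alpha * p_d
    return float(np.sum(p_dbar * np.log(D) + mix * np.log(1.0 - D)))

def C(p_g):
    """sup over D of V, obtained by plugging in the optimal discriminator."""
    return V(p_g, d_star(p_g))

p_g = np.array([0.4, 0.1, 0.2, 0.3])

# Gradient of V w.r.t. p_g with D held fixed at D*.
grad_fixed_D = (1.0 - alpha) * np.log(1.0 - d_star(p_g))

# Central finite differences of C, where D* is recomputed for each perturbed p_g.
eps = 1e-6
grad_sup = np.zeros_like(p_g)
for i in range(len(p_g)):
    e = np.zeros_like(p_g)
    e[i] = eps
    grad_sup[i] = (C(p_g + e) - C(p_g - e)) / (2.0 * eps)

print(grad_fixed_D, grad_sup)
```

The two gradients agree up to finite-difference error, which is why updating $p_g$ against a discriminator held at its optimum also descends on $C(G)$.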
Chapter 5
Applications
DSGAN is applied to three problems: semi-supervised learning, robustness enhancement of deep networks, and novelty detection. For semi-supervised learning, DSGAN acts as a “bad generator,” which creates complement samples in the feature space of real data. For robustness enhancement, DSGAN generates adversarial examples located in the low-density areas of the training data. For novelty detection, DSGAN generates samples (unseen data) as boundary points around the training data.
5.1 Semi-Supervised Learning
Semi-supervised learning (SSL) is a learning paradigm that uses a small amount of labeled data together with a large amount of unlabeled data. Existing SSL methods based on generative models (e.g., VAE [31] and GAN [4]) obtain good empirical results. [5] theoretically shows that good semi-supervised learning requires a bad GAN with the objective function:
\[
\max_D \; \mathbb{E}_{x, y \sim L} \log P_D(y \mid x, y \le K) + \mathbb{E}_{x \sim p_d(x)} \log P_D(y \le K \mid x) + \mathbb{E}_{x \sim p_g(x)} \log P_D(K + 1 \mid x),
\tag{5.1}
\]
where $(x, y)$ denotes a pair of data and its corresponding label, $\{1, 2, \ldots, K\}$ denotes the label space for classification, and $L = \{(x, y)\}$ is the labeled dataset. Moreover, in the semi-supervised setting, $p_d$ in (5.1) is the distribution of unlabeled data. Note that the discriminator $D$ in GAN also plays the role of the classifier. If the generator distribution exactly matches the real data distribution (i.e., $p_g = p_d$), then the classifier trained by the objective function (5.1) with the unlabeled data cannot have better performance than that trained by supervised learning with the objective function:
\[
\max_D \; \mathbb{E}_{x, y \sim L} \log P_D(y \mid x, y \le K).
\tag{5.2}
\]
On the contrary, the generator is preferred to generate complement samples, which lie in the low-density area of $p_d$. Under some mild assumptions, those complement samples help $D$ learn correct decision boundaries in the low-density area because the probabilities of the true classes are forced to be low in out-of-distribution areas.
The complement samples in [5] are complicated to produce. We will demonstrate in Sec. 6 that DSGAN generates complement samples easily.
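The three terms of objective (5.1) can be sketched for a classifier that outputs $K + 1$ logits, where the last logit stands for the extra "generated" class. This is a minimal illustration under assumptions made here: the function names and the logit layout are hypothetical, expectations are replaced by batch means, and labels are 0-based.

```python
import numpy as np

def logsumexp(a, axis=-1):
    """Numerically stable log(sum(exp(a))) along the given axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def bad_gan_objective(logits_lab, labels, logits_unl, logits_gen):
    """Batch estimate of (5.1). Each logits array has shape (n, K + 1);
    the last column is the logit of the generated class K + 1."""
    K = logits_lab.shape[1] - 1
    # log P_D(y | x, y <= K): softmax restricted to the K real classes.
    real_lab = logits_lab[:, :K]
    term_lab = real_lab[np.arange(len(labels)), labels] - logsumexp(real_lab)
    # log P_D(y <= K | x): total probability assigned to the real classes.
    term_unl = logsumexp(logits_unl[:, :K]) - logsumexp(logits_unl)
    # log P_D(K + 1 | x): probability of the generated class.
    term_gen = logits_gen[:, K] - logsumexp(logits_gen)
    return float(term_lab.mean() + term_unl.mean() + term_gen.mean())

# With all-zero logits and K = 3 every class is equally likely.
zeros = np.zeros((5, 4))
labels = np.array([0, 1, 2, 0, 1])
obj_uniform = bad_gan_objective(zeros, labels, zeros, zeros)
print(obj_uniform)
```

For the uniform case the three terms are $-\log 3$, $\log\frac{3}{4}$, and $-\log 4$, so the objective equals $-2\log 4$; a classifier that separates real and generated samples pushes the value toward 0.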
5.2 Robustness Enhancement of Deep Networks
Deep neural networks have had a substantial impact on our daily lives. Neural networks, however, are vulnerable to adversarial examples, as evidenced in recent studies [32, 33]. Thus, there has been significant interest in how to enhance the robustness of neural networks. Unfortunately, if the adversary has full access to the network, namely in a white-box attack, a complete defense strategy has not yet been found.
In the research literature, it is common to see a well-trained deep neural model reach more than 90% accuracy on a classification task, and machines seem to beat humans at recognition nowadays. Nonetheless, machine vision is surprisingly fragile. For most correctly classified inputs, one can construct an adversarial example by adding a specific noise to the original input so that the prediction becomes completely wrong. Most of the time, the original image and its adversarial example are indistinguishable to humans, as in Fig. 5.1.
Figure 5.1: Demonstration of an adversarial example. Adding a special noise to the panda image changes the prediction of the model to “gibbon”. Moreover, the noise in the adversarial example is imperceptible to humans.
One can create an adversarial example through (5.3), where $p$ is typically 2 or $\infty$. Intuitively, optimizing the equation finds a perturbation of the input that makes the model less likely to predict the correct label.
\[
\max_{\|\delta\|_p \le \epsilon} \ell(x + \delta; y; C_\theta)
\tag{5.3}
\]
[34] surveyed the state-of-the-art defense strategies and showed that adversarial training [35] is more robust than other strategies. Given a trained classifier $C$ parameterized by $\theta$ and a loss function $\ell(x; y; C_\theta)$, adversarial training solves a min-max game, where the first step is to find adversarial examples within the $\epsilon$-ball that maximize the loss, and the second step is to train the model to minimize the loss.
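Specifically, the objective in [35] is
\[
\operatorname*{argmin}_{\theta} \; \mathbb{E}_{(x, y) \sim L} \left[ \max_{\delta \in [-\epsilon, \epsilon]^N} \ell(x + \delta; y; C_\theta) \right].
\tag{5.4}
\]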
The authors used projected gradient descent (PGD) to find adversarial examples by solving the inner maximization.
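The inner maximization in (5.4) can be sketched with PGD under an $\ell_\infty$ constraint. To keep the example self-contained, the classifier is assumed to be a hypothetical linear model with logistic loss, so that the input gradient is available in closed form; the weights, step size, and number of steps are illustrative values, not settings from [35].

```python
import numpy as np

def loss_and_grad(x, y, w, b):
    """Logistic loss of a linear classifier and its gradient w.r.t. the input x.
    The label y is +1 or -1."""
    margin = y * (w @ x + b)
    loss = np.log1p(np.exp(-margin))
    grad_x = -y * w / (1.0 + np.exp(margin))
    return loss, grad_x

def pgd_linf(x, y, w, b, eps, step, n_steps):
    """Signed-gradient ascent on the loss, projected onto the box [-eps, eps]^N."""
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        _, g = loss_and_grad(x + delta, y, w, b)
        delta = np.clip(delta + step * np.sign(g), -eps, eps)
    return delta

# Hypothetical classifier parameters and a single correctly classified input.
w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.5, -0.3, 0.2])
y = 1

eps = 0.1
delta = pgd_linf(x, y, w, b, eps=eps, step=0.025, n_steps=10)
loss_clean, _ = loss_and_grad(x, y, w, b)
loss_adv, _ = loss_and_grad(x + delta, y, w, b)
print(delta, loss_clean, loss_adv)
```

The clipping step is the projection onto the $\ell_\infty$ ball; for an $\ell_2$ ball it would be replaced by rescaling `delta` whenever its norm exceeds `eps`.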
Adversarial training requires a pretrained classifier to compute PGD, so the resulting adversarial examples may be effective only against specific classifiers. In contrast, DSGAN generates unseen data in the low-density area, which include adversarial examples, because [36] pointed out that adversarial examples are frequently located in the low-density area. In other words, the data generated by DSGAN are more universal but less effective against specific classifiers. In Sec. 6.2, we enhance the robustness of deep networks in a semi-supervised manner, using samples generated by DSGAN to fine-tune $C_\theta$. An $\epsilon$-ball in terms of $\ell_2$ or $\ell_\infty$ can be intuitively incorporated into the generation of adversarial examples.
5.3 Novelty Detection
Novelty detection determines whether a query example comes from a seen class. If the samples of the seen class are considered positive data, the difficulty is the absence of negative data in the training phase, so that supervised learning cannot work. To overcome this problem, one classical method is the One-Class SVM (OCSVM) [37], which requires only positive data as training input. However, OCSVM often suffers from the curse of dimensionality due to poor computational scalability.
Recently, novelty detection has made great progress with the advent of deep learning. [38][39] focus on learning a representative latent space for the seen class. At test time, the query image is projected onto the learned latent space, and the difference between the query image and its inverse image (reconstruction) is measured. In other words, all we need is to train an encoder for projection and a decoder for reconstruction.
In this setting, an autoencoder (AE) is usually adopted to learn both the encoder and the decoder [38][10]. Let $\mathrm{Enc}(\cdot)$ be the encoder and $\mathrm{Dec}(\cdot)$ be the decoder. The loss function of the AE is defined as:
\[
\min_{\mathrm{Enc}, \mathrm{Dec}} \; \mathbb{E}_{x \sim p_{pos}(x)} \left[ \| x - \mathrm{Dec}(\mathrm{Enc}(x)) \|_2^2 \right],
\tag{5.5}
\]
where $p_{pos}$ is the distribution of the seen class. After training, a query example $x_{test}$ is classified as the seen class if
\[
\| x_{test} - \mathrm{Dec}(\mathrm{Enc}(x_{test})) \|_2^2 \le \tau,
\tag{5.6}
\]