Appendix: More Results for Case Study

Additional results for boundary sample generation and difference set generation are presented in Fig. 3.7 and Fig. 3.8, respectively.

(a) α = 0.30 (b) α = 0.50

(c) α = 0.80 (d) α = 0.95

Figure 3.6: Influence of α on the synthetic dataset. In this example, p_d is the orange rectangle (bounded by x ≤ 1, x ≥ −1, y ≥ −0.8 and y ≤ 0.8), and p_{\bar{d}} is the same rectangle shifted right by 1 unit (not shown in the figures). We can observe that p_g (green points) moves farther away from p_d as α increases. When α is 0.5, p_g learns exactly the difference between p_{\bar{d}} and p_d. When α is 0.95, p_g generates the rightmost points of p_{\bar{d}}. The contours show the output of the discriminator; the generator moves toward regions with higher scores. Note that the outputs of the discriminator are not restricted to [0, 1], because we use the WGAN structure in this experiment.

(a) S shape with compact data points. α = 0.9. (b) S shape with scattering data points. α = 0.9.

(c) 4 Gaussians. α = 0.9. (d) 8 Gaussians. α = 0.8.

Figure 3.7: Extra 2D results for boundary sample generation. The orange points are data points, and the green points are generated points.

Figure 3.8: Difference set generation for the CelebA dataset [1]. p_{\bar{d}} consists of 20000 images from CelebA, of which 1000 show people with glasses and the rest show people without glasses. p_d consists of 19000 images, all of people without glasses. In this case, our generator successfully learns to produce images of people with glasses. Note that α = 0.95.

Chapter 4 Theoretical Results

There are two assumptions for the subsequent proofs. First, in a nonparametric setting, we assume that both the generator and the discriminator have infinite capacity. Second, p_g is defined as the distribution of the samples drawn from G(z) under z ∼ p_z. We will first derive the optimal discriminator for a given G and then show that minimizing V(G, D) over G given the optimal discriminator is equivalent to minimizing the Jensen-Shannon divergence between (1 − α)p_g + αp_d and p_{\bar{d}}.

Proposition 1. For fixed G, the optimal discriminator D is

$$D^*_G(x) = \frac{p_{\bar{d}}(x)}{p_{\bar{d}}(x) + (1-\alpha)\,p_g(x) + \alpha\,p_d(x)}.$$

Proof. Given any generator G, the training criterion for the discriminator D is to maximize the quantity V (G, D):
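The criterion can be written out in the style of the standard GAN value function. The following is a sketch under the assumption that generated samples and samples from p_d both play the role of the "fake" class, weighted by (1 − α) and α respectively, which is the form consistent with Proposition 1:

```latex
\begin{aligned}
V(G, D) &= \mathbb{E}_{x \sim p_{\bar{d}}(x)}[\log D(x)]
         + (1-\alpha)\,\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
         + \alpha\,\mathbb{E}_{x \sim p_d(x)}[\log(1 - D(x))] \\
        &= \int_x p_{\bar{d}}(x)\log D(x)
         + \big((1-\alpha)\,p_g(x) + \alpha\,p_d(x)\big)\log(1 - D(x))\,dx .
\end{aligned}
```

Pointwise, the integrand has the form a log(y) + b log(1 − y) with a = p_{\bar{d}}(x) and b = (1 − α)p_g(x) + αp_d(x).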

For any (a, b) ∈ R² \ {(0, 0)}, the function a log(y) + b log(1 − y) achieves its maximum in [0, 1] at y = a/(a + b). The discriminator only needs to be defined within Supp(p_{\bar{d}}) ∪ Supp(p_d) ∪ Supp(p_g). This completes the proof.
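The pointwise argument in this proof can be checked numerically. The sketch below is illustrative only: the density values at the single point x and the function names are hypothetical, and the grid search merely stands in for the analytic maximization.

```python
import math

def pointwise_value(a: float, b: float, y: float) -> float:
    """a*log(y) + b*log(1 - y): the integrand of V(G, D) at one point x."""
    return a * math.log(y) + b * math.log(1.0 - y)

def optimal_discriminator(p_dbar: float, p_d: float, p_g: float, alpha: float) -> float:
    """D*_G(x) from Proposition 1, computed from density values at one point x."""
    mixture = (1.0 - alpha) * p_g + alpha * p_d
    return p_dbar / (p_dbar + mixture)

# Hypothetical density values at a single point x (not taken from the source).
p_dbar, p_d, p_g, alpha = 0.6, 0.3, 0.2, 0.5
a = p_dbar
b = (1.0 - alpha) * p_g + alpha * p_d
d_star = optimal_discriminator(p_dbar, p_d, p_g, alpha)

# Brute-force search over a grid of y in (0, 1) for comparison with a / (a + b).
grid = [i / 1000 for i in range(1, 1000)]
y_best = max(grid, key=lambda y: pointwise_value(a, b, y))
```

Here `y_best` should agree with `d_star` up to the grid resolution.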

Moreover, D can be considered to discriminate between samples from p_{\bar{d}} and those from (1 − α)p_g(x) + αp_d(x). By substituting the optimal discriminator into V(G, D), we obtain the criterion C(G) = V(G, D^*_G).
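Substituting D^*_G into V(G, D) yields a criterion that can be expressed through the Jensen-Shannon divergence, as in the standard GAN analysis. The following is a sketch of that computation; the shorthand m(x) = (1 − α)p_g(x) + αp_d(x) is introduced here for brevity and is an assumption of this sketch:

```latex
\begin{aligned}
C(G) &= \max_D V(G, D) = V(G, D^*_G) \\
     &= \mathbb{E}_{x \sim p_{\bar{d}}}\!\left[\log \frac{p_{\bar{d}}(x)}{p_{\bar{d}}(x) + m(x)}\right]
      + \mathbb{E}_{x \sim m}\!\left[\log \frac{m(x)}{p_{\bar{d}}(x) + m(x)}\right] \\
     &= -\log 4
      + \mathrm{KL}\!\left(p_{\bar{d}} \,\Big\|\, \tfrac{p_{\bar{d}} + m}{2}\right)
      + \mathrm{KL}\!\left(m \,\Big\|\, \tfrac{p_{\bar{d}} + m}{2}\right) \\
     &= -\log 4 + 2\,\mathrm{JSD}\!\left(p_{\bar{d}} \,\|\, (1-\alpha)\,p_g + \alpha\,p_d\right).
\end{aligned}
```

The third line uses the fact that both p_{\bar{d}} and m integrate to 1.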

The results so far give the optimal D for a fixed G in (4.1). The next step is to find the optimal G with D^*_G fixed.

Theorem 1. The global minimum of C(G) is achieved if and only if (1 − α)p_g(x) + αp_d(x) = p_{\bar{d}}(x) for all x. At that point, C(G) achieves the value −log 4.

Proof. We start from the Jensen-Shannon divergence (JSD). The JSD attains its minimal value, 0, iff both distributions are the same, namely p_{\bar{d}} = (1 − α)p_g + αp_d. Because p_g(x) is always nonnegative, both distributions can be the same only if αp_d(x) ≤ p_{\bar{d}}(x) for all x. This completes the proof.
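Theorem 1 can be illustrated on a finite set, where sums replace integrals. The sketch below uses hypothetical toy distributions on three points and assumes C(G) is the value of V(G, D) at the optimal discriminator of Proposition 1; the helper names are not from the source.

```python
import math

def xlog_ratio(a: float, total: float) -> float:
    """a * log(a / total), with the convention 0 * log(0) = 0."""
    return a * math.log(a / total) if a > 0.0 else 0.0

def criterion(p_dbar, p_d, p_g, alpha):
    """C(G) = V(G, D*_G) for distributions on a finite set."""
    value = 0.0
    for pb, pd, pg in zip(p_dbar, p_d, p_g):
        m = (1.0 - alpha) * pg + alpha * pd  # mixture seen as "fake" by D
        value += xlog_ratio(pb, pb + m) + xlog_ratio(m, pb + m)
    return value

# Hypothetical toy distributions (alpha * p_d <= p_dbar holds everywhere).
alpha = 0.5
p_dbar = [0.5, 0.3, 0.2]
p_d = [0.6, 0.4, 0.0]

# Solve (1 - alpha) * p_g + alpha * p_d = p_dbar for p_g.
p_g_star = [(pb - alpha * pd) / (1.0 - alpha) for pb, pd in zip(p_dbar, p_d)]
p_g_other = [1 / 3, 1 / 3, 1 / 3]  # any other generator distribution

c_star = criterion(p_dbar, p_d, p_g_star, alpha)
c_other = criterion(p_dbar, p_d, p_g_other, alpha)
```

In this example `c_star` matches −log 4 up to rounding, while `c_other` is strictly larger.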

Note that (1 − α)p_g(x) + αp_d(x) = p_{\bar{d}}(x) may not hold if αp_d(x) > p_{\bar{d}}(x). However, DSGAN still works because of two facts: i) given D, V(G, D) is a convex function of p_g, and ii) due to the constraint ∫ p_g(x) dx = 1, the set of all feasible solutions p_g is convex. In other words, there always exists a global minimum of V(G, D) given D, but it may not be −log 4.
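The case αp_d(x) > p_{\bar{d}}(x) can be illustrated with a two-point example. This is a sketch with hypothetical distributions; it evaluates the criterion at the optimal discriminator and searches a grid of generator distributions, rather than running DSGAN itself.

```python
import math

def xlog_ratio(a: float, total: float) -> float:
    """a * log(a / total), with the convention 0 * log(0) = 0."""
    return a * math.log(a / total) if a > 0.0 else 0.0

def criterion(p_dbar, p_d, p_g, alpha):
    """Value of V(G, D) at the optimal discriminator, on a finite set."""
    value = 0.0
    for pb, pd, pg in zip(p_dbar, p_d, p_g):
        m = (1.0 - alpha) * pg + alpha * pd
        value += xlog_ratio(pb, pb + m) + xlog_ratio(m, pb + m)
    return value

alpha = 0.5
p_dbar = [0.2, 0.8]
p_d = [1.0, 0.0]  # alpha * p_d[0] = 0.5 exceeds p_dbar[0] = 0.2

# All generator distributions [t, 1 - t] on a grid of step 0.01.
candidates = [[t / 100, 1 - t / 100] for t in range(101)]
values = [criterion(p_dbar, p_d, p_g, alpha) for p_g in candidates]
best_index = min(range(len(values)), key=values.__getitem__)
best_p_g, best_value = candidates[best_index], values[best_index]
```

In this example a minimizer still exists and places all generator mass on the point outside Supp(p_d), but the minimum value stays above −log 4.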

In the following, we show that, at the global minimum, the support of p_g is contained in the difference between the supports of p_{\bar{d}} and p_d, so that we can generate the desired p_g by designing an appropriate p_{\bar{d}}.

Proposition 2. Suppose αp_d(x) ≥ p_{\bar{d}}(x) for all x ∈ Supp(p_d), and all density functions p_d(x), p_{\bar{d}}(x) and p_g(x) are continuous. If the global minimum of C(G) is achieved, then

Supp(p_g) ⊆ Supp(p_{\bar{d}}) − Supp(p_d).

Proof. Recall that S(p_g; x) is nonincreasing in p_g(x) and S(p_g; x) ≤ 0 always holds. Specifically, S(p_g; x) decreases as p_g(x) increases if p_{\bar{d}}(x) > 0; S(p_g; x) attains its maximum value, zero, for any p_g(x) if p_{\bar{d}}(x) = 0. Since DSGAN aims to minimize C(G) under the constraint ∫ p_g(x) dx = 1, the solution attaining the global minimum must satisfy p_g(x) = 0 if p_{\bar{d}}(x) = 0; otherwise, there exists another solution with a smaller value of C(G). Thus, Supp(p_g) ⊆ Supp(p_{\bar{d}}).

Furthermore, T(p_g; x) = ∂S(p_g; x)/∂p_g(x), which is expected to be as small as possible to minimize C(G), is increasing in p_g(x) and converges to 0. Then, we show that T(p_g; x) for x ∈ Supp(p_{\bar{d}}) ∩ Supp(p_d) is always larger than that for x ∈ Supp(p_{\bar{d}}) − Supp(p_d) for all p_g. Specifically,

Based on the following expression,

the last inequality implies that there must exist a feasible solution. This completes the proof.
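As a sketch, one form of S and T that is consistent with the properties used in this proof is given below. It assumes that C(G) = ∫ S(p_g; x) dx and uses the shorthand m(x) = (1 − α)p_g(x) + αp_d(x); both are assumptions of this sketch rather than statements taken from the text.

```latex
\begin{aligned}
S(p_g; x) &= p_{\bar{d}}(x)\log\frac{p_{\bar{d}}(x)}{p_{\bar{d}}(x) + m(x)}
           + m(x)\log\frac{m(x)}{p_{\bar{d}}(x) + m(x)}, \\
T(p_g; x) &= \frac{\partial S(p_g; x)}{\partial p_g(x)}
           = (1-\alpha)\log\frac{m(x)}{p_{\bar{d}}(x) + m(x)} \le 0 .
\end{aligned}
```

Under this form, S and T both vanish when p_{\bar{d}}(x) = 0, and when p_{\bar{d}}(x) > 0, T is negative, increases with p_g(x), and tends to 0 as p_g(x) grows.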

In sum, the generator tends to output samples located in high-density areas of p_{\bar{d}} − αp_d.
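The support statement of Proposition 2 can be illustrated by brute force on a three-point set. The sketch below uses hypothetical distributions that satisfy αp_d(x) ≥ p_{\bar{d}}(x) on Supp(p_d), and a coarse grid over the probability simplex in place of actual training.

```python
import math

def xlog_ratio(a: float, total: float) -> float:
    """a * log(a / total), with the convention 0 * log(0) = 0."""
    return a * math.log(a / total) if a > 0.0 else 0.0

def criterion(p_dbar, p_d, p_g, alpha):
    """C(G) on a finite set, evaluated at the optimal discriminator."""
    value = 0.0
    for pb, pd, pg in zip(p_dbar, p_d, p_g):
        m = (1.0 - alpha) * pg + alpha * pd
        value += xlog_ratio(pb, pb + m) + xlog_ratio(m, pb + m)
    return value

alpha = 0.5
p_dbar = [0.5, 0.5, 0.0]  # Supp(p_dbar) = {0, 1}
p_d = [1.0, 0.0, 0.0]     # Supp(p_d) = {0}; alpha * p_d[0] >= p_dbar[0]

# Exhaustive search over p_g on a simplex grid with step 1/20.
n = 20
best_value, best_p_g = float("inf"), None
for i in range(n + 1):
    for j in range(n + 1 - i):
        p_g = [i / n, j / n, (n - i - j) / n]
        value = criterion(p_dbar, p_d, p_g, alpha)
        if value < best_value:
            best_value, best_p_g = value, p_g

# Indices where the best generator distribution has positive mass.
support = [k for k, mass in enumerate(best_p_g) if mass > 0.0]
```

In this example the minimizer puts all of its mass on index 1, which is exactly Supp(p_{\bar{d}}) − Supp(p_d).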

Another concern is the convergence of Algorithm 1.

Proposition 3. Suppose the discriminator reaches its optimal value given G in Algorithm 1, and p_g is updated by minimizing

$$\mathbb{E}_{x \sim p_{\bar{d}}(x)}[\log D^*_G(x)] + \mathbb{E}_{x \sim p^*(x)}[\log(1 - D^*_G(x))].$$

If G and D have enough capacity, then p_g converges to

$$\arg\min_{p_g} \mathrm{JSD}\left(p_{\bar{d}} \,\|\, (1-\alpha)\,p_g + \alpha\,p_d\right).$$

Proof. Consider V(G, D) = U(p_g, D) as a function of p_g. By the proof idea of Proposition 2 in [14], if f(x) = sup_{α∈A} f_α(x) and f_α(x) is convex in x for every α, then ∂f_β(x) ∈ ∂f if β = arg sup_{α∈A} f_α(x). In other words, if sup_D V(G, D) is convex in p_g, the subderivatives of sup_D V(G, D) include the derivative of the function at the point where the maximum is attained, implying convergence with sufficiently small updates of p_g. This completes the proof.
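The alternating scheme behind this argument can be sketched on a finite set. The code below is a toy illustration and not Algorithm 1 itself: it plugs in the closed-form D^*_G at every step and updates p_g with an exponentiated-gradient step, a technique chosen here only to keep p_g on the probability simplex. The distributions, step size, and function names are assumptions, and p_g is assumed to start strictly positive.

```python
import math

def optimal_one_minus_d(p_dbar, p_d, p_g, alpha):
    """1 - D*_G(x) on a finite set, with D*_G taken from Proposition 1."""
    out = []
    for pb, pd, pg in zip(p_dbar, p_d, p_g):
        m = (1.0 - alpha) * pg + alpha * pd
        # Where both densities vanish, D is unconstrained; use 1 - D = 1.
        out.append(m / (pb + m) if pb + m > 0.0 else 1.0)
    return out

def generator_step(p_dbar, p_d, p_g, alpha, lr):
    """One exponentiated-gradient step on p_g with D held at D*_G."""
    one_minus_d = optimal_one_minus_d(p_dbar, p_d, p_g, alpha)
    # For fixed D, dU/dp_g(x) = (1 - alpha) * log(1 - D(x)).
    grads = [(1.0 - alpha) * math.log(v) for v in one_minus_d]
    weights = [pg * math.exp(-lr * g) for pg, g in zip(p_g, grads)]
    total = sum(weights)
    return [w / total for w in weights]  # renormalize onto the simplex

alpha = 0.5
p_dbar = [0.5, 0.5, 0.0]
p_d = [1.0, 0.0, 0.0]
p_g = [1 / 3, 1 / 3, 1 / 3]  # strictly positive starting point
for _ in range(500):
    p_g = generator_step(p_dbar, p_d, p_g, alpha, lr=1.0)
```

In this example the JSD minimizer is p_g = [0, 1, 0], and the iterates move toward it, with the mass on index 0 shrinking only slowly.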

Chapter 5 Applications

DSGAN has been applied to three problems: semi-supervised learning, robustness enhancement of deep networks, and novelty detection. For semi-supervised learning, DSGAN acts as a “bad generator,” which creates complement samples in the feature space of real data. For robustness enhancement, DSGAN generates adversarial examples located in the low-density areas of the training data. For novelty detection, DSGAN generates samples (unseen data) as boundary points around the training data.

5.1 Semi-Supervised Learning

Semi-supervised learning (SSL) is a learning paradigm that uses a small amount of labeled data together with a large amount of unlabeled data. Existing SSL methods based on generative models (e.g., VAE [31] and GAN [4]) obtain good empirical results. [5] theoretically shows that good semi-supervised learning requires a bad GAN with the objective function:

$$\max_D \; \mathbb{E}_{x,y \sim L} \log P_D(y \mid x, y \le K) + \mathbb{E}_{x \sim p_d(x)} \log P_D(y \le K \mid x) + \mathbb{E}_{x \sim p_g(x)} \log P_D(K+1 \mid x), \tag{5.1}$$

where (x, y) denotes a data point and its corresponding label, {1, 2, . . . , K} denotes the label space for classification, and L = {(x, y)} is the labeled dataset. Moreover, in the semi-supervised setting, p_d in (5.1) is the distribution of the unlabeled data. Note that the discriminator D in the GAN also plays the role of the classifier. If the generator distribution exactly matches the real data distribution (i.e., p_g = p_d), then the classifier trained with the objective function (5.1) and the unlabeled data cannot perform better than one trained by supervised learning with the objective function:

$$\max_D \; \mathbb{E}_{x,y \sim L} \log P_D(y \mid x, y \le K). \tag{5.2}$$

In contrast, the generator should generate complement samples, which lie in the low-density areas of p_d. Under some mild assumptions, those complement samples help D learn correct decision boundaries in the low-density areas because the probabilities of the true classes are forced to be low in out-of-distribution areas.
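The three per-sample terms of objective (5.1) can be sketched for a classifier that outputs K + 1 logits. This is a minimal illustration under stated assumptions: the last logit is taken to be the extra "generated" class, labels are 0-based, the expectations would be approximated by averaging over batches (not shown), and the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def labeled_term(logits, y, K):
    """log P_D(y | x, y <= K): softmax restricted to the first K classes."""
    return log_softmax(logits[:K])[y]

def unlabeled_term(logits, K):
    """log P_D(y <= K | x) = log(1 - P_D(K + 1 | x))."""
    return math.log(1.0 - math.exp(log_softmax(logits)[K]))

def generated_term(logits, K):
    """log P_D(K + 1 | x): log-probability of the extra class."""
    return log_softmax(logits)[K]

# Example: K = 2 real classes plus one extra class, uniform logits.
K = 2
logits = [0.0, 0.0, 0.0]
terms = (
    labeled_term(logits, 0, K),   # log(1/2)
    unlabeled_term(logits, K),    # log(2/3)
    generated_term(logits, K),    # log(1/3)
)
```

Maximizing the sum of these terms pushes generated samples into class K + 1 while keeping unlabeled real data within the first K classes.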

The complement samples in [5] are complicated to produce. We will demonstrate in Sec. 6 that DSGAN can easily generate complement samples.
