
Appendix: More Results for Case Study

Additional results for boundary sample generation and difference set generation are presented in Fig. 3.7 and Fig. 3.8, respectively.

(a) α = 0.30 (b) α = 0.50

(c) α = 0.80 (d) α = 0.95

Figure 3.6: Influence of $\alpha$ on the synthetic dataset. In this example, $p_d$ is the orange rectangle (bounded by $-1 \le x \le 1$ and $-0.8 \le y \le 0.8$), and $p_{\bar d}$ is the rectangle obtained by shifting $p_d$ right by 1 unit (not shown in the figures). We observe that $p_g$ moves farther away from $p_d$ (green points) as $\alpha$ increases. When $\alpha$ is 0.5, $p_g$ learns exactly the difference between $p_{\bar d}$ and $p_d$. When $\alpha$ is 0.95, $p_g$ generates the rightmost points of $p_{\bar d}$. The contours show the output of the discriminator; the generator moves toward regions with higher scores. Note that the outputs of the discriminator are not restricted to $[0, 1]$, because we use the WGAN architecture in this experiment.

(a) S shape with compact data points. α = 0.9. (b) S shape with scattering data points. α = 0.9.

(c) 4 Gaussians. α = 0.9. (d) 8 Gaussians. α = 0.8.

Figure 3.7: Extra 2D results for boundary sample generation. The orange points are data points, and the green points are generated points.

Figure 3.8: Difference set generation for the CelebA dataset ([1]). $p_{\bar d}$ consists of 20000 images from the CelebA dataset, of which 1000 show people wearing glasses and the rest show people without glasses. $p_d$ consists of 19000 images, all of people without glasses. In this case, our generator successfully learns to produce images of people wearing glasses. Note that $\alpha = 0.95$.

Chapter 4

Theoretical Results

Two assumptions underlie the subsequent proofs. First, in a nonparametric setting, we assume that both the generator and the discriminator have infinite capacity. Second, $p_g$ is defined as the distribution of the samples drawn from $G(z)$ under $z \sim p_z$. We first derive the optimal discriminator for a given $G$ and then show that minimizing $V(G, D)$ over $G$ given the optimal discriminator is equivalent to minimizing the Jensen-Shannon divergence between $(1-\alpha)p_g + \alpha p_d$ and $p_{\bar d}$.

Proposition 1. For $G$ being fixed, the optimal discriminator $D$ is

\[ D_G(x) = \frac{p_{\bar d}(x)}{p_{\bar d}(x) + (1-\alpha)p_g(x) + \alpha p_d(x)}. \]

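Proof. Given any generator $G$, the training criterion for the discriminator $D$ is to maximize the quantity $V(G, D)$:

\[ V(G, D) = \int_x p_{\bar d}(x)\log D(x) + \big((1-\alpha)p_g(x) + \alpha p_d(x)\big)\log\big(1 - D(x)\big)\,dx. \]

For any $(a, b) \in \mathbb{R}^2 \setminus \{(0, 0)\}$, the function $y \mapsto a\log(y) + b\log(1-y)$ achieves its maximum in $[0, 1]$ at $y = \frac{a}{a+b}$. The discriminator only needs to be defined within $\mathrm{Supp}(p_{\bar d}) \cup \mathrm{Supp}(p_d) \cup \mathrm{Supp}(p_g)$. This completes the proof.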
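As a quick numerical sanity check of the pointwise argument in the proof, the sketch below maximizes $a\log(y) + b\log(1-y)$ over a grid and compares the maximizer with $a/(a+b)$. The function names and the density values at the single point $x$ are hypothetical, chosen only for illustration; here $a$ plays the role of $p_{\bar d}(x)$ and $b$ the role of $(1-\alpha)p_g(x) + \alpha p_d(x)$.

```python
import math


def pointwise_objective(a, b, y):
    """a*log(y) + b*log(1-y), the pointwise form used in the proof."""
    return a * math.log(y) + b * math.log(1.0 - y)


def grid_argmax(a, b, steps=10_000):
    """Maximize the pointwise objective over a grid inside (0, 1)."""
    best_y, best_val = None, -math.inf
    for i in range(1, steps):
        y = i / steps
        val = pointwise_objective(a, b, y)
        if val > best_val:
            best_y, best_val = y, val
    return best_y


def optimal_discriminator(p_dbar, p_g, p_d, alpha):
    """Closed form of Proposition 1 evaluated at a single point x."""
    a = p_dbar
    b = (1.0 - alpha) * p_g + alpha * p_d
    return a / (a + b)


# Hypothetical density values at one point x (assumed for illustration).
p_dbar, p_g, p_d, alpha = 0.6, 0.3, 0.1, 0.5
a = p_dbar
b = (1.0 - alpha) * p_g + alpha * p_d

y_grid = grid_argmax(a, b)                                  # numerical maximizer
y_closed = optimal_discriminator(p_dbar, p_g, p_d, alpha)   # a / (a + b)
```

With these values the grid maximizer and the closed form agree up to the grid resolution.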

Moreover, $D$ can be regarded as discriminating between samples from $p_{\bar d}$ and samples from the mixture $(1-\alpha)p_g + \alpha p_d$. Substituting the optimal discriminator into $V(G, D)$, we obtain the criterion $C(G) = \max_D V(G, D)$.

The results so far give the optimal $D$ for a fixed $G$ in (4.1). The next step is to find the optimal $G$ with $D_G$ fixed.

Theorem 1. The global minimum of $C(G)$ is achieved if and only if $(1-\alpha)p_g(x) + \alpha p_d(x) = p_{\bar d}(x)$ for all $x$. At that point, $C(G)$ achieves the value $-\log 4$.

Proof. We start from the Jensen-Shannon divergence (JSD). The JSD attains its minimal value, 0, if and only if both distributions are the same, namely $p_{\bar d} = (1-\alpha)p_g + \alpha p_d$. Because $p_g(x)$ is always non-negative, the two distributions can be the same only if $\alpha p_d(x) \le p_{\bar d}(x)$ for all $x$. This completes the proof.
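The link between $C(G)$ and the Jensen-Shannon divergence can be sketched as follows. This is a reconstruction, not the original derivation: it assumes that $C(G) = \max_D V(G, D)$ and that $V$ has the standard GAN form with the mixture in place of the generator distribution. The shorthand $p_m = (1-\alpha)p_g + \alpha p_d$ is introduced here only for readability.

```latex
\begin{align*}
C(G)
  &= \mathbb{E}_{x \sim p_{\bar d}}\!\left[\log \frac{p_{\bar d}(x)}{p_{\bar d}(x) + p_m(x)}\right]
   + \mathbb{E}_{x \sim p_m}\!\left[\log \frac{p_m(x)}{p_{\bar d}(x) + p_m(x)}\right] \\
  &= -\log 4
   + \mathrm{KL}\!\left(p_{\bar d} \,\middle\|\, \frac{p_{\bar d} + p_m}{2}\right)
   + \mathrm{KL}\!\left(p_m \,\middle\|\, \frac{p_{\bar d} + p_m}{2}\right) \\
  &= -\log 4 + 2\,\mathrm{JSD}\!\left(p_{\bar d} \,\|\, p_m\right).
\end{align*}
```

Since the JSD is non-negative and vanishes only when its two arguments coincide, the value $-\log 4$ is reached exactly when $p_{\bar d} = p_m$.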

Note that $(1-\alpha)p_g(x) + \alpha p_d(x) = p_{\bar d}(x)$ may not hold if $\alpha p_d(x) > p_{\bar d}(x)$. However, DSGAN still works, based on two facts: i) given $D$, $V(G, D)$ is a convex function of $p_g$, and ii) due to $\int_x p_g(x)\,dx = 1$, the set of all feasible solutions $p_g$ is convex. In other words, there always exists a global minimum of $V(G, D)$ given $D$, but it may not be $-\log 4$.
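To make the feasibility condition concrete, the following sketch works on a finite sample space, which is an assumption made purely for illustration (the text treats continuous densities). The helper `exact_generator` solves $(1-\alpha)p_g + \alpha p_d = p_{\bar d}$ for $p_g$ and reports failure when some entry would be negative; `criterion` evaluates $C(G)$ in its closed form under the optimal discriminator. Both function names and all probability values are hypothetical.

```python
import math


def exact_generator(p_dbar, p_d, alpha):
    """Solve (1-alpha)*p_g + alpha*p_d = p_dbar for p_g on a finite set.

    Returns None when some entry would be negative, i.e. when
    alpha*p_d(x) > p_dbar(x) at some point x.
    """
    p_g = [(a - alpha * d) / (1.0 - alpha) for a, d in zip(p_dbar, p_d)]
    if any(v < -1e-12 for v in p_g):
        return None
    return [max(v, 0.0) for v in p_g]


def criterion(p_dbar, p_g, p_d, alpha):
    """C(G) for discrete distributions, with the convention 0*log(0) = 0."""
    total = 0.0
    for a, g, d in zip(p_dbar, p_g, p_d):
        b = (1.0 - alpha) * g + alpha * d
        if a > 0:
            total += a * math.log(a / (a + b))
        if b > 0:
            total += b * math.log(b / (a + b))
    return total


# Hypothetical three-point example.
p_dbar = [0.25, 0.25, 0.5]
p_d = [0.5, 0.5, 0.0]

# alpha = 0.5: alpha*p_d <= p_dbar everywhere, so an exact solution exists.
p_g_exact = exact_generator(p_dbar, p_d, 0.5)
c_exact = criterion(p_dbar, p_g_exact, p_d, 0.5)

# alpha = 0.8: alpha*p_d exceeds p_dbar on the first two points.
p_g_infeasible = exact_generator(p_dbar, p_d, 0.8)
c_above = criterion(p_dbar, [0.0, 0.0, 1.0], p_d, 0.8)
```

In the feasible case the criterion equals $-\log 4$; in the infeasible case the value for the candidate $p_g$ shown stays strictly above $-\log 4$.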

In the following, we show that, at the global minimum, the support set of $p_g$ is contained within the difference between the support sets of $p_{\bar d}$ and $p_d$, so that we can generate the desired $p_g$ by designing an appropriate $p_{\bar d}$.

Proposition 2. Suppose $\alpha p_d(x) \ge p_{\bar d}(x)$ for all $x \in \mathrm{Supp}(p_d)$ and all density functions $p_d(x)$, $p_{\bar d}(x)$ and $p_g(x)$ are continuous. If the global minimum of $C(G)$ is achieved, then

\[ \mathrm{Supp}(p_g) \subseteq \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d). \]

Proof. Recall that $S(p_g; x)$ is non-increasing in $p_g(x)$ and that $S(p_g; x) \le 0$ always holds. Specifically, $S(p_g; x)$ decreases as $p_g(x)$ increases if $p_{\bar d}(x) > 0$; $S(p_g; x)$ attains its maximum value, zero, for any $p_g(x)$ if $p_{\bar d}(x) = 0$. Since DSGAN aims to minimize $C(G)$ under the constraint $\int_x p_g(x)\,dx = 1$, the solution attaining the global minimum must satisfy $p_g(x) = 0$ if $p_{\bar d}(x) = 0$; otherwise, there exists another solution with a smaller value of $C(G)$. Thus, $\mathrm{Supp}(p_g) \subseteq \mathrm{Supp}(p_{\bar d})$.

Furthermore, $T(p_g; x) = \frac{\partial S(p_g; x)}{\partial p_g(x)}$, which is expected to be as small as possible to minimize $C(G)$, is increasing in $p_g(x)$ and converges to 0. Then, we show that $T(p_g; x)$ for $x \in \mathrm{Supp}(p_{\bar d}) \cap \mathrm{Supp}(p_d)$ is always larger than that for $x \in \mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)$ for all $p_g$. Specifically,

Based on the following expression,

the last inequality implies that there must exist a feasible solution. This completes the proof.
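The statement of Proposition 2 can be illustrated on a small finite example. The sketch below is only a brute-force check under assumed values, not the argument of the proof: it uses a hypothetical three-point sample space in which the first point lies in $\mathrm{Supp}(p_{\bar d}) \cap \mathrm{Supp}(p_d)$, the second in $\mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)$, and the third outside $\mathrm{Supp}(p_{\bar d})$. It then minimizes $C(G)$ over a grid on the probability simplex. The names `criterion` and `simplex_grid` are illustrative.

```python
import math


def criterion(p_dbar, p_g, p_d, alpha):
    """C(G) for discrete distributions, with the convention 0*log(0) = 0."""
    total = 0.0
    for a, g, d in zip(p_dbar, p_g, p_d):
        b = (1.0 - alpha) * g + alpha * d
        if a > 0:
            total += a * math.log(a / (a + b))
        if b > 0:
            total += b * math.log(b / (a + b))
    return total


def simplex_grid(n):
    """All 3-point distributions whose entries are multiples of 1/n."""
    for i in range(n + 1):
        for j in range(n + 1 - i):
            yield (i / n, j / n, (n - i - j) / n)


# Hypothetical example: alpha*p_d >= p_dbar holds on Supp(p_d) (0.8 >= 0.4).
p_dbar = (0.4, 0.6, 0.0)
p_d = (1.0, 0.0, 0.0)
alpha = 0.8

# Brute-force minimizer of C(G) over the grid of candidate p_g.
best_p_g = min(
    simplex_grid(50),
    key=lambda p_g: criterion(p_dbar, p_g, p_d, alpha),
)
```

In this example the grid minimizer places all of its mass on the second point, the only point in $\mathrm{Supp}(p_{\bar d}) - \mathrm{Supp}(p_d)$.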

In sum, the generator is prone to output samples located in high-density areas of $p_{\bar d} - \alpha p_d$.

Another concern is the convergence of Algorithm 1.

Proposition 3. The discriminator reaches its optimal value given $G$ in Algorithm 1, and $p_g$ is updated by minimizing

\[ \mathbb{E}_{x \sim p_{\bar d}(x)}[\log D_G(x)] + \mathbb{E}_{x \sim p(x)}[\log(1 - D_G(x))]. \]

If $G$ and $D$ have enough capacity, then $p_g$ converges to $\arg\min_{p_g} \mathrm{JSD}\big(p_{\bar d} \,\|\, (1-\alpha)p_g + \alpha p_d\big)$.

Proof. Consider $V(G, D) = U(p_g, D)$ as a function of $p_g$. Following the proof idea of Proposition 2 in [14], if $f(x) = \sup_{\alpha \in A} f_\alpha(x)$ and $f_\alpha(x)$ is convex in $x$ for every $\alpha$, then $\partial f_\beta(x) \in \partial f$ if $\beta = \arg\sup_{\alpha \in A} f_\alpha(x)$. In other words, if $\sup_D V(G, D)$ is convex in $p_g$, the subderivatives of $\sup_D V(G, D)$ include the derivative of the function at the point where the maximum is attained, which implies convergence with sufficiently small updates of $p_g$. This completes the proof.
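The convexity fact used in the proof can be checked on a toy family. The sketch below is an illustration under assumed choices, not part of the original argument: it takes the hypothetical family $f_a(x) = ax - a^2$ on a finite index set `A`, forms the pointwise supremum, and tests whether the slope of the active member at a point satisfies the subgradient inequality $f(y) \ge f(x) + g\,(y - x)$ on a set of test points.

```python
def f_family(a, x):
    """Hypothetical family member f_a(x) = a*x - a^2 (linear, hence convex, in x)."""
    return a * x - a * a


# Finite index set standing in for A (an assumption for illustration).
A = [i / 10 for i in range(-20, 21)]


def f_sup(x):
    """Pointwise supremum f(x) = max over a in A of f_a(x)."""
    return max(f_family(a, x) for a in A)


def active_index(x):
    """Index beta attaining the supremum at x."""
    return max(A, key=lambda a: f_family(a, x))


def is_subgradient(g, x, test_points, tol=1e-12):
    """Check f(y) >= f(x) + g*(y - x) at every test point y."""
    return all(f_sup(y) >= f_sup(x) + g * (y - x) - tol for y in test_points)


x0 = 1.0
beta = active_index(x0)          # index of the active member at x0
slope = beta                     # derivative of f_beta with respect to x
ys = [i / 4 for i in range(-12, 13)]

ok = is_subgradient(slope, x0, ys)            # slope of the active member
bad = is_subgradient(slope + 1.0, x0, ys)     # a slope that is too steep
```

The slope of the active member passes the subgradient test, while the perturbed slope fails it.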

Chapter 5

Applications

DSGAN has been applied to three problems: semi-supervised learning, robustness enhancement of deep networks, and novelty detection. For semi-supervised learning, DSGAN acts as a “bad generator,” which creates complement samples in the feature space of real data. For robustness enhancement, DSGAN generates adversarial examples located in the low-density areas of the training data. For novelty detection, DSGAN generates samples (unseen data) as boundary points around the training data.

5.1 Semi-Supervised Learning

Semi-supervised learning (SSL) is a learning paradigm that uses a small amount of labeled data together with a large amount of unlabeled data. Existing SSL methods based on generative models (e.g., VAE [31] and GAN [4]) obtain good empirical results. [5] shows theoretically that good semi-supervised learning requires a bad GAN with the objective function:

\[ \max_D \; \mathbb{E}_{x, y \sim L} \log P_D(y \mid x, y \le K) + \mathbb{E}_{x \sim p_d(x)} \log P_D(y \le K \mid x) + \mathbb{E}_{x \sim p_g(x)} \log P_D(K+1 \mid x), \tag{5.1} \]

where $(x, y)$ denotes a data point and its corresponding label, $\{1, 2, \ldots, K\}$ denotes the label space for classification, and $L = \{(x, y)\}$ is the labeled dataset. Moreover, in the semi-supervised setting, $p_d$ in (5.1) is the distribution of the unlabeled data. Note that the discriminator $D$ in the GAN also plays the role of the classifier. If the generator distribution exactly matches the real data distribution (i.e., $p_g = p_d$), then the classifier trained with the objective function (5.1) on the unlabeled data cannot perform better than one trained by supervised learning with the objective function:

\[ \max_D \; \mathbb{E}_{x, y \sim L} \log P_D(y \mid x, y \le K). \tag{5.2} \]

In contrast, the generator should generate complement samples, which lie in the low-density areas of $p_d$. Under some mild assumptions, those complement samples help $D$ learn correct decision boundaries in the low-density areas, because the probabilities of the true classes are forced to be low in out-of-distribution areas.
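A minimal sketch of how the three terms of objective (5.1) could be estimated from samples is given below. It assumes a classifier that outputs $K+1$ logits per input, with the last index standing for the extra "generated" class, and it reads the last term of (5.1) as the classifier's probability of class $K+1$. The function `bad_gan_objective`, the 0-based label convention, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def mean(values):
    return sum(values) / len(values)


def bad_gan_objective(labeled, unlabeled, generated, K):
    """Empirical estimate of objective (5.1) for a (K+1)-way classifier.

    labeled:   list of (logits, y) with y in {0, ..., K-1} (0-based labels)
    unlabeled: list of logits for samples drawn from p_d
    generated: list of logits for samples drawn from p_g
    Every logits vector has length K+1; index K is the 'generated' class.
    """
    # log P_D(y | x, y <= K): class probability renormalized over real classes
    sup = mean([
        math.log(softmax(l)[y] / sum(softmax(l)[:K])) for l, y in labeled
    ])
    # log P_D(y <= K | x): total probability assigned to the real classes
    unl = mean([math.log(sum(softmax(l)[:K])) for l in unlabeled])
    # log P_D(K+1 | x): probability of the extra class on generated samples
    gen = mean([math.log(softmax(l)[K]) for l in generated])
    return sup + unl + gen


# Toy logits for K = 2 real classes (assumed values for illustration).
uniform = bad_gan_objective(
    labeled=[([0.0, 0.0, 0.0], 0)],
    unlabeled=[[0.0, 0.0, 0.0]],
    generated=[[0.0, 0.0, 0.0]],
    K=2,
)
confident = bad_gan_objective(
    labeled=[([5.0, 0.0, 0.0], 0)],
    unlabeled=[[5.0, 0.0, 0.0]],
    generated=[[0.0, 0.0, 5.0]],
    K=2,
)
```

A classifier that separates real from generated inputs and labels the real ones correctly attains a higher value of the objective than the uninformative uniform classifier.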

The complement samples in [5] are complicated to produce. We will demonstrate in Sec. 6 that DSGAN can generate complement samples easily.
