Novelty Detection

Novelty detection determines whether a query example belongs to one seen class. If the samples of the seen class are regarded as positive data, the difficulty is that no negative data are available in the training phase, so supervised learning cannot be applied. A classical method for this problem is the One-Class SVM (OCSVM) [37], which requires only positive data as training inputs. However, OCSVM often suffers from the curse of dimensionality and from poor computational scalability.

Recently, novelty detection has made great progress with the advent of deep learning.

[38][39] focus on learning a representative latent space for the one seen class. When testing, the query image is projected onto the learned latent space. Then, the difference between the query image and its inverse image (reconstruction) is measured. In other words, all we need is to train an encoder for projection and a decoder for reconstruction.

In this setting, an autoencoder (AE) is usually adopted to learn both the encoder and the decoder [38][10]. Let Enc(·) be the encoder and Dec(·) be the decoder. The loss function of the AE is defined as:

$$\min_{\mathrm{Enc},\,\mathrm{Dec}} \; \mathbb{E}_{x \sim p_{pos}(x)} \, \lVert x - \mathrm{Dec}(\mathrm{Enc}(x)) \rVert_2^2, \tag{5.5}$$

where p_pos is the distribution of one seen class. After training, a query example x_test is classified as the seen class if

$$\lVert x_{test} - \mathrm{Dec}(\mathrm{Enc}(x_{test})) \rVert_2^2 \le \tau, \tag{5.6}$$

where τ ∈ R⁺ controls the trade-off between the true positive rate and the false positive rate. However, (5.6) relies on two assumptions: (1) the positive samples from the seen class have low reconstruction error; (2) the AE (or its latent space) cannot describe negative examples from unseen classes well, leading to a relatively higher reconstruction error.
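The decision rule in (5.6) can be illustrated with a minimal sketch. It assumes a linear encoder and decoder obtained by PCA as a stand-in for a trained AE (a simplification, not the deep AE of [38][10]), synthetic data in R^10, and a threshold τ set to the 95th percentile of the training reconstruction errors. All function names and data are hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(x_pos, latent_dim):
    """Fit a linear Enc/Dec pair by PCA (assumption: stand-in for a trained AE)."""
    mean = x_pos.mean(axis=0)
    # Rows of vt are orthonormal directions sorted by explained variance.
    _, _, vt = np.linalg.svd(x_pos - mean, full_matrices=False)
    basis = vt[:latent_dim]  # shape: (latent_dim, input_dim)

    def enc(x):
        return (x - mean) @ basis.T

    def dec(z):
        return z @ basis + mean

    return enc, dec

def reconstruction_error(x, enc, dec):
    """Squared L2 reconstruction error per example, the left-hand side of (5.6)."""
    return np.sum((x - dec(enc(x))) ** 2, axis=1)

rng = np.random.default_rng(0)
mixing = rng.normal(size=(2, 10))  # hypothetical 2-D structure of the seen class

def sample_positive(n):
    # Seen-class samples lie near a 2-D subspace, plus small noise.
    return rng.normal(size=(n, 2)) @ mixing + 0.05 * rng.normal(size=(n, 10))

x_train = sample_positive(1000)
x_pos_test = sample_positive(200)
x_novel = rng.normal(size=(200, 10))  # unseen-class samples fill all of R^10

enc, dec = fit_linear_autoencoder(x_train, latent_dim=2)
# Assumption: tau is chosen as the 95th percentile of the training errors.
tau = np.quantile(reconstruction_error(x_train, enc, dec), 0.95)

pos_accept = np.mean(reconstruction_error(x_pos_test, enc, dec) <= tau)
novel_accept = np.mean(reconstruction_error(x_novel, enc, dec) <= tau)
print(f"accepted positives: {pos_accept:.2f}, accepted novel: {novel_accept:.2f}")
```

In this toy setting both assumptions hold by construction, because the novel samples lie far from the learned subspace. Raising τ accepts more positives at the cost of accepting more novel samples.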

In general, the first assumption inherently holds when both testing and training data come from the same seen class. However, [38][10] observed that assumption (2) does not always hold, because the loss function in (5.5) contains no term that enforces a high reconstruction error on negative data.

To make assumption (2) hold, given positive data as training inputs, we propose in Sec. 6.3 to use DSGAN to generate negative examples in the latent space. The loss function of the AE is then modified to enforce a high reconstruction error on negative data.

Chapter 6 Experiments

In this chapter, we present empirical results on semi-supervised learning, robustness enhancement of deep networks, and novelty detection in Sec. 6.1, Sec. 6.2, and Sec. 6.3, respectively.

Note that the training procedure of DSGAN can be improved by other extensions of GANs, such as WGAN [25], WGAN-GP [26], EBGAN [40], and LSGAN [41]. We adopt WGAN-GP in our method so that DSGAN trains stably and suffers less from mode collapse.

6.1 DSGAN in Semi-Supervised Learning

Following previous works, we apply the proposed DSGAN to semi-supervised learning on three benchmark datasets: MNIST [42], SVHN [43], and CIFAR-10 [44].

We first introduce how DSGAN generates complement samples in the feature space.

Specifically, [5] proved that if the complement samples generated by G satisfy the following two assumptions in (6.1) and (6.2):

$$\forall x \sim p_g(x),\; 0 > \max_{1 \le i \le K} w_i^T f(x) \quad \text{and} \quad \forall x \sim p_d(x),\; 0 < \max_{1 \le i \le K} w_i^T f(x), \tag{6.1}$$

where f is the feature extractor and w_i is the linear classifier for the i-th class, and

$$\forall x_1 \sim L,\; x_2 \sim p_d(x),\; \exists x_g \sim p_g(x) \;\text{ s.t. }\; f(x_g) = \beta f(x_1) + (1-\beta) f(x_2) \;\text{ with }\; \beta \in [0, 1], \tag{6.2}$$

then all unlabeled data will be classified correctly via the objective function (5.1). Specifically, (6.1) ensures that the classifiers can discriminate generated data from unlabeled data, and (6.2) places the decision boundary in low-density areas of p_d.
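As an illustration of assumption (6.1), the following minimal sketch measures what fraction of generated and real feature vectors satisfy the sign condition on max_i w_i^T f(x). The function names, the toy weights, and the toy features are hypothetical; in the experiments f and the w_i are parts of a trained network.

```python
import numpy as np

def max_class_score(features, weights):
    """Return max_i w_i^T f(x) for each row of `features`.

    features: (n, d) array of feature vectors f(x).
    weights:  (K, d) array whose i-th row is the linear classifier w_i.
    """
    return (features @ weights.T).max(axis=1)

def fraction_satisfying_6_1(feat_generated, feat_real, weights):
    """Fractions of generated / real samples that satisfy (6.1)."""
    gen_ok = np.mean(max_class_score(feat_generated, weights) < 0)
    real_ok = np.mean(max_class_score(feat_real, weights) > 0)
    return gen_ok, real_ok

# Toy example with K = 2 classes and 2-D features (hypothetical values).
weights = np.array([[1.0, 0.0], [0.0, 1.0]])
feat_real = np.array([[2.0, -1.0], [-1.0, 3.0]])
feat_generated = np.array([[-1.0, -2.0], [-0.5, 0.5]])

gen_ok, real_ok = fraction_satisfying_6_1(feat_generated, feat_real, weights)
print(gen_ok, real_ok)
```

Here the second generated feature has a positive score for the second class, so only half of the generated samples satisfy (6.1), while both real samples do.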

The assumption in (6.2) implies that the complement samples must lie in the space formed by linear combinations of labeled and unlabeled data. In addition, by assumption (6.1), they cannot fall into the real data distribution p_d. To let DSGAN generate such samples, we let the samples of p_d̄ be linear combinations of samples from L and p_d. Since $p_g(x) \approx \frac{p_{\bar{d}}(x) - \alpha p_d(x)}{1 - \alpha}$, p_g tends to match p_d̄, while the term −αp_d ensures that samples from p_g do not belong to p_d. Thus, p_g satisfies both assumptions in (6.1) and (6.2).
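The construction of samples from p_d̄ in the feature space can be sketched as follows. This is a minimal illustration under assumptions: β is treated as a fixed scalar (as suggested by the single values reported per dataset), the labeled and unlabeled features are random placeholders, and the function name is hypothetical.

```python
import numpy as np

def sample_p_dbar_features(feat_labeled, feat_unlabeled, beta, n, rng):
    """Draw n vectors beta * f(x1) + (1 - beta) * f(x2), following the form in (6.2).

    feat_labeled:   (n_l, d) features f(x1) of labeled samples x1 from L.
    feat_unlabeled: (n_u, d) features f(x2) of unlabeled samples x2 from p_d.
    """
    i = rng.integers(0, len(feat_labeled), size=n)
    j = rng.integers(0, len(feat_unlabeled), size=n)
    return beta * feat_labeled[i] + (1.0 - beta) * feat_unlabeled[j]

rng = np.random.default_rng(0)
feat_labeled = rng.normal(size=(100, 16))     # placeholder features
feat_unlabeled = rng.normal(size=(5000, 16))  # placeholder features

mixed = sample_p_dbar_features(feat_labeled, feat_unlabeled, beta=0.3, n=64, rng=rng)
print(mixed.shape)
```

These mixed features play the role of samples from p_d̄; the term −αp_d in the approximation above is what pushes the generator away from the real data distribution.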

In practice, we parameterize f and all w_i’s together as a neural network. The details of the experiments, including the network architectures, can be found in Appendix 6.1.3.

6.1.1 Datasets: MNIST, SVHN, and CIFAR-10

To evaluate the semi-supervised learning task, we used 60000/73257/50000 samples for training and 10000/26032/10000 samples for testing from the MNIST/SVHN/CIFAR-10 datasets, respectively. For the semi-supervised setting, we randomly chose 100/1000/4000 of the training samples as the MNIST/SVHN/CIFAR-10 labeled dataset, with the same number of labeled samples for every class.
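The class-balanced selection of labeled samples can be sketched as below. The function name and the placeholder label array are hypothetical; only the requirement of an equal number of labeled samples per class comes from the text.

```python
import numpy as np

def select_balanced_labeled(labels, n_labeled, rng):
    """Pick n_labeled indices with the same number of samples from every class."""
    classes = np.unique(labels)
    per_class, remainder = divmod(n_labeled, len(classes))
    if remainder:
        raise ValueError("n_labeled must be divisible by the number of classes")
    chosen = [
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(chosen))

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=60000)  # placeholder for training labels
labeled_idx = select_balanced_labeled(labels, n_labeled=100, rng=rng)
print(np.bincount(labels[labeled_idx], minlength=10))
```

The remaining training samples are then treated as unlabeled data.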

Our criterion to determine the hyperparameters is introduced in Appendix 6.1.3. We performed testing with 10/ 5/ 5 runs on MNIST/ SVHN/ CIFAR10 based on the selected hyperparameters and randomly selected labeled dataset. Following [5], the results are recorded as the mean and standard deviation of the number of errors from each run.

6.1.2 Main Results

First, the hyperparameters we chose are listed in Table 6.3 in Appendix 6.1.3. Second, the results obtained by our DSGAN and the state-of-the-art methods on the three benchmark datasets are reported in Table 6.2. The effectiveness of the tricks in Sec. 3.4 is shown in Table 6.1, which validates their influence.

Table 6.1: Semi-supervised learning results on MNIST with and without the sampling tricks.

Methods                  MNIST (# errors)
Our method w/o tricks    91.0 ± 7.0
Our method w/ tricks     82.7 ± 4.6

It can be observed that our results are competitive with state-of-the-art methods on the three datasets. Note that the results of badGAN [5] were reproduced using the code released by the authors.

In comparison with [5], our method does not rely on an additional density estimation network, PixelCNN++ [45]. Although PixelCNN++ is one of the best density estimation networks, it cannot estimate the density in the feature space, which changes dynamically during training. This drawback makes the models in [5] fail to fulfill the assumptions in their paper.

Moreover, Table 6.2 shows that our results are comparable to the best record of badGAN [5] and are better than the other approaches on MNIST and SVHN. On CIFAR-10, our method is inferior only to CT-GAN. This may not be a fair comparison, since CT-GAN uses extra techniques, including temporal ensembling and data augmentation, that the other methods do not use.

Table 6.2: Comparison of semi-supervised learning between our DSGAN and state-of-the-art methods: CatGAN [2], Triple-GAN [3], FM [4], badGAN [5], and CT-GAN [6]. For a fair comparison, we only consider GAN-based methods. ∗ indicates the use of the same classifier architecture. † indicates a larger classifier architecture. ‡ indicates the use of data augmentation. The results for MNIST are reported as the number of errors, while the others are reported as the percentage of errors. A dash marks an entry that is not reported.

Methods         MNIST          SVHN           CIFAR-10
CatGAN          191 ± 10       -              19.58 ± 0.46
Triple-GAN†     91 ± 58        5.77 ± 0.17    16.99 ± 0.36
FM∗             93 ± 6.5       8.11 ± 1.3     18.63 ± 1.32
badGAN∗         86.2 ± 13.2    4.48 ± 0.16    16.25 ± 0.33
CT-GAN‡         -              -              9.98 ± 0.21
Our method∗     82.7 ± 4.6     4.88 ± 0.07    15.08 ± 0.24

6.1.3 Appendix: Experimental Details

Hyperparameters

The hyperparameters were chosen to make our generated samples consistent with the assumptions in (6.1) and (6.2). In practice, however, if we force all samples produced by the generator to follow the assumption in (6.2), the generated distribution is not close to the true distribution, and in most cases a large margin exists between them, which is not what we desire. Therefore, in our experiments we make a concession: about 90% of the generated samples are required to satisfy the assumption. We tune the hyperparameters to meet this objective. Table 6.3 lists our hyperparameter settings, where β is defined in (6.2).

Table 6.3: Hyperparameters in semi-supervised learning.

Hyperparameters    MNIST    SVHN    CIFAR-10
α                  0.8      0.8     0.5
β                  0.3      0.1     0.1

Architecture

For a fair comparison with other methods, our generators and classifiers for MNIST, SVHN, and CIFAR-10 are the same as those in [4] and [5]. However, unlike previous works, which have only a generator and a discriminator, we design an additional discriminator in the feature space. Its architecture is similar across all datasets and differs only in the input dimensions. Following [5], we also define the feature space as the input space of the output layer of the discriminators.

Compared with SVHN and CIFAR-10, MNIST is a simple dataset, so its networks are composed only of fully connected layers. Batch normalization (BN) or weight normalization (WN) is applied to every layer to stabilize training. Moreover, Gaussian noise is added before each layer in the classifier, as proposed in [46]. We find that the added Gaussian noise has a positive effect on semi-supervised learning, so we keep using it. The architecture is shown in Table 6.4.

Table 6.5 and Table 6.6 show the models for SVHN and CIFAR-10, respectively. These models are almost the same except for minor differences, e.g., the number of convolutional filters and the types of dropout. In these tables, given a dropping rate, “Dropout” is a normal dropout, in which the elements of the input tensor are randomly set to zero, while “Dropout2d” is a dropout applied per channel, in which all the elements of a randomly selected channel are set to zero.
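The difference between the two dropout types can be made concrete with a small NumPy sketch. The function names are hypothetical, the input is assumed to have shape (batch, channels, height, width), and the rescaling of kept values by 1/(1 − rate) follows the common inverted-dropout convention, which is an assumption because the text does not specify it.

```python
import numpy as np

def dropout(x, rate, rng):
    """Element-wise dropout: every entry is zeroed independently."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)  # assumption: inverted-dropout rescaling

def dropout2d(x, rate, rng):
    """Channel-wise dropout: a dropped channel is zeroed in its entirety."""
    # One keep/drop decision per (sample, channel), broadcast over height and width.
    keep = rng.random(x.shape[:2] + (1, 1)) >= rate
    return x * keep / (1.0 - rate)  # assumption: inverted-dropout rescaling

rng = np.random.default_rng(0)
x = np.ones((4, 8, 5, 5))  # (batch, channels, height, width)
y_elem = dropout(x, rate=0.5, rng=rng)
y_chan = dropout2d(x, rate=0.5, rng=rng)

# In y_chan, each 5x5 feature map is either all zeros or all 2.0.
print(y_chan[0, :, 0, 0])
```

Channel-wise dropout is typically paired with convolutional layers, where neighboring activations within a feature map are strongly correlated.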

Furthermore, the training procedure alternates between k steps of optimizing D and one step of optimizing G. We find that k in Algorithm 1 plays a key role in the problem of mode collapse for different applications. For semi-supervised learning, we set k = 1 for all datasets.
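The alternation schedule can be sketched as a simple loop. This is only a schematic of the k-to-1 update ratio, not a reproduction of Algorithm 1; the function name and the two step callbacks are hypothetical placeholders for the actual parameter updates of D and G.

```python
def train_alternating(n_iterations, k, discriminator_step, generator_step):
    """Run k discriminator updates followed by one generator update, repeatedly."""
    for _ in range(n_iterations):
        for _ in range(k):
            discriminator_step()  # placeholder: one optimization step on D
        generator_step()          # placeholder: one optimization step on G

# Record the order of updates for k = 2 to show the schedule.
calls = []
train_alternating(
    n_iterations=3,
    k=2,
    discriminator_step=lambda: calls.append("D"),
    generator_step=lambda: calls.append("G"),
)
print(calls)
```

With k = 1, as used for semi-supervised learning, the schedule reduces to strictly alternating D and G updates.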
