
Novelty Detection

Novelty detection determines whether a query example belongs to the single seen class. If the samples of the seen class are regarded as positive data, the difficulty is that no negative data are available in the training phase, so standard supervised learning cannot be applied. A classical remedy is the One-Class SVM (OCSVM) [37], which requires only positive data as training inputs. However, OCSVM often suffers from the curse of dimensionality because of its poor computational scalability.

Recently, novelty detection has made great progress with the advent of deep learning. The methods in [38][39] focus on learning a representative latent space for the seen class. At test time, the query image is projected onto the learned latent space, and the difference between the query image and its inverse image (reconstruction) is measured. In other words, all that is needed is an encoder for projection and a decoder for reconstruction.

In this setting, an autoencoder (AE) is usually adopted to learn both the encoder and the decoder [38][10]. Let Enc(·) and Dec(·) denote the encoder and the decoder, respectively. The loss function of the AE is defined as:

$$\min_{\mathrm{Enc},\,\mathrm{Dec}} \; \mathbb{E}_{x \sim p_{pos}(x)} \left\| x - \mathrm{Dec}(\mathrm{Enc}(x)) \right\|_2^2, \tag{5.5}$$

where $p_{pos}$ is the distribution of the seen class. After training, a query example $x_{test}$ is classified as the seen class if

$$\left\| x_{test} - \mathrm{Dec}(\mathrm{Enc}(x_{test})) \right\|_2^2 \le \tau, \tag{5.6}$$

where $\tau \in \mathbb{R}^+$ controls the trade-off between the true positive rate and the false positive rate. However, (5.6) rests on two assumptions: (1) positive samples from the seen class have low reconstruction error; (2) the AE (or its latent space) cannot describe negative examples from unseen classes well, which leads to a relatively higher reconstruction error.
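The decision rule in (5.6) can be illustrated with a minimal sketch. The encoder and decoder below are not a trained AE; they are a toy stand-in (an assumption made for illustration) that projects 2-D points onto a fixed direction `u`. The threshold value and the two query points are likewise hypothetical.

```python
import math

def reconstruction_error(x, enc, dec):
    """Squared L2 distance between x and its reconstruction Dec(Enc(x))."""
    x_hat = dec(enc(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def is_seen_class(x, enc, dec, tau):
    """Rule (5.6): accept x as the seen class if the error is at most tau."""
    return reconstruction_error(x, enc, dec) <= tau

# Toy stand-in for a trained AE (assumption): positive data lie near the
# direction u in R^2; the encoder projects onto u, the decoder maps back.
u = (1 / math.sqrt(2), 1 / math.sqrt(2))

def enc(x):
    return x[0] * u[0] + x[1] * u[1]  # 1-D latent code

def dec(z):
    return (z * u[0], z * u[1])

tau = 0.05                # hypothetical threshold
seen_like = (1.0, 1.1)    # close to the line spanned by u
novel_like = (1.0, -1.0)  # orthogonal to u, so it reconstructs poorly

print(is_seen_class(seen_like, enc, dec, tau))
print(is_seen_class(novel_like, enc, dec, tau))
```

Raising `tau` accepts more queries as the seen class, which increases both the true positive rate and the false positive rate.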

In general, the first assumption inherently holds when both testing and training data come from the same seen class. However, [38][10] observed that assumption (2) does not always hold, because the loss function in (5.5) contains no term that forces negative data to have high reconstruction error.

To make assumption (2) hold, given positive data as training inputs, we propose using DSGAN to generate negative examples in the latent space (Sec. 6.3). The loss function of the AE is then modified to force negative data to have high reconstruction error.

Chapter 6 Experiments

In this chapter, we present empirical results on semi-supervised learning, robustness enhancement of deep networks, and novelty detection in Sec. 6.1, Sec. 6.2, and Sec. 6.3, respectively.

Note that the training procedure of DSGAN can be improved by other extensions of GANs, such as WGAN [25], WGAN-GP [26], EBGAN [40], and LSGAN [41]. We adopt WGAN-GP in our method so that DSGAN is stable in training and suffers less from mode collapse.

6.1 DSGAN in Semi-Supervised Learning

Following previous works, we apply the proposed DSGAN to semi-supervised learning on three benchmark datasets: MNIST [42], SVHN [43], and CIFAR-10 [44].

We first describe how DSGAN generates complement samples in the feature space. Specifically, [5] proved that if the complement samples generated by G satisfy the following two assumptions in (6.1) and (6.2):

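$$\forall x \sim p_g(x),\; 0 > \max_{1 \le i \le K} w_i^T f(x) \quad \text{and} \quad \forall x \sim p_d(x),\; 0 < \max_{1 \le i \le K} w_i^T f(x), \tag{6.1}$$

where $f$ is the feature extractor and $w_i$ is the linear classifier for the $i$-th class, and

$$\forall x_1 \sim \mathcal{L},\; x_2 \sim p_d(x),\; \exists\, x_g \sim p_g(x) \ \text{s.t.} \ f(x_g) = \beta f(x_1) + (1 - \beta) f(x_2) \ \text{with} \ \beta \in [0, 1], \tag{6.2}$$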

then all unlabeled data will be classified correctly via the objective function (5.1). Specifically, (6.1) ensures that the classifiers can discriminate generated data from unlabeled data, and (6.2) places the decision boundary in low-density areas of $p_d$.
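The two conditions can be made concrete with a small sketch. It assumes, for illustration only, that features are plain lists of floats and that each class has one weight vector; the weights, the constant third feature acting as a bias, and the two feature vectors are all hypothetical.

```python
def max_logit(w, feat):
    """max over classes i of w_i^T f(x) for a single feature vector."""
    return max(sum(wij * fj for wij, fj in zip(wi, feat)) for wi in w)

def satisfies_6_1(w, gen_feats, real_feats):
    """Condition (6.1): generated features score < 0, real features score > 0."""
    return (all(max_logit(w, f) < 0 for f in gen_feats)
            and all(max_logit(w, f) > 0 for f in real_feats))

def interpolate(f1, f2, beta):
    """Feature targeted by (6.2): beta * f(x1) + (1 - beta) * f(x2)."""
    return [beta * a + (1 - beta) * b for a, b in zip(f1, f2)]

# Hypothetical two-class setup; the last feature is a constant 1 (bias input).
w = [[1.0, 0.0, -0.6],
     [0.0, 1.0, -0.6]]
labeled_feat = [1.0, 0.0, 1.0]    # f(x1), x1 drawn from the labeled set
unlabeled_feat = [0.0, 1.0, 1.0]  # f(x2), x2 drawn from p_d

# A complement sample placed between the two real features.
complement_feat = interpolate(labeled_feat, unlabeled_feat, beta=0.5)

print(max_logit(w, complement_feat) < 0)
print(satisfies_6_1(w, [complement_feat], [labeled_feat, unlabeled_feat]))
```

In this toy case the interpolated feature falls between the two classes, where every class score is negative, which is the low-density region that (6.2) is meant to cover.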

The assumption in (6.2) implies that the complement samples must lie in the space spanned by linear combinations of labeled and unlabeled data. In addition, they cannot fall into the real data distribution $p_d$ because of the assumption in (6.1). To make DSGAN generate such samples, we let the samples of $p_{\bar{d}}$ be linear combinations of those from $\mathcal{L}$ and $p_d$. Since

$$p_g(x) \approx \frac{p_{\bar{d}}(x) - \alpha p_d(x)}{1 - \alpha},$$

$p_g$ will tend to match $p_{\bar{d}}$, while the term $-\alpha p_d$ ensures that samples from $p_g$ do not belong to $p_d$. Thus, $p_g$ satisfies both assumptions in (6.1) and (6.2).
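A worked example on a discrete domain may help to see why the subtracted term removes mass from the support of the real data. The four-bin histograms and the value of alpha below are hypothetical, and the sketch only evaluates the ratio bin by bin; it is not the training procedure of DSGAN.

```python
def target_generator_density(p_dbar, p_d, alpha):
    """Evaluate (p_dbar - alpha * p_d) / (1 - alpha) for every bin."""
    return [(a - alpha * b) / (1 - alpha) for a, b in zip(p_dbar, p_d)]

# Hypothetical histograms over four bins.
p_d = [0.5, 0.5, 0.0, 0.0]     # real data occupy the first two bins
p_dbar = [0.4, 0.4, 0.1, 0.1]  # target mixture covers all four bins
alpha = 0.8

p_g = target_generator_density(p_dbar, p_d, alpha)
print([round(v, 6) for v in p_g])  # [0.0, 0.0, 0.5, 0.5]
```

Here the generator density vanishes on the two bins occupied by the real data and concentrates on the remaining part of the target mixture.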

In practice, we parameterize $f$ and all $w_i$ together as a neural network. The details of the experiments, including the network architectures, can be found in Appendix 6.1.3.

6.1.1 Datasets: MNIST, SVHN, and CIFAR-10

To evaluate the semi-supervised learning task, we used 60000/73257/50000 training samples and 10000/26032/10000 test samples from the MNIST/SVHN/CIFAR-10 datasets, respectively. For the semi-supervised setting, we randomly chose 100/1000/4000 of the training samples as the labeled set for MNIST/SVHN/CIFAR-10, with the same number of labeled samples in every class.
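A class-balanced labeled subset of this kind can be drawn as in the following sketch. The function name, the seed handling, and the synthetic label vector are assumptions for illustration; the source does not specify how the random selection was implemented.

```python
import random
from collections import defaultdict

def sample_balanced_labeled_set(labels, num_labeled, num_classes, seed=0):
    """Return sorted indices with num_labeled // num_classes samples per class."""
    per_class = num_labeled // num_classes
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    chosen = []
    for c in range(num_classes):
        # Sample without replacement inside each class.
        chosen.extend(rng.sample(by_class[c], per_class))
    return sorted(chosen)

# Hypothetical label vector standing in for a real training set.
labels = [i % 10 for i in range(1000)]
labeled_idx = sample_balanced_labeled_set(labels, num_labeled=100, num_classes=10)
print(len(labeled_idx))
```

The remaining indices would then serve as the unlabeled set, and changing `seed` gives a different labeled subset for each run.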

Our criterion for choosing the hyperparameters is described in Appendix 6.1.3. We ran testing 10/5/5 times on MNIST/SVHN/CIFAR-10, each run using the selected hyperparameters and a randomly selected labeled set. Following [5], the results are reported as the mean and standard deviation of the number of errors over the runs.
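The reporting convention can be sketched as follows. The error counts are made-up placeholders, not results from the experiments, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

# Hypothetical error counts from five runs (placeholders, not real results).
errors_per_run = [80, 85, 78, 90, 82]

mean_errors = statistics.mean(errors_per_run)
# Sample standard deviation (n - 1 in the denominator); an assumption.
std_errors = statistics.stdev(errors_per_run)

print(f"{mean_errors:.1f} +/- {std_errors:.1f}")
```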

6.1.2 Main Results

First, the hyperparameters we chose are listed in Table 6.3 in Appendix 6.1.3. Second, the results obtained by our DSGAN and the state-of-the-art methods on the three benchmark datasets are listed in Table 6.2. The effect of the tricks in Sec. 3.4 is shown in Table 6.1, which confirms that they reduce the number of errors on MNIST.

Table 6.1: Semi-supervised learning results on MNIST with and without the sampling tricks.

| Methods | MNIST (# errors) |
|---|---|
| Our method w/o tricks | 91.0 ± 7.0 |
| Our method w/ tricks | 82.7 ± 4.6 |

It can be observed that our results are competitive with state-of-the-art methods on the three datasets. Note that the results of badGAN [5] were reproduced with the code released by its authors.

Compared with [5], our method does not rely on an additional density estimation network, PixelCNN++ [45]. Although PixelCNN++ is one of the best density estimation networks, it cannot estimate the density in the feature space, which changes dynamically during training. This drawback prevents the models in [5] from fulfilling the assumptions made in their paper.

Moreover, Table 6.2 shows that our results are comparable to the best record of badGAN [5] and better than the other approaches on MNIST and SVHN. On CIFAR-10, our method is inferior only to CT-GAN. This might not be a fair comparison, since CT-GAN uses extra techniques, including temporal ensembling and data augmentation, that the other methods do not use.

Table 6.2: Comparison of semi-supervised learning between our DSGAN and state-of-the-art methods: CatGAN [2], TripleGAN [3], FM [4], badGAN [5], and CT-GAN [6]. For a fair comparison, we only consider GAN-based methods. ∗ indicates the use of the same classifier architecture. † indicates a larger classifier architecture. ‡ indicates the use of data augmentation. The results for MNIST are reported as the number of errors, while the others are reported as error percentages.

| Methods | MNIST | SVHN | CIFAR-10 |
|---|---|---|---|
| CatGAN | 191 ± 10 | - | 19.58 ± 0.46 |
| TripleGAN | 91 ± 58 | 5.77 ± 0.17 | 16.99 ± 0.36 |
| FM | 93 ± 6.5 | 8.11 ± 1.3 | 18.63 ± 1.32 |
| badGAN | 86.2 ± 13.2 | 4.48 ± 0.16 | 16.25 ± 0.33 |
| CT-GAN | - | - | 9.98 ± 0.21 |
| Our method | 82.7 ± 4.6 | 4.88 ± 0.07 | 15.08 ± 0.24 |

6.1.3 Appendix: Experimental Details

Hyperparameters

The hyperparameters were chosen so that the generated samples are consistent with the assumptions in (6.1) and (6.2). In practice, however, if we force every sample produced by the generator to follow the assumption in (6.2), the generated distribution is not close to the true distribution, and in most cases a large margin separates the two, which is not what we desire. In our experiments, we therefore make a concession: about 90% of the generated samples accord with the assumption. We tune the hyperparameters to meet this objective. Table 6.3 shows our hyperparameter settings, where β is defined in (6.2).

Table 6.3: Hyperparameters in semi­supervised learning.

| Hyperparameters | MNIST | SVHN | CIFAR-10 |
|---|---|---|---|
| α | 0.8 | 0.8 | 0.5 |
| β | 0.3 | 0.1 | 0.1 |

Architecture

For a fair comparison with other methods, our generators and classifiers for MNIST, SVHN, and CIFAR-10 are the same as in [4] and [5]. However, unlike previous works that have only a generator and a discriminator, we design an additional discriminator in the feature space; its architecture is similar across all datasets and differs only in the input dimensions. Following [5], we define the feature space as the input space of the output layer of the discriminators.

Compared with SVHN and CIFAR-10, MNIST is a simple dataset, so its model is composed only of fully connected layers. Batch normalization (BN) or weight normalization (WN) is applied to every layer to stabilize training. Moreover, Gaussian noise is added before each layer in the classifier, as proposed in [46]. We find that the added Gaussian noise has a positive effect on semi-supervised learning and therefore keep it. The architecture is shown in Table 6.4.

Table 6.5 and Table 6.6 show the models for SVHN and CIFAR-10, respectively. These models are almost the same except for some minor differences, e.g., the number of convolutional filters and the types of dropout. In these tables, given a dropping rate, "Dropout" is the normal dropout, in which the elements of the input tensor are randomly set to zero, while "Dropout2d" is applied on channels only, randomly zeroing all the elements of a channel.
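The difference between the two dropout types can be sketched in plain Python on a tensor stored as nested lists of shape [channels][height][width]. This is an illustration of the masking pattern only: it omits the rescaling of surviving values by 1/(1 - p) that common deep learning frameworks apply during training, and the function names are chosen here for readability.

```python
import random

def dropout(x, p, rng):
    """Element-wise dropout: every element is zeroed independently with prob. p."""
    return [[[0.0 if rng.random() < p else v for v in row] for row in ch]
            for ch in x]

def dropout2d(x, p, rng):
    """Channel-wise dropout: a whole channel is zeroed with prob. p."""
    out = []
    for ch in x:
        drop_channel = rng.random() < p  # one decision per channel
        out.append([[0.0 if drop_channel else v for v in row] for row in ch])
    return out

rng = random.Random(0)
x = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]  # 4 channels of 2x2 ones

print(dropout(x, 0.5, rng))    # zeros scattered inside channels
print(dropout2d(x, 0.5, rng))  # each channel is all ones or all zeros
```

Channel-wise dropping is often preferred after convolutional layers, where neighbouring elements of a feature map are strongly correlated.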

Furthermore, the training procedure alternates between k steps of optimizing D and one step of optimizing G. We find that k in Algorithm 1 plays a key role in the problem of mode collapse for different applications. For semi-supervised learning, we set k = 1 for all datasets.
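The alternating schedule can be sketched as a simple loop. The step functions below are placeholders that only count calls; in an actual implementation they would perform one gradient update of D and G, respectively. The function name `train` and its parameters are assumptions and are not taken from Algorithm 1.

```python
def train(num_iterations, k, d_step, g_step):
    """Alternate k discriminator updates with one generator update."""
    for _ in range(num_iterations):
        for _ in range(k):
            d_step()  # one update of D
        g_step()      # one update of G

# Placeholder steps that only count how often they are called.
counts = {"D": 0, "G": 0}

def d_step():
    counts["D"] += 1

def g_step():
    counts["G"] += 1

train(num_iterations=100, k=1, d_step=d_step, g_step=g_step)
print(counts)  # {'D': 100, 'G': 100}
```

With k = 1, as used for semi-supervised learning, D and G receive the same number of updates; a larger k gives D more updates per generator step.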
