Most adversarial-example generation strategies develop an algorithm that derives suitable perturbations from benign data and adds them back onto those data to turn them into adversarial examples. Fig. 3.1 illustrates the high-level flow of such a generation process. Because they are perturbed from existing data, the generated images resemble their unperturbed versions; although this has not been rigorously proved, it evidently limits the diversity of the adversarial examples. To fill this gap in data diversity, we propose to generate adversarial examples directly from conditional generative adversarial models.
We first denote by TC : R^m → R^n the target classifier that adversaries aim to attack, where m is the dimension of the images and n is the number of image classes. Given a class label c and an attack target label t, our goal is to train a generator G that generates adversarial images x′ = G(c, z) such that x′ corresponds to the class c while causing a misclassification that satisfies TC(x′) = t.
doi:10.6342/NTU201803656
Figure 3.1: Illustration of a typical perturbation-based method. (The iconic panda images are adopted from the representative work of Goodfellow et al. [8].)
to be less influential, keeping the generated images as realistic as those expected from typical GANs.
To achieve our goal of adversarial attack, we thus set the objective to minimize a given classifier's loss w.r.t. a given attack target class. Realistic image samples produced by the generator may then cause the classifier to misclassify them, provided the generator is trained properly.
Moreover, a targeted attack requires control over the class of the generated image samples. Hence, we chose class-conditional GANs [19, 21] as our mainframe GAN network and impose an additional objective on the generator, with the aim of producing class-specified images that mislead the target classifier.
To realize this objective in practice, we designed our model as a class-conditional GAN followed by the classifier under attack, as shown in Fig. 3.2b.
This architecture allows us to obtain the attacking loss (calculated by feeding the generated images to the target classifier) and back-propagate it to the generator in a straightforward manner.
(a) Class-Conditional GAN (b) Adaptive GAN
Figure 3.2: The high-level architecture of (a) a typical conditional-GAN variant and (b) the Adaptive GAN.
At the training stage, the generator takes the class label c, the random noise z and, notably, the label of the attack target t as inputs (Fig. 3.2b), in contrast to conventional class-conditional GANs, which take only the first two (Fig. 3.2a). It is then trained to minimize the adversarial loss and the attacking loss at the same time. Under this scenario, the generator acts like an adaptable animal that adapts to external environmental constraints. It is this adaptive behavior that gives our framework its name, Adaptive GAN.
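The three generator inputs can be illustrated with a short sketch. It assumes, purely for illustration, that the class label and the attack target are one-hot encoded and concatenated with the noise vector; the text does not specify the encoding, and the names `one_hot` and `build_generator_input` as well as the noise dimension of 100 are hypothetical.

```python
import random


def one_hot(index, num_classes):
    """Return a one-hot list of length num_classes with 1.0 at position index."""
    if not 0 <= index < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def build_generator_input(c, t, z, num_classes):
    """Concatenate class label c, attack target t and noise z into one vector.

    Assumption: one-hot labels concatenated with the noise; a conventional
    class-conditional GAN would use only one_hot(c) and z.
    """
    return one_hot(c, num_classes) + one_hot(t, num_classes) + list(z)


# Example: ask for a "1" that should be classified as "2" (10 digit classes).
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(100)]  # noise size is an assumption
g_in = build_generator_input(c=1, t=2, z=z, num_classes=10)
```

The only difference from the conventional conditional input is the extra block that encodes t, which is what lets a single generator serve several attack targets.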
The discriminator, on the other hand, is trained as in conventional GANs to distinguish between fake and real images, while the weights of the target classifier stay frozen. Note that our architecture looks somewhat similar to AdvGAN [29]; we regard this as a superficial visual resemblance, since the two architectures function differently.
After the training stage, the generated images should meet the following two goals: they are realistic to the discriminator (and to humans), yet misleading to the target classifier.
Note that since we need to calculate the attack loss during training, we must have access to the target classifier's architecture as well as its weights (i.e., the white-box setting), which is the basic limitation of Adaptive GAN. Yet the recent work [23], which addresses the black-box attack problem by training substitute networks, can be combined with our approach.
where I is an indicator function that takes the value 1 iff the predicted label TC(x′) does not equal the target label t and 0 otherwise, and J denotes the classification loss function of the target classifier w.r.t. the attack target t (e.g., cross-entropy).
Intuitively, $L_{\text{attack}}$ may take the value of the expected classification loss of TC(x′) w.r.t. the attack target t:
\[ L_{\text{attack}} = \mathbb{E}\bigl[J(TC(x'), t)\bigr] = \frac{1}{N}\sum_{n=1}^{N} J(TC(x'_n), t) \tag{3.2} \]
which would directly reflect the attack effectiveness and is exactly Equation (3.1) without the indicator function.
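Taking the definition of the indicator I together with the remark that Equation (3.2) is Equation (3.1) without it, the masked loss presumably has the form below. This is a sketch inferred from those two statements, with the per-sample index n and the normalization by N carried over from Equation (3.2) as assumptions.

```latex
\[
L_{\text{attack}}
  = \frac{1}{N} \sum_{n=1}^{N}
    I\bigl[\,TC(x'_n) \neq t\,\bigr]\, J\bigl(TC(x'_n),\, t\bigr)
\tag{3.1}
\]
```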
We observed that directly taking Equation (3.2) as the attacking loss makes the generator fit this objective too closely: the generated images end up in a strange shape between the given class and the target class. To alleviate this problem, we use an indicator function as a mask that keeps only the losses of images whose attack is unsuccessful, so that successful ones no longer drive the update. By optimizing this adjusted objective, the generator is trained only to produce images that are misclassified as the target label, and is not tuned to increase the confidence of the misclassification. The resulting adversarial examples therefore end up with little distortion, just enough to fool the target classifier. An alternative way to balance the loss contributions of successful and unsuccessful attacks is a weighted mask with weight β < 1. We discuss the impact of the indicator function in the experiment in Section 4.1.
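The masking described above can be sketched in a few lines of plain Python. The sketch assumes that the classifier outputs a probability vector per image, that J is the cross-entropy, and that the weight β of the weighted variant is applied to the already successful samples; the text leaves the last point open, and the function names are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy J of one predicted distribution w.r.t. the target label."""
    return -math.log(max(probs[target], 1e-12))


def attack_loss(batch_probs, target, beta=0.0):
    """Masked attacking loss over a batch of classifier outputs.

    Samples not yet classified as `target` get weight 1. Samples already
    classified as `target` get weight `beta`: beta=0 is the hard indicator
    mask, beta=1 is the unmasked mean loss (assumed reading of the text).
    """
    total = 0.0
    for probs in batch_probs:
        predicted = max(range(len(probs)), key=probs.__getitem__)
        weight = 1.0 if predicted != target else beta
        total += weight * cross_entropy(probs, target)
    return total / len(batch_probs)


# Two samples, attack target 1: the first attack already succeeds.
batch = [[0.1, 0.9], [0.8, 0.2]]
masked = attack_loss(batch, target=1)              # only the failed attack counts
unmasked = attack_loss(batch, target=1, beta=1.0)  # every sample counts
```

With the hard mask, the first sample contributes nothing, so the generator gets no signal to push its target confidence beyond 0.9; the unmasked loss keeps rewarding higher confidence.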
Combining $L_{\text{attack}}$ with $L_{\text{cGAN}}$, the loss function of the mainframe conditional GAN, yields the final objective function:
\[ \min_G \max_D \; L_{\text{cGAN}}(G, D) + \alpha L_{\text{attack}}(G(z, c)) \tag{3.3} \]
where α serves as the controlling parameter that balances the influence of the adversarial loss and the attacking loss. The usual optimization procedure for GANs can then be used to train the generator to meet both objectives: generating images of the specified class while having them misclassified as the target class.
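From the generator's side, the objective in Equation (3.3) is a weighted sum of two scalars. The sketch below assumes the generator term of the original minimax GAN loss, the mean of log(1 − D(x′)), as a stand-in for the generator part of the conditional GAN loss; any auxiliary class terms of the actual model are omitted, and the function name and the value of α are hypothetical.

```python
import math


def generator_objective(d_fake, attack_loss_value, alpha):
    """Value the generator minimizes: GAN term plus alpha times attack loss.

    d_fake: discriminator outputs D(x') in (0, 1) for generated images.
    attack_loss_value: the attacking loss already computed on the same batch.
    Assumption: the GAN term is the minimax generator loss mean(log(1 - D(x'))).
    """
    eps = 1e-12  # guards against log(0) when D(x') reaches 1
    gan_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return gan_term + alpha * attack_loss_value


# Example: an undecided discriminator and an attack loss of 2.0.
value = generator_objective(d_fake=[0.5, 0.5], attack_loss_value=2.0, alpha=0.1)
```

A larger α shifts the generator toward fooling the target classifier at the expense of realism, which is the trade-off the controlling parameter is meant to balance.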
3.4 Turning Pre-trained Generator into Adversarial-example Generator
Instead of training from scratch, however, the “adapting procedure” can be applied directly to pre-trained class-conditional GANs. We found that this procedure takes only a few epochs to complete, since slight differences in the output pixels are enough to cause the desired misclassification.
We call this adaptive retraining: it provides a way to quickly turn a pre-trained model into an adversarial-example generator within a few epochs of retraining, which results in only slight changes to the output images given the same class condition and noise input.
In addition, because the changes are limited, both the generative power of the original class-conditional GAN and the visual appearance of the images it produces are preserved.
Note that there are situations in which the generated adversarial examples are not as natural and convincing as real-world images. We argue that this is caused by the GAN's insufficient ability to model the dataset, since the retraining has only a limited impact on the output.
A more advanced GAN model design should be able to address this problem.
Chapter 4
Experimental results
In this chapter, we first compare the effect of using different attacking loss functions in the training process. We then apply our method to two public adversarial attack challenges.
For experimental results that are hard to evaluate quantitatively, we provide images for human assessment. Our code and models will be available upon publication.
4.1 The Effect of Masked Loss Function
We first discuss the effect of adding an indicator function when calculating the attacking loss. The indicator function serves as a mask that retains only the loss of unsuccessful attacks (Equation 3.1), instead of the overall classification loss (Equation 3.2). We claimed that without masking, the generated adversarial examples would overfit the attack target, ending up in a shape between the original class and the target class.
To illustrate this, we sampled “1” and “7” images targeted to be classified as “2” from models using Equation 3.2 and Equation 3.1 as the attacking loss, respectively.
As can be seen in Fig. 4.1a, the “1” and “7” images from the model trained with the former grow spots at the head and tail, which makes them look more like the attack target “2”. Adopting the masked loss, on the other hand, prevents such overfitting, as shown in Fig. 4.1b.
(a) From models trained with naive classification loss
(b) From models trained with masked loss
Figure 4.1: Comparison of images sampled from models whose attacking loss is calculated using Equation 3.2 and Equation 3.1, respectively.
4.2 Adversarial Attack on MNIST
We then applied our attack strategy to the MNIST Challenge [17], a public challenge proposed by [18] specifically for the MNIST dataset [14]. The challenge offers a robust classifier trained with adversarial training against a PGD adversary [18] and calls for adversarial attacks that achieve the highest attack success rate. Attackers participating in the challenge are allowed to perturb each pixel of the target images by at most ε = 0.3 to prevent image distortion.
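The per-pixel limit can be read as a bound on the max-norm (L∞) distance between the original and the perturbed image. A minimal sketch of that check follows, assuming pixel intensities scaled to [0, 1] and images flattened to lists; the function names are illustrative and not part of the challenge code.

```python
def linf_distance(x, x_adv):
    """Largest absolute per-pixel difference between two flattened images."""
    if len(x) != len(x_adv):
        raise ValueError("images must have the same number of pixels")
    return max(abs(a - b) for a, b in zip(x, x_adv))


def within_budget(x, x_adv, epsilon=0.3, tol=1e-12):
    """True if every pixel of x_adv lies within epsilon of the original x.

    tol absorbs floating-point rounding at the boundary of the budget.
    """
    return linf_distance(x, x_adv) <= epsilon + tol


original = [0.0, 0.5, 1.0]
allowed = within_budget(original, [0.3, 0.4, 0.8])     # largest change is 0.3
rejected = within_budget(original, [0.31, 0.5, 1.0])   # one pixel moves by 0.31
```

A generated image has no original counterpart to compare against, which is why such a check cannot be applied to it in a meaningful way.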
Clearly, non-perturbation-based methods such as Adaptive GAN can hardly follow this max-norm constraint and, in fact, are not meant to follow such a constraint by design. This makes a direct comparison between the two kinds of approaches somewhat unfair. However, for lack of a direct evaluation metric, we still chose to apply our attack strategy to the challenge to see its effectiveness against a robustly trained network. The performance of other perturbation-based methods on the leaderboard is also listed for benchmarking (converted from accuracy to attack success rate).
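The conversion from leaderboard accuracy to attack success rate is not spelled out; the sketch below assumes the natural reading that every misclassified example counts as a successful attack, so the two figures sum to 100%. The accuracy used in the example is back-computed from Table 4.1 under that assumption, not quoted from the leaderboard.

```python
def attack_success_rate(accuracy_percent):
    """Attack success rate in percent, assuming success = 100% - accuracy."""
    if not 0.0 <= accuracy_percent <= 100.0:
        raise ValueError("accuracy must be a percentage between 0 and 100")
    return round(100.0 - accuracy_percent, 2)


# An accuracy of 89.62% under attack corresponds to a 10.38% success rate.
rate = attack_success_rate(89.62)
```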
From Table 4.1, we can see that Adaptive GAN achieves the highest attack success rate against the target classifier. We consider this a reasonable result, since there is no maximum-perturbation constraint on non-perturbation-based methods. Yet the result still supports that our approach is effective in creating adversarial examples against robust classifiers.
Table 4.1: Attack success rate of the target classifier (MNIST Challenge)
Attack | Attack success rate
100-step PGD on the cross-entropy loss with 50 random restarts | 10.38%
100-step PGD on the CW loss with 50 random restarts | 10.29%
100-step PGD on the cross-entropy loss | 7.48%
100-step PGD on the CW loss | 6.96%
FGSM on the cross-entropy loss | 3.64%
FGSM on the CW loss | 3.60%
*Adaptive GAN | 99.90%
Figure 4.2: Adversarial examples generated by Adaptive GAN on MNIST, with each row conditioned on a different class label and each column conditioned on a different attack target.
4.3 Adversarial Attack on CIFAR10
Next, we applied our attack strategy to the CIFAR10 Challenge [16], a public challenge proposed by [18] for the CIFAR10 dataset [12]. The challenge provides a robust classifier trained with adversarial training against a PGD adversary [18] and invites adversaries to submit adversarial examples against the classifier.
Instead of training an adversarial-example generator from scratch, this time we used adaptive retraining on top of a pre-trained model. Starting from a pre-trained model shortens the overall training time, since building GANs on CIFAR10 is more time-consuming. We can also take the images generated by the pre-trained model as ground truth in some sense, to verify the claim that the retraining does not cause too much change to the original conditional GAN's output. A pair of images sampled from the original and retrained models is shown in Figure 4.3.
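One simple way to put a number on "too much change" is to feed the same class condition and noise to both generators and measure the average pixel difference of the two outputs. The sketch below assumes this metric, which the text does not name, and uses tiny flattened images with pixel values in [0, 1]; the function name is hypothetical.

```python
def mean_abs_change(before, after):
    """Mean absolute per-pixel difference between two equally sized images."""
    if len(before) != len(after):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(before, after)) / len(before)


# Outputs of the pre-trained and the retrained generator for the same (c, z).
original = [0.0, 0.5, 1.0, 0.25]
retrained = [0.0, 0.5, 0.75, 0.5]
drift = mean_abs_change(original, retrained)
```

A small drift on matched inputs would support the claim that retraining preserves the appearance of the original samples; visual comparison as in Figure 4.3 remains the primary evidence.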
Table 4.2 shows the attack success rate against the target classifier under different attacks, including Adaptive GAN and the perturbation-based methods on the leaderboard (converted from accuracy to attack success rate). We can see that Adaptive GAN outperforms the others. Again, we have to emphasize that, unlike perturbation-based strategies, Adaptive GAN by nature cannot limit the size of the perturbation, so it is not surprising that it achieves a higher attack success rate than other methods. Yet the result still suggests that our method is competitive with other approaches.
As far as image quality is concerned, we have to admit that some of the generated adversarial examples are not as convincing as those generated by perturbation-based methods. We think this is a practical issue for all generative models, since we only provide a retraining framework that causes little change to the output of the pre-trained generative model. To the best of our effort, we constructed our model based on ACGAN [21] and DCGAN [24] together with state-of-the-art techniques including batch normalization [10], leaky ReLU activations, label smoothing [26] and minibatch discrimination [26] in this experiment. We think improving the overall quality of images generated by GANs is beyond the scope of this work and awaits further research.
Table 4.2: Attack success rate of the target classifier (CIFAR10 Challenge)
Attack | Attack success rate
20-step PGD on the cross-entropy loss | 52.94%
20-step PGD on the CW loss | 52.24%
FGSM on the CW loss | 45.08%
Figure 4.3: Adversarial examples sampled from the process of adaptive retraining on the CIFAR10 Challenge, where the accuracy has been reduced by 12% after two epochs of retraining.
4.4 Cross-domain Attacks
Adaptive GAN can also be applied to cross-domain attacks, for example Logo Attacks or Traffic Sign Attacks [27]. We take a Traffic Sign Attack scenario as an example and plot our attack images in Fig. 4.4.
In this attack scenario, we want to mislead an autonomous vehicle by forging non-traffic-sign images that the vehicle interprets as traffic signs, thereby causing the vehicle to misbehave. We applied adaptive retraining to a GAN pre-trained on the CIFAR-10 dataset, using a classifier trained to classify traffic signs. After the retraining process, our generator is capable of generating images of vehicles or animals that are classified as traffic signs. Fig. 4.4 presents images of a car and a horse, both of which are interpreted as a “watch for pedestrians” sign by the classifier that serves as the simulated computer vision system of an autonomous vehicle in our experiment.
Figure 4.4: Cross-domain adversarial examples. An image of a horse and an image of a car, generated with Adaptive GAN, that are misinterpreted as a “watch for pedestrians” sign by a classifier on an autonomous vehicle.
Chapter 5 Conclusion
In this work, we study the perturbation-based nature of most adversarial-example generating methods, which create examples mainly by perturbing existing data and thereby limit the data space in some manner. We then complement these perturbation-based strategies by proposing Adaptive GAN, which directly produces examples with a modified conditional-GAN framework. Our method expands the adversarial-example subspace, since images no longer have to take the appearance of existing data. We experimented on MNIST and CIFAR10 to verify this idea.
A clear downside of Adaptive GAN is that it needs to be trained separately for each target and retrained every time the defender modifies its model. However, we have shown in experiments that our method can be applied to pre-trained GANs, turning them into adversarial-example generators. A retraining process that slightly modifies the pre-trained model takes only a few epochs, which means that training time would not limit our method.
We think the practical limitation of Adaptive GAN is that the perceptual quality of the generated adversarial examples depends heavily on the generative power of its mainframe GAN model. A joint effort across several GAN training techniques is required to generate adversarial examples of acceptable quality, and the quality may in fact not be as good as that of examples generated by traditional perturbation-based methods. In general, we still think Adaptive GAN is a promising alternative to current attack strategies, since it provides a generative framework for creating native adversarial examples, which other methods cannot provide.
Bibliography
[1] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387–402. Springer, 2013.
[2] B. Biggio, G. Fumera, and F. Roli. Security evaluation of pattern classifiers under attack. IEEE transactions on knowledge and data engineering, 26(4):984–996, 2014.
[3] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.
[4] J. Bradshaw, A. G. d. G. Matthews, and Z. Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476, 2017.
[5] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.
[6] C. Donahue, J. McAuley, and M. Puckette. Synthesizing audio with generative adversarial networks. arXiv preprint arXiv:1802.04208, 2018.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[8] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[9] C. Guo, M. Rana, M. Cissé, and L. van der Maaten. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117, 2017.
[10] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with con-
[14] Y. LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[15] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, M. E. Houle, G. Schoenebeck, D. Song, and J. Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613, 2018.
[16] A. Madry, A. Makelov, Schmidt, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. CIFAR10 Adversarial Examples Challenge. https://github.com/MadryLab/cifar10_challenge. Accessed: 2018-05-06.
[17] A. Madry, A. Makelov, Schmidt, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. MNIST Adversarial Examples Challenge. https://github.com/MadryLab/mnist_challenge. Accessed: 2018-05-06.
[18] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[19] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[20] S. M. Moosavi Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), number EPFL-CONF-218057, 2016.
[21] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585, 2016.
[22] A. Osokin, A. Chessel, R. E. C. Salas, and F. Vaggi. Gans for biological image synthesis. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2252–2261. IEEE, 2017.
[23] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.
[24] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[25] S. Rajeswar, S. Subramanian, F. Dutil, C. Pal, and A. Courville. Adversarial generation of natural language. arXiv preprint arXiv:1705.10929, 2017.
[26] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[27] C. Sitawarin, A. N. Bhagoji, A. Mosenia, M. Chiang, and P. Mittal. Darts: Deceiving autonomous cars with toxic signs. arXiv preprint arXiv:1802.06430, 2018.
[28] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[29] C. Xiao, B. Li, J.-Y. Zhu, W. He, M. Liu, and D. Song. Generating adversarial examples with adversarial networks. arXiv preprint arXiv:1801.02610, 2018.
[30] W. Xu, D. Evans, and Y. Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
[31] Z. Zhao, D. Dua, and S. Singh. Generating natural adversarial examples. arXiv preprint arXiv:1710.11342, 2017.
[32] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.