Security for Deep Neural Networks
Cho-Jui Hsieh
Depts. of Computer Science and Statistics, UC Davis
NTU Machine Learning Symposium, Dec 23, 2017
Joint work with Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Yash Sharma, Hongge Chen, Xuanqing Liu, Tsui-Wei Weng, Minhao Cheng
Outline
Attack algorithms
  Review of recent works
  ZOO: Zeroth Order Optimization for black-box attacks
Defensive algorithms
  Random Self-Ensemble (RSE) for defense
Adversarial Examples
Given:
  Targeted model f(·)
  Targeted example x_0 (correctly classified)
Adversarial example:
  Another example x that is close to x_0 but f(x) ≠ f(x_0)
Examples: classification, image captioning
Can this happen in real world?
Stop sign mis-classified as speed limit:
“Robust Physical-World Attacks on Deep Learning Models”, 2017.
Real world adversarial turtle:
“Synthesizing Robust Adversarial Examples”, 2017.
(see https://youtu.be/qPxlhGSG0tc)
Attack ML models
Threat model
Given:
  Targeted model f(·)
  Targeted example x_0 (correctly classified by f, so f(x_0) = y_0)
The attacker constructs an adversarial example x that is close to x_0.
Two types of attacks:
  Untargeted attack: successful if f(x) ≠ y_0 (the true label)
  Targeted attack: successful if f(x) = t (t: the target prediction)
Ideally, attackers want to minimize ‖x − x_0‖
⇒ small distortion, hard for a human to detect
White Box Setting
Assume the attacker has full knowledge of both the network structure and the weights of the neural network (i.e., knows f_W(·)).
Based on this, the attacker can compute ∇_x f_W(x) using back-propagation.
The gradient measures the sensitivity of the prediction w.r.t. the input.
First attack algorithm: Gradient Sign Method
Proposed by Goodfellow et al. (2015):

  x = x_0 + \epsilon \cdot \text{sign}(\nabla_x \, \text{loss}(f_w(x_0), y_0))

Why? The aim is to maximize loss(f_w(x), y_0).
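A minimal PyTorch sketch of this update (PyTorch and the names `fgsm_attack` and `eps` are my own choices here, not from the slides; `model` is assumed to map a batch of images in [0, 1] to logits):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x0, y0, eps=0.1):
    """Untargeted FGSM: perturb x0 in the direction that increases the loss."""
    x = x0.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y0)     # loss(f_w(x), y_0)
    loss.backward()                          # gradient via back-propagation
    x_adv = x0 + eps * x.grad.sign()         # x = x_0 + eps * sign(gradient)
    return x_adv.clamp(0.0, 1.0).detach()    # keep pixel values in [0, 1]
```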
Formulate attack as an optimization problem
Input image: x_0 ∈ R^p; adversarial image: x ∈ R^p.

  \min_x \; \|x - x_0\|_2^2 + c \cdot g(x) \quad \text{subject to } x \in [0, 1]^p

  The first term, ‖x − x_0‖_2^2, measures the L2 distortion.
  g(x) is a loss measuring how successful the attack is.
  c ≥ 0 controls the trade-off.
Carlini & Wagner’s (C&W, 2017) Attack
Carlini & Wagner propose to use the following loss:

  targeted:    g(x, t) = \max\{\max_{i \neq t} [Z(x)]_i - [Z(x)]_t, \; 0\}
  untargeted:  g(x)    = \max\{[Z(x)]_y - \max_{i \neq y} [Z(x)]_i, \; 0\}

where Z(x) ∈ R^K is the logit-layer output (or the probability output) and y is the true label.
Minimize by gradient descent.
Almost 100% success rate on CNNs.
Can be generalized to other problems by defining a suitable loss function.
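A rough PyTorch sketch of the targeted loss and the overall objective for a single example (names are illustrative; the full C&W attack additionally uses a change of variables to handle the box constraint, which is omitted here):

```python
import torch

def cw_targeted_loss(logits, t):
    """g(x, t) = max{ max_{i != t} [Z(x)]_i - [Z(x)]_t , 0 } for one example."""
    mask = torch.ones_like(logits, dtype=torch.bool)
    mask[t] = False
    best_other = logits[mask].max()                 # max_{i != t} [Z(x)]_i
    return torch.clamp(best_other - logits[t], min=0.0)

def cw_objective(x, x0, logits, t, c=1.0):
    """||x - x0||_2^2 + c * g(x, t); with logits = model(x), this can be
    minimized over x by gradient descent."""
    return ((x - x0) ** 2).sum() + c * cw_targeted_loss(logits, t)
```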
Example: Attack Image Captioning
Image captioning: usually a CNN + RNN model
(figure from Show-and-Tell, 2014)
Example: Attack Image Captioning
Targeted attack: output a specified sentence.
Keywords attack: output contains specified keywords.
Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning.
By Chen, Zhang, Chen, Yi, Hsieh, 2017.
Success Rate and Distortion
Github link:
https://github.com/huanzhang12/ImageCaptioningAttack
Black Box Attacks by Zeroth Order Optimization
(AISec ’17)
Black Box Attacks
In practice, the deep network structure/parameters are not revealed to attackers
Black-box setting:
Deep network is not revealed to attackers.
Therefore, back-propagation is not available and the attacker cannot compute ∇_x f_W(x).
But attackers can make queries and get the logit-layer output (probability output).
Attack settings: White and Black Box
Existing method for black-box attack
The only existing method: the Substitute Model (Papernot et al., 2016).
Main idea (a minimal sketch follows below):
  Make a large number of queries {x_i}_{i=1}^N and collect the corresponding outputs y_i = f_w(x_i), i = 1, ..., N.
  Train a substitute model f̂ to fit the input-output pairs: f̂(x_i) ≈ y_i for all i = 1, ..., N.
  Generate an adversarial example x by a white-box attack on f̂ (the gradient of f̂ can be computed).
  Use x to attack the original model f.
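A minimal sketch of the substitute-model idea, assuming a `query_black_box` function that returns class probabilities for a batch of query images (all names here are illustrative, not from the original code):

```python
import torch
import torch.nn.functional as F

def train_substitute(query_black_box, substitute, queries, epochs=10, lr=1e-3):
    """Fit a substitute model f_hat to (x_i, y_i) pairs obtained by querying
    the black-box model (assumed to return class probabilities)."""
    with torch.no_grad():
        targets = query_black_box(queries)                # y_i = f_w(x_i)
    opt = torch.optim.Adam(substitute.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        log_probs = F.log_softmax(substitute(queries), dim=1)
        loss = F.kl_div(log_probs, targets, reduction="batchmean")  # f_hat(x_i) ~ y_i
        loss.backward()
        opt.step()
    return substitute  # then run a white-box attack (e.g. FGSM / C&W) against it
```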
Weaknesses of substitute model
It is hard to have f̂ = f because:
  The network structure is unknown.
  Even if the network structure is known, deep learning is highly non-convex,
  ⇒ SGD never converges to the same solution.
The success rate is very low even when the network structure is known.
Table: MNIST and CIFAR-10 Attack Comparison

MNIST
                                      Untargeted                                          Targeted
                                      Success Rate  Avg. L2  Avg. Time (per attack)       Success Rate  Avg. L2  Avg. Time (per attack)
White-box (C&W)                       100%          1.48066  0.48 min                     100%          2.00661  0.53 min
Black-box (Substitute Model + FGSM)   40.6%         -        0.002 sec (+ 6.16 min)       7.48%         -        0.002 sec (+ 6.16 min)
Black-box (Substitute Model + C&W)    33.3%         3.6111   0.76 min (+ 6.16 min)        26.74%        5.272    0.80 min (+ 6.16 min)

CIFAR-10
                                      Untargeted                                          Targeted
                                      Success Rate  Avg. L2  Avg. Time (per attack)       Success Rate  Avg. L2  Avg. Time (per attack)
White-box (C&W)                       100%          0.17980  0.20 min                     100%          0.37974  0.16 min
Black-box (Substitute Model + FGSM)   76.1%         -        0.005 sec (+ 7.81 min)       11.48%        -        0.005 sec (+ 7.81 min)
Black-box (Substitute Model + C&W)    25.3%         2.9708   0.47 min (+ 7.81 min)        5.3%          5.7439   0.49 min (+ 7.81 min)
ZOO (Our proposed black box attack)
Input image: x_0; adversarial image: x; target class label: t.
Define the following optimization problem (the same formulation as C&W, 2017):

  \min_x \; \|x - x_0\|_2^2 + c \cdot f(x, t) \quad \text{subject to } x \in [0, 1]^p

The loss function:

  f(x, t) = \max\{\max_{i \neq t} \log [F(x)]_i - \log [F(x)]_t, \; 0\}

where F(x) ∈ R^K is the black-box output (probabilities).
We cannot run gradient descent, since we cannot compute ∇_x F(x).
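A small NumPy sketch of this loss and objective for a single example, assuming `probs` is the black-box probability vector F(x) (function names are illustrative):

```python
import numpy as np

def zoo_targeted_loss(probs, t):
    """f(x, t) = max{ max_{i != t} log F(x)_i - log F(x)_t , 0 },
    computed from the black-box probability output only."""
    log_p = np.log(probs + 1e-30)            # avoid log(0)
    best_other = np.delete(log_p, t).max()   # max_{i != t} log F(x)_i
    return max(best_other - log_p[t], 0.0)

def zoo_objective(x, x0, probs, t, c=1.0):
    """||x - x0||_2^2 + c * f(x, t); only function values, no gradients."""
    return np.sum((x - x0) ** 2) + c * zoo_targeted_loss(probs, t)
```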
Zeroth Order Optimization
Access to F (x ) only, no ∇F (x ) available.
Estimate the gradient ĝ_i using the symmetric difference quotient:

  \hat{g}_i \approx \frac{\partial F(x)}{\partial x_i} \approx \frac{F(x + h e_i) - F(x - h e_i)}{2h}

The Hessian ĥ_i can also be estimated with only one additional query:

  \hat{h}_i \approx \frac{\partial^2 F(x)}{\partial x_i^2} \approx \frac{F(x + h e_i) - 2F(x) + F(x - h e_i)}{h^2}
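A minimal sketch of these two finite-difference estimates, assuming `F` is the scalar attack objective and `h` is a small step size (the value 1e-4 is only a plausible choice):

```python
import numpy as np

def estimate_grad_hess(F, x, i, h=1e-4):
    """Symmetric-difference estimates of the i-th partial derivative and the
    i-th diagonal Hessian entry of a scalar black-box function F."""
    e_i = np.zeros_like(x)
    e_i[i] = 1.0
    f_plus, f_minus, f_0 = F(x + h * e_i), F(x - h * e_i), F(x)
    g_i = (f_plus - f_minus) / (2.0 * h)           # gradient estimate
    h_i = (f_plus - 2.0 * f_0 + f_minus) / h ** 2  # Hessian estimate
    return g_i, h_i
```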
Challenges
Large number of variables (pixels) for large images ⇒ very expensive to estimate the full gradient!
For an ImageNet-sized image the resolution is 299 × 299 × 3 = 268,203 pixels; estimating the full gradient (2 function evaluations per coordinate) therefore needs 536,406 evaluations just to run one iteration of gradient descent.
Black-box attack by Coordinate Descent
Algorithm 1: Stochastic Coordinate Descent
  1: while not converged do
  2:   Randomly pick a batch of coordinates B ⊆ {1, ..., p}
  3:   For all i ∈ B, update
         x_i ← x_i − η ĝ_i / ĥ_i
       where
         ĝ_i = (f(x + h e_i) − f(x − h e_i)) / (2h)
         ĥ_i = (f(x + h e_i) − 2 f(x) + f(x − h e_i)) / h²
  4: end while

Update without estimating the gradient of all pixels.
In practice we choose |B| = 128 (with one GPU).
ZOO-Adam: replace the inner update by Adam (a sketch of the basic loop follows).
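A NumPy sketch of the basic loop, assuming `F` maps a flattened image in [0, 1]^p to the scalar attack objective; falling back to a plain gradient step when the curvature estimate is non-positive is one possible choice here, not necessarily what the original implementation does:

```python
import numpy as np

def zoo_coordinate_descent(F, x0, batch=128, eta=0.01, h=1e-4, iters=1000):
    """Stochastic coordinate descent with zeroth-order gradient/Hessian
    estimates; x0 is the original image flattened to a vector in [0, 1]^p."""
    x = x0.astype(np.float64).copy()
    p = x.size
    for _ in range(iters):
        coords = np.random.choice(p, size=min(batch, p), replace=False)
        f_0 = F(x)
        steps = {}
        for i in coords:
            e_i = np.zeros(p)
            e_i[i] = 1.0
            f_plus, f_minus = F(x + h * e_i), F(x - h * e_i)
            g_i = (f_plus - f_minus) / (2.0 * h)            # gradient estimate
            h_i = (f_plus - 2.0 * f_0 + f_minus) / h ** 2   # Hessian estimate
            # Newton-like step when curvature is positive, else gradient step
            steps[i] = eta * g_i / h_i if h_i > 0 else eta * g_i
        for i, step in steps.items():
            x[i] = np.clip(x[i] - step, 0.0, 1.0)           # stay in [0, 1]^p
    return x
```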
Attack-space Dimension Reduction
The attack space is the image space in which we search for the adversarial noise.
Instead of searching in R^p, we can search in a smaller space (with fewer pixels) using dimension-reduction techniques.
This greatly reduces the number of variables to optimize and makes the ZOO attack practical for large images.
Attack-space Dimension Reduction
For images, downscaling is the easiest way.
Only the noise is down-scaled; no changes are made to the input image (a small sketch follows).
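A small sketch of evaluating the objective with the noise parametrized in a reduced attack space; `scipy.ndimage.zoom` is just one convenient resizing routine, not necessarily the one used in the original code:

```python
import numpy as np
from scipy.ndimage import zoom

def objective_in_reduced_space(F, x0, small_noise):
    """Evaluate the attack objective F with noise defined in a reduced attack
    space (e.g. 32 x 32 x 3) and up-scaled to the image size; only the noise
    is resized, the input image x0 itself is left unchanged."""
    h, w, _ = x0.shape
    sh, sw, _ = small_noise.shape
    noise = zoom(small_noise, (h / sh, w / sw, 1), order=1)  # up-scale the noise
    return F(np.clip(x0 + noise, 0.0, 1.0))
```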
Hierarchical Attack
Gradually increase the dimension of the attack space after some iterations:
32 × 32 → 64 × 64 → 128 × 128
Figure: adversarial noise visualized in the 32 × 32, 64 × 64, and 128 × 128 attack spaces (x/y pixel coordinates, shared color scale roughly from -0.3 to 0.2).
Targeted Attack on MNIST
(a) Sampled original images; (b) White-box C&W; (c) Black-box ZOO-ADAM; (d) Black-box ZOO-Newton.
Figure: Visual comparison of successful adversarial examples on MNIST. Each row displays crafted adversarial examples from the sampled images in (a). Each column in (b) to (d) indexes the targeted class of the attack.
Targeted Attack on CIFAR-10
(a) Sampled original images; (b) White-box C&W; (c) Black-box ZOO-ADAM; (d) Black-box ZOO-Newton.
Figure: Visual comparison of successful adversarial examples on CIFAR-10. Each row displays crafted adversarial examples from the sampled images in (a). Each column in (b) to (d) indexes the targeted class of the attack.
Targeted Attack on MNIST & CIFAR10
The success rate is close to that of the white-box (C&W) attack, very close to 100%; the L2 distortions are also very similar, and the attack time is reasonable.
Table: MNIST and CIFAR-10 Attack Comparison

MNIST
                                      Untargeted                                          Targeted
                                      Success Rate  Avg. L2   Avg. Time (per attack)      Success Rate  Avg. L2    Avg. Time (per attack)
White-box (C&W)                       100%          1.48066   0.48 min                    100%          2.00661    0.53 min
Black-box (Substitute Model + FGSM)   40.6%         -         0.002 sec (+ 6.16 min)      7.48%         -          0.002 sec (+ 6.16 min)
Black-box (Substitute Model + C&W)    33.3%         3.6111    0.76 min (+ 6.16 min)       26.74%        5.272      0.80 min (+ 6.16 min)
Proposed black-box (ZOO-ADAM)         100%          1.49550   1.38 min                    98.9%         1.987068   1.62 min
Proposed black-box (ZOO-Newton)       100%          1.51502   2.75 min                    98.9%         2.057264   2.06 min

CIFAR-10
                                      Untargeted                                          Targeted
                                      Success Rate  Avg. L2   Avg. Time (per attack)      Success Rate  Avg. L2    Avg. Time (per attack)
White-box (C&W)                       100%          0.17980   0.20 min                    100%          0.37974    0.16 min
Black-box (Substitute Model + FGSM)   76.1%         -         0.005 sec (+ 7.81 min)      11.48%        -          0.005 sec (+ 7.81 min)
Black-box (Substitute Model + C&W)    25.3%         2.9708    0.47 min (+ 7.81 min)       5.3%          5.7439     0.49 min (+ 7.81 min)
Proposed black-box (ZOO-ADAM)         100%          0.19973   3.43 min                    96.8%         0.39879    3.95 min
Proposed black-box (ZOO-Newton)       100%          0.23554   4.41 min                    97.0%         0.54226    4.40 min
Untargeted Attack on Inception-v3
We run black-box attacks on 150 ImageNet test images (size 299 × 299 × 3).
Each attack is limited to 2,000 iterations (within 20 minutes).
Table: Untargeted ImageNet Attack Comparison

                                 Success Rate   Avg. L2
White-box (C&W)                  100%           0.37310
Proposed black-box (ZOO-ADAM)    88.9%          1.19916
Untargeted Attack on Inception-v3
Figure:ImageNet untargeted attack examples
Targeted Attack on Inception-v3
This hard targeted attack needs 20,000 iterations (about 4 hours).
Defense by Random Self Ensemble
(Arxiv ’17)
Defense from Adversarial Attacks
Adversarial retraining (Goodfellow et al., 2015):
  Add adversarial examples into the training set (see the sketch after this list).
Robust optimization + BReLU (Zantedeschi et al., 2017):

  \min_w \; \mathbb{E}_{(x,y)\sim D} \, \mathbb{E}_{\Delta x \sim N(0,\sigma^2)} \, \text{loss}(f_w(x + \Delta x), y)

Security community: detect adversarial examples
  MagNet (Meng and Chen, 2017), (Li and Li, 2017), ...
However, these defenses are still vulnerable if we choose a better way to attack (Carlini and Wagner, 2017, "MagNet and 'Efficient Defenses Against Adversarial Attacks' are Not Robust to Adversarial Examples").
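A minimal PyTorch sketch of one adversarial-retraining step using FGSM-perturbed copies of the batch (the 50/50 weighting of the clean and adversarial losses is an illustrative choice, not from the cited papers):

```python
import torch
import torch.nn.functional as F

def adversarial_retraining_step(model, opt, x, y, eps=0.05):
    """One training step on a batch augmented with FGSM adversarial examples
    (the 'adding adversarial examples into the training set' idea)."""
    # Craft adversarial copies of the batch with one FGSM step.
    x_pert = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_pert), y).backward()
    x_adv = (x + eps * x_pert.grad.sign()).clamp(0.0, 1.0).detach()
    # Train on both the clean and the adversarial batch.
    opt.zero_grad()
    loss = 0.5 * (F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y))
    loss.backward()
    opt.step()
    return loss.item()
```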
Adding randomness to the model
All of the attacks are based on gradient computation.
Idea I: add randomness to fool the attacker:
  f(x) → f_ε(x), where ε is random.
This also resists black-box attacks, since the gradient cannot be estimated (see the sketch below):

  \frac{f_{\epsilon_1}(x + h e_i) - f_{\epsilon_2}(x - h e_i)}{2h} \;\not\approx\; \nabla_i f(x)

Idea II: an ensemble can improve the robustness of the model: ensemble T models f_{\epsilon_1}(x), ..., f_{\epsilon_T}(x).
Our approach: Random + Ensemble.
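A tiny sketch of why the finite-difference estimate breaks down, assuming a hypothetical randomized model `f(x, eps)` that takes its noise draw explicitly: the two queries see independent draws, so the quotient is dominated by the noise difference divided by 2h rather than by the true derivative:

```python
import numpy as np

def randomized_difference_quotient(f, x, i, sigma=0.1, h=1e-4):
    """Symmetric difference quotient of a randomized model f(x, eps); each of
    the two queries gets an independent noise draw, so the result is mostly
    (f_eps1 - f_eps2) / (2h) noise instead of the true gradient."""
    e_i = np.zeros_like(x)
    e_i[i] = 1.0
    eps1 = sigma * np.random.randn(*x.shape)
    eps2 = sigma * np.random.randn(*x.shape)
    return (f(x + h * e_i, eps1) - f(x - h * e_i, eps2)) / (2.0 * h)
```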
Adding randomness to the model
Naive approach: adding randomness only in the testing phase drops prediction accuracy from 87% to 20% (CIFAR-10 with VGG).
Our approach (RSE)
Prediction: ensemble of random models

  p = \sum_{j=1}^{T} f_{\epsilon_j}(w, x), \qquad \hat{y} = \arg\max_k \; p_k

Training: minimize an upper bound

  \mathbb{E}_{(x,y)\sim D} \, \mathbb{E}_{\epsilon\sim N} \, \text{loss}(f_\epsilon(x), y) \;\geq\; \mathbb{E}_{(x,y)\sim D} \, \text{loss}\big(\mathbb{E}_{\epsilon\sim N} f_\epsilon(x), y\big)

Training by SGD: sample (x, y) and ε, then update

  w \leftarrow w - \eta \, \nabla_w \, \text{loss}(f_\epsilon(x), y)

This implicitly regularizes the Lipschitz constant:

  \mathbb{E}_{\epsilon\sim N(0,\sigma^2)} \, \text{loss}(f_\epsilon(w, x_i), y_i) \;\approx\; \text{loss}(f_0(w, x_i), y_i) + \frac{\sigma^2}{2} L_{\ell \circ f_0}
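A rough PyTorch sketch of an RSE-style noise layer and the ensemble prediction rule; it uses a single σ per layer and ignores the distinction between σ_init and σ_inner, and all names are illustrative:

```python
import torch
import torch.nn as nn

class NoisyConv2d(nn.Module):
    """Convolution preceded by Gaussian noise added to its input; the noise is
    applied at both training and test time (sketch of an RSE-style noise layer)."""
    def __init__(self, in_ch, out_ch, kernel_size, sigma=0.1, **kw):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)
        self.sigma = sigma

    def forward(self, x):
        return self.conv(x + self.sigma * torch.randn_like(x))

def rse_predict(model, x, T=10):
    """Ensemble prediction: sum the probabilities of T random forward passes
    (each pass draws fresh noise) and take the arg max."""
    with torch.no_grad():
        p = sum(torch.softmax(model(x), dim=1) for _ in range(T))
    return p.argmax(dim=1)
```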
Experiments
We fix σ_init = 0.4 and σ_inner = 0.1 for all datasets/networks.
Figure: test accuracy (%) and average L2 distortion as functions of the attack parameter c (log scale, roughly 10^-2 to 10^2) for STL-10 + Model A, CIFAR-10 + VGG16, and CIFAR-10 + ResNeXt, comparing No defense, Adversarial retraining, Robust Opt + BReLU, Distillation, and RSE.
Targeted attack
Original image and targeted adversarial examples (figures).
Conclusions and Discussions
Attacks are easy in both the black-box and white-box settings (also: back-door attacks, one-pixel attacks, ...).
Defense is hard:
  Randomness and ensembles are useful.
  Robust optimization (under different loss functions and distributions).
  Directly regularizing the Lipschitz constant ⇒ not successful; why?
Rethink which models we should use:
  Simple models or complex models? Or some combination of them?
Thank You!
References
[1] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, C.-J. Hsieh. ZOO: Zeroth Order Optimization Based Black-box Attacks to Deep Neural Networks without Training Substitute Models. AISec, 2017.
[2] H. Chen, H. Zhang, P.-Y. Chen, J. Yi, C.-J. Hsieh. Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning. arXiv, 2017.
[3] X. Liu, M. Cheng, H. Zhang, C.-J. Hsieh. Towards Robust Neural Networks via Random Self-ensemble. arXiv, 2017.
[4] P.-Y. Chen, Y. Sharma, H. Zhang, J. Yi, C.-J. Hsieh. EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. AAAI, 2018.
[5] T.-W. Weng, H. Zhang, P.-Y. Chen, D. Su, Y. Gao, J. Yi, C.-J. Hsieh, L. Daniel. Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach. 2017.
[6] N. Carlini, D. Wagner. Towards Evaluating the Robustness of Neural Networks. IEEE S&P, 2017.
[7] I. Goodfellow, J. Shlens, C. Szegedy. Explaining and Harnessing Adversarial Examples. ICLR, 2015.
[8] V. Zantedeschi, M. Nicolae, A. Rawat. Efficient Defenses Against Adversarial Attacks. arXiv, 2017.
[9] D. Meng, H. Chen. MagNet: A Two-Pronged Defense against Adversarial Examples. CCS, 2017.
[10] N. Carlini, D. Wagner. MagNet and "Efficient Defenses Against Adversarial Attacks" are Not Robust to Adversarial Examples. arXiv, 2017.