### Security for Deep Neural Networks

Cho-Jui Hsieh

Depts. of Computer Science and Statistics, UC Davis

NTU Machine Learning Symposium Dec 23, 2017

Joint work with Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Yash Sharma, Hongge Chen, Xuanqing Liu, Tsui-Wei Weng, Minhao Cheng

### Outline

Attack algorithms:

- Review of recent works
- ZOO: Zeroth Order Optimization for black-box attacks

Defensive algorithms:

- Random Self-Ensemble for defense

### Adversarial Examples

Given:

- Targeted model f(·)
- Targeted example x_0 (correctly classified)

Adversarial example:

- Another example x close to x_0 but f(x) ≠ f(x_0)

Examples: classification, image captioning

### Can this happen in real world?

Stop sign mis-classified as speed limit:

“Robust Physical-World Attacks on Deep Learning Models”, 2017.

Real world adversarial turtle:

“Synthesizing Robust Adversarial Examples”, 2017.

(see https://youtu.be/qPxlhGSG0tc)

## Attack ML models

### Threat model

Given:

- Targeted model f(·)
- Targeted example x_0 (correctly classified by f, so f(x_0) = y_0)

The attacker constructs an adversarial example x that is close to x_0.

Two types of attacks:

- Untargeted attack: successful if f(x) ≠ y_0 (true label)
- Targeted attack: successful if f(x) = t (t: target prediction)

Ideally, attackers want to minimize ‖x − x_0‖
⇒ small distortion, hard to detect by humans

### White Box Setting

Assume the attacker has full knowledge of both the network structure and the weights of the neural network (knows f_W(·)).

Based on this, the attacker can compute ∇_x f_W(x) using back-propagation.

The gradient measures the sensitivity of the prediction w.r.t. the input.

### First attack algorithm: Gradient Sign Method

Proposed by (Goodfellow et al., 2015):

x = x_0 + ε · sign(∇_x loss(f_w(x_0), y_0))

Why? Aim to maximize loss(f_w(x), y_0).
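As a concrete sketch (not from the slides), here is FGSM on a toy logistic-regression "network" f_w(x) = sigmoid(w·x); the weights, example, and step size ε below are hypothetical, chosen only to show the one-step sign update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    # cross-entropy loss for one example with label y in {0, 1}
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad_x(w, x, y):
    # d loss / d x = (sigmoid(w.x) - y) * w for logistic regression
    return (sigmoid(w @ x) - y) * w

def fgsm(w, x0, y0, eps):
    # one gradient-sign step that increases the loss at x0,
    # clipped back into the valid pixel range [0, 1]
    return np.clip(x0 + eps * np.sign(grad_x(w, x0, y0)), 0.0, 1.0)

w = np.array([2.0, -3.0, 1.0])        # hypothetical model weights
x0 = np.array([0.6, 0.2, 0.5])        # correctly classified example
y0 = 1
x_adv = fgsm(w, x0, y0, eps=0.1)
assert loss(w, x_adv, y0) >= loss(w, x0, y0)   # the attack raises the loss
```

With deep networks the gradient comes from back-propagation instead of a closed form, but the update is the same single sign step.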

### Formulate attack as an optimization problem

Input image: x_0 ∈ R^p. Adversarial image: x ∈ R^p.

minimize_x  ‖x − x_0‖²_2 + c · g(x)
subject to  x ∈ [0, 1]^p

- ‖x − x_0‖²_2 measures the L2 distortion
- g(x) is a loss measuring how successful the attack is
- c ≥ 0 controls the trade-off

### Carlini & Wagner’s (C&W, 2017) Attack

Carlini & Wagner propose to use the following losses:

targeted:   g(x, t) = max{ max_{i≠t} [Z(x)]_i − [Z(x)]_t, 0 }
untargeted: g(x)    = max{ [Z(x)]_y − max_{i≠y} [Z(x)]_i, 0 }

where Z(x) ∈ R^K is the logit-layer output (or probability output).

- Minimized by gradient descent
- Almost 100% success rate against CNNs
- Can be generalized to other problems by defining a suitable loss function
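The two losses can be sketched in a few lines of numpy, assuming Z(x) is already given as a logit vector:

```python
import numpy as np

def cw_targeted(Z, t):
    # max{ max_{i != t} Z_i - Z_t, 0 }: reaches zero exactly when
    # class t has the largest logit, i.e. the targeted attack succeeds
    others = np.delete(Z, t)
    return max(others.max() - Z[t], 0.0)

def cw_untargeted(Z, y):
    # max{ Z_y - max_{i != y} Z_i, 0 }: reaches zero exactly when
    # the true class y no longer has the largest logit
    others = np.delete(Z, y)
    return max(Z[y] - others.max(), 0.0)

Z = np.array([1.0, 4.0, 2.5])          # hypothetical logits, K = 3
assert cw_targeted(Z, t=1) == 0.0      # class 1 already on top
assert cw_targeted(Z, t=2) == 1.5      # class 2 trails the leader by 1.5
assert cw_untargeted(Z, y=1) == 1.5    # true class 1 still wins by 1.5
```

The hinge at zero means the optimizer stops pushing once the attack succeeds, leaving the distortion term ‖x − x_0‖²_2 to dominate.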

### Example: Attack Image Captioning

Image captioning: usually a CNN + RNN model

(From Show-and-Tell, 2014)

### Example: Attack Image Captioning

Two attack goals:

- Targeted attack: output a specified sentence
- Keywords attack: output contains specified keywords

Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning. By Chen, Zhang, Chen, Yi, Hsieh, 2017.

### Success Rate and Distortion

Github link:

https://github.com/huanzhang12/ImageCaptioningAttack

### Black Box Attacks by Zeroth Order Optimization

(AISec ’17)

### Black Box Attacks

In practice, the deep network structure/parameters are not revealed to attackers.

Black-box setting:

- The deep network is not revealed to attackers.
- Therefore, no back-propagation is available: the attacker can't compute ∇_x f_W(x).
- But attackers can make queries and get the logit-layer output (probability output).

### Attack settings: White and Black Box

### Existing method for black-box attack

The only existing method: Substitute Model (Papernot et al., 2016).

Main idea:

- Make many queries {x_i}_{i=1}^N and collect the corresponding outputs: y_i = f_w(x_i), ∀i = 1, ..., N
- Train a substitute model f̂ to fit the input-output pairs: f̂(x_i) ≈ y_i, ∀i = 1, ..., N
- Generate an adversarial example x by a white-box attack on f̂ (the gradient of f̂ can be computed)
- Use x to attack the original model f
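The pipeline can be sketched on a toy problem; the linear "black box" below is a hypothetical stand-in (real targets are deep networks), and the substitute is fit by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])    # hidden from the attacker

def black_box(x):
    # query-only access: the attacker sees outputs, never w_true
    return w_true @ x

# 1) make N queries and collect input/output pairs
X = rng.normal(size=(100, 3))
y = np.array([black_box(x) for x in X])

# 2) fit a substitute model f_hat = w_hat . x by least squares
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# 3) the substitute's gradient (here simply w_hat) stands in for the
#    true gradient when crafting white-box adversarial examples
assert np.allclose(w_hat, w_true, atol=1e-6)
```

The exact recovery here works only because the substitute's model class matches the black box; for deep networks that rarely holds, which is the weakness discussed next.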

### Weaknesses of substitute model

Hard to have f̂ = f because:

- The network structure is unknown
- Even if the network structure is known, deep learning is highly non-convex ⇒ SGD never converges to the same solution

The success rate is very low even when the network structure is known.

Table: MNIST and CIFAR-10 Attack Comparison

MNIST:

| Method | Untargeted Success Rate | Avg. L2 | Avg. Time (per attack) | Targeted Success Rate | Avg. L2 | Avg. Time (per attack) |
|---|---|---|---|---|---|---|
| White-box (C&W) | 100% | 1.48066 | 0.48 min | 100% | 2.00661 | 0.53 min |
| Black-box (Substitute Model + FGSM) | 40.6% | - | 0.002 sec (+ 6.16 min) | 7.48% | - | 0.002 sec (+ 6.16 min) |
| Black-box (Substitute Model + C&W) | 33.3% | 3.6111 | 0.76 min (+ 6.16 min) | 26.74% | 5.272 | 0.80 min (+ 6.16 min) |

CIFAR-10:

| Method | Untargeted Success Rate | Avg. L2 | Avg. Time (per attack) | Targeted Success Rate | Avg. L2 | Avg. Time (per attack) |
|---|---|---|---|---|---|---|
| White-box (C&W) | 100% | 0.17980 | 0.20 min | 100% | 0.37974 | 0.16 min |
| Black-box (Substitute Model + FGSM) | 76.1% | - | 0.005 sec (+ 7.81 min) | 11.48% | - | 0.005 sec (+ 7.81 min) |
| Black-box (Substitute Model + C&W) | 25.3% | 2.9708 | 0.47 min (+ 7.81 min) | 5.3% | 5.7439 | 0.49 min (+ 7.81 min) |

### ZOO (Our proposed black box attack)

Input image: x_0, adversarial image: x, target class label: t.

Define the following optimization problem:

minimize_x  ‖x − x_0‖²_2 + c · f(x, t)
subject to  x ∈ [0, 1]^p

The same formulation as (C&W, 2017). The loss function:

f(x, t) = max{ max_{i≠t} log[F(x)]_i − log[F(x)]_t, 0 }

where F(x) ∈ R^K is the black-box output (probabilities).

Cannot run gradient descent since we can't compute ∇_x F(x).

### Zeroth Order Optimization

Access to F(x) only; no ∇F(x) available.

Estimate the gradient ĝ_i using the symmetric difference quotient with a small step h:

ĝ_i ≈ ∂F(x)/∂x_i ≈ (F(x + h e_i) − F(x − h e_i)) / (2h)

The Hessian ĥ_i can also be estimated with only one additional query:

ĥ_i ≈ ∂²F(x)/∂x_i² ≈ (F(x + h e_i) − 2F(x) + F(x − h e_i)) / h²
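These two estimators can be sketched directly; a hypothetical smooth scalar function F stands in for the black-box model, and the step h is illustrative:

```python
import numpy as np

def F(x):
    # toy stand-in for the query-only black-box output
    return np.sin(x[0]) + x[1] ** 2

def estimate_grad_hess(F, x, i, h=1e-4):
    e = np.zeros_like(x)
    e[i] = h
    g = (F(x + e) - F(x - e)) / (2 * h)          # ~ dF/dx_i, 2 queries
    H = (F(x + e) - 2 * F(x) + F(x - e)) / h**2  # ~ d2F/dx_i^2, 1 more query
    return g, H

x = np.array([0.3, 0.7])
g, H = estimate_grad_hess(F, x, i=0)
assert abs(g - np.cos(0.3)) < 1e-6   # true dF/dx_0 = cos(x_0)
assert abs(H + np.sin(0.3)) < 1e-3   # true d2F/dx_0^2 = -sin(x_0)
```

Each coordinate costs two extra queries for the gradient, which is exactly why estimating all p coordinates at once is prohibitive for large images.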

### Challenges

Large number of variables (pixels) for large images ⇒ very expensive to estimate the full gradient!

For an ImageNet-quality image, the resolution is 299 × 299 × 3, which needs 536,406 evaluations to estimate its gradient and run one iteration of gradient descent.

### Black-box attack by Coordinate Descent

Algorithm 1: Stochastic Coordinate Descent

1: while not converged do
2:   Randomly pick a batch of coordinates B ⊆ {1, ..., p}
3:   For all i ∈ B:
       x_i ← x_i − η ĝ_i / ĥ_i
     where
       ĝ_i = (f(x + h e_i) − f(x − h e_i)) / (2h)
       ĥ_i = (f(x + h e_i) − 2f(x) + f(x − h e_i)) / h²
4: end while

- Update without estimating the gradient of all pixels
- In practice we choose |B| = 128 (with one GPU)
- ZOO-Adam: replace the inner update with Adam
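A minimal sketch of the loop, using a toy quadratic as the query-only objective; the objective, batch size, and step size are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.uniform(size=16)

def f(x):
    # query-only objective with minimum at `target`, standing in for
    # ||x - x0||^2 + c * loss(x, t)
    return np.sum((x - target) ** 2)

def zoo_step(f, x, batch, h=1e-4, eta=1.0):
    x = x.copy()
    for i in batch:
        e = np.zeros_like(x)
        e[i] = h
        g = (f(x + e) - f(x - e)) / (2 * h)          # coordinate gradient
        H = (f(x + e) - 2 * f(x) + f(x - e)) / h**2  # coordinate curvature
        if H > 0:
            x[i] -= eta * g / H    # Newton step where curvature is positive
        else:
            x[i] -= eta * g        # fall back to a gradient step
    return np.clip(x, 0.0, 1.0)    # stay in the valid pixel box

x = np.full(16, 0.5)
for _ in range(50):
    batch = rng.choice(16, size=4, replace=False)
    x = zoo_step(f, x, batch)
assert f(x) < 1e-3   # the query-only loop reaches the minimizer
```

Only the picked coordinates cost queries per iteration, which is the whole point of the coordinate-wise update.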

### Attack-space Dimension Reduction

The attack space is the image space in which we search for adversarial noise.

Instead of searching in R^p, we can search in a smaller space (with fewer pixels) using dimension reduction techniques.

This greatly reduces the number of pixels to optimize and makes the ZOO attack practical for large images.

### Attack-space Dimension Reduction

For images, downscaling is the easiest way: the noise is optimized at a lower resolution and up-scaled before being added to the image.

Only the noise is down-scaled; no changes are made to the input image.
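A toy sketch of the up-scaling step, assuming nearest-neighbor interpolation (the actual interpolation choice in the attack may differ):

```python
import numpy as np

def upscale(noise, factor):
    # repeat each low-resolution noise pixel factor x factor times,
    # so a small attack space covers the full image
    return np.repeat(np.repeat(noise, factor, axis=0), factor, axis=1)

small = np.arange(4.0).reshape(2, 2)   # 2x2 attack space: [[0,1],[2,3]]
big = upscale(small, 4)                # noise applied to an 8x8 image
assert big.shape == (8, 8)
assert np.all(big[:4, :4] == 0.0) and np.all(big[4:, 4:] == 3.0)
```

The optimizer then only ever queries f(x_0 + upscale(noise)), so the number of coordinates to estimate shrinks from p to the size of the small grid.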

### Hierarchical Attack

Gradually increase the dimension of attack space after some iterations.

32 × 32 → 64 × 64 → 128 × 128

Figure: adversarial noise visualized at attack-space resolutions 32 × 32, 64 × 64, and 128 × 128 (x/y pixel coordinates; color scale roughly −0.3 to 0.2).

### Targeted Attack on MNIST

(a) Original images
(b) White-box C&W
(c) Black-box ZOO-ADAM
(d) Black-box ZOO-Newton

Figure: Visual comparison of successful adversarial examples on MNIST. Each row displays crafted adversarial examples from the sampled images in (a). Each column in (b) to (d) indexes the targeted class for the attack.

### Targeted Attack on CIFAR-10

(a) Original images
(b) White-box C&W
(c) Black-box ZOO-ADAM
(d) Black-box ZOO-Newton

Figure: Visual comparison of successful adversarial examples on CIFAR-10. Each row displays crafted adversarial examples from the sampled images in (a). Each column in (b) to (d) indexes the targeted class for the attack.

### Targeted Attack on MNIST & CIFAR10

Success rates are close to the white-box (C&W) attack, very close to 100%. The L2 distortions are also very similar, and the attack time is reasonable.

Table: MNIST and CIFAR-10 Attack Comparison

MNIST:

| Method | Untargeted Success Rate | Avg. L2 | Avg. Time (per attack) | Targeted Success Rate | Avg. L2 | Avg. Time (per attack) |
|---|---|---|---|---|---|---|
| White-box (C&W) | 100% | 1.48066 | 0.48 min | 100% | 2.00661 | 0.53 min |
| Black-box (Substitute Model + FGSM) | 40.6% | - | 0.002 sec (+ 6.16 min) | 7.48% | - | 0.002 sec (+ 6.16 min) |
| Black-box (Substitute Model + C&W) | 33.3% | 3.6111 | 0.76 min (+ 6.16 min) | 26.74% | 5.272 | 0.80 min (+ 6.16 min) |
| Proposed black-box (ZOO-ADAM) | 100% | 1.49550 | 1.38 min | 98.9% | 1.987068 | 1.62 min |
| Proposed black-box (ZOO-Newton) | 100% | 1.51502 | 2.75 min | 98.9% | 2.057264 | 2.06 min |

CIFAR-10:

| Method | Untargeted Success Rate | Avg. L2 | Avg. Time (per attack) | Targeted Success Rate | Avg. L2 | Avg. Time (per attack) |
|---|---|---|---|---|---|---|
| White-box (C&W) | 100% | 0.17980 | 0.20 min | 100% | 0.37974 | 0.16 min |
| Black-box (Substitute Model + FGSM) | 76.1% | - | 0.005 sec (+ 7.81 min) | 11.48% | - | 0.005 sec (+ 7.81 min) |
| Black-box (Substitute Model + C&W) | 25.3% | 2.9708 | 0.47 min (+ 7.81 min) | 5.3% | 5.7439 | 0.49 min (+ 7.81 min) |
| Proposed black-box (ZOO-ADAM) | 100% | 0.19973 | 3.43 min | 96.8% | 0.39879 | 3.95 min |
| Proposed black-box (ZOO-Newton) | 100% | 0.23554 | 4.41 min | 97.0% | 0.54226 | 4.40 min |

### Untargeted Attack on Inception-v3

We run black-box attacks on 150 ImageNet test images (size 299 × 299 × 3).

Each attack is set to run only 2,000 iterations (within 20 minutes).

Table: Untargeted ImageNet Attack Comparison

| Method | Success Rate | Avg. L2 |
|---|---|---|
| White-box (C&W) | 100% | 0.37310 |
| Proposed black-box (ZOO-ADAM) | 88.9% | 1.19916 |

### Untargeted Attack on Inception-v3

Figure:ImageNet untargeted attack examples

### Targeted Attack on Inception-v3

### Targeted Attack on Inception-v3

Needs 20,000 iterations to perform this hard targeted attack (about 4 hours)

### Defense by Random Self Ensemble

(Arxiv ’17)

### Defense from Adversarial Attacks

Adversarial retraining (Goodfellow et al., 2015):

- Add adversarial examples into the training set

Robust optimization + BReLU (Zantedeschi et al., 2017):

min_w E_{(x,y)∼D} E_{Δx∼N(0,σ²)} loss(f_w(x + Δx), y)

Security community: detect adversarial examples

- MagNet (Meng and Chen, 2017), (Li and Li, 2017), ...

However, these defenses are still vulnerable if we choose a better way to attack (Carlini and Wagner, 2017, "MagNet and 'Efficient Defenses Against Adversarial Attacks' are Not Robust to Adversarial Examples").

### Adding randomness to the model

All the attacks are based on gradient computation.

Idea I: Add randomness to fool the attackers:

f(x) → f_ε(x), where ε is random

Resistant to black-box attacks, since the gradient cannot be estimated:

(f_{ε₁}(x + h e_i) − f_{ε₂}(x − h e_i)) / (2h) ≉ ∇_i f(x)

Idea II: Ensembles can improve the robustness of a model: ensemble T models f_{ε₁}(x), ..., f_{ε_T}(x)

Our approach: Random + Ensemble


### Adding randomness to the model

Naive approach: adding randomness only in the testing phase ⇒ prediction accuracy drops 87% → 20% (CIFAR-10 with VGG)

### Our approach (RSE)

Prediction: ensemble of random models:

p = Σ_{j=1}^T f_{ε_j}(w, x),  and predict ŷ = argmax_k p_k

Training: minimize an upper bound:

E_{(x,y)∼D} E_{ε∼N} loss(f_ε(x), y) ≥ E_{(x,y)∼D} loss(E_{ε∼N} f_ε(x), y)

Training by SGD: sample (x, y) (and ε), and update

w ← w − η ∇_w loss(f_ε(x), y)

This implicitly regularizes the Lipschitz constant:

E_{ε∼N(0,σ²)} loss(f_ε(w, x_i), y_i) ≈ loss(f_0(w, x_i), y_i) + (σ²/2) L_{ℓ∘f_0}

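A minimal sketch of the RSE prediction step under stated assumptions: a hypothetical linear scorer with Gaussian input noise stands in for the randomized network f_ε, and T, σ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])           # toy scorer: 3 classes, 2 features

def f_eps(x, eps):
    # one randomized forward pass: the noise draw changes every call
    return W @ (x + eps)

def rse_predict(x, T=100, sigma=0.1):
    # sum T independent noisy passes, then take the argmax of the sum
    p = sum(f_eps(x, rng.normal(scale=sigma, size=x.shape))
            for _ in range(T))
    return int(np.argmax(p))

x = np.array([2.0, 0.5])
# averaging over the noise recovers the noise-free decision here
assert rse_predict(x) == int(np.argmax(W @ x))
```

Because every forward pass uses a fresh ε, an attacker's two queries f_{ε₁}(x + h e_i) and f_{ε₂}(x − h e_i) differ in noise as well as input, which is what breaks the finite-difference gradient estimate.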

### Experiments

We fix σ_init = 0.4 and σ_inner = 0.1 for all the datasets/networks.

Figure: accuracy (%) and average L2 distortion vs. attack strength c (log scale, roughly 10^{-2} to 10^{2}) for STL-10 + Model A, CIFAR-10 + VGG16, and CIFAR-10 + ResNeXt, comparing No defense, Adversarial retraining, Robust Opt + BReLU, Distillation, and RSE.

### Targeted attack

Original image:

Targeted attack

### Conclusions and Discussions

Attack is easy in both black-box and white-box settings:

- back-door attack, one-pixel attack, ...

Defense is hard:

- Randomness and ensembles are useful
- Robust optimization (under different loss functions, distributions)
- Directly regularizing the Lipschitz constant ⇒ not successful, why?

Rethink which model we should use:

- Simple models or complex models?
- Or some combination of them?

## Thank You!

### References

[1] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, C.-J. Hsieh. ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks without Training Substitute Models. AISec, 2017.

[2] H. Chen, H. Zhang, P.-Y. Chen, J. Yi, C.-J. Hsieh. Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning. Arxiv 2017.

[3] X. Liu, M. Cheng, H. Zhang, C.-J. Hsieh. Towards Robust Neural Networks via Random Self-ensemble. Arxiv, 2017.

[4] P.-Y. Chen, Y. Sharma, H. Zhang, J. Yi, C.-J. Hsieh. EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. AAAI, 2018.

[5] T.-W. Weng, H. Zhang, P.-Y. Chen, D. Su, Y. Gao, J. Yi, C.-J. Hsieh, L. Daniel. Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach. 2017.

[6] N. Carlini, D. Wagner. Towards Evaluating the Robustness of Neural Networks. IEEE S&P, 2017.

[7] I. Goodfellow, J. Shlens, C. Szegedy. Explaining and Harnessing Adversarial Examples. ICLR 2015.

[8] V. Zantedeschi, M. Nicolae, A. Rawat. Efficient Defenses Against Adversarial Attacks. Arxiv, 2017.

[9] D. Meng, H. Chen. MagNet: a Two-Pronged Defense against Adversarial Examples. CCS, 2017.

[10] N. Carlini, D. Wagner. MagNet and "Efficient Defenses Against Adversarial Attacks" are Not Robust to Adversarial Examples. Arxiv, 2017.