
(1)

Security for Deep Neural Networks

Cho-Jui Hsieh

Depts. of Computer Science and Statistics, UC Davis

NTU Machine Learning Symposium Dec 23, 2017

Joint work with Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Yash Sharma, Hongge Chen, Xuanqing Liu, Tsui-Wei Weng, Minhao Cheng

(2)

Outline

Attack algorithms:
  Review of recent works
  ZOO: Zeroth Order Optimization for black-box attacks

Defensive algorithms:
  Random Self-Ensemble for defense

(3)

Adversarial Examples

Given:

Targeted model f(·)
Targeted example x0 (correctly classified)

Adversarial example:

Another example x close to x0 but f(x) ≠ f(x0)

(Examples: classification and image captioning)

(4)

Can this happen in the real world?

Stop sign misclassified as a speed limit sign:

“Robust Physical-World Attacks on Deep Learning Models”, 2017.

Real world adversarial turtle:

“Synthesizing Robust Adversarial Examples”, 2017.

(see https://youtu.be/qPxlhGSG0tc)

(5)

Attack ML models

(6)

Threat model

Given:

Targeted model f(·)
Targeted example x0 (correctly classified by f, so f(x0) = y0)

The attacker constructs an adversarial example x that is close to x0

Two types of attacks:

Untargeted attack: successful if f(x) ≠ y0 (the true label)
Targeted attack: successful if f(x) = t (t: the target prediction)

Ideally, attackers want to minimize ‖x − x0‖

⇒ small distortion, hard to be detected by humans

(7)

White Box Setting

Assume the attacker has full knowledge of both the network structure and the weights of the neural network (knows f_W(·))

Based on this, the attacker can compute ∇_x f_W(x) using back-propagation.

The gradient measures the sensitivity of the prediction w.r.t. the input.

(8)

First attack algorithm: Gradient Sign Method

Proposed by (Goodfellow et al., 2015):

  x = x0 + ε · sign(∇_x loss(f_w(x0), y0))

Why? Aim to maximize loss(f_w(x), y0)
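For illustration, a minimal PyTorch sketch of this gradient-sign step (not from the slides; `model`, `eps`, and the use of cross-entropy loss are assumptions for a generic image classifier):

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, x0, y0, eps=0.03):
        """One gradient-sign step: x = x0 + eps * sign(grad_x loss(f_w(x0), y0))."""
        x = x0.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y0)       # the loss the attacker wants to increase
        loss.backward()                            # back-propagation w.r.t. the input pixels
        x_adv = x + eps * x.grad.sign()            # move every pixel in the loss-increasing direction
        return x_adv.clamp(0.0, 1.0).detach()      # keep pixels in the valid range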

(9)

Formulate attack as an optimization problem

Input image: x0 ∈ R^p, adversarial image: x ∈ R^p

  minimize_x  ‖x − x0‖₂² + c · g(x)
  subject to  x ∈ [0, 1]^p

‖x − x0‖₂² measures the L2 distortion
g(x) is some loss measuring how successful the attack is
c ≥ 0 controls the trade-off

(10)

Carlini & Wagner’s (C&W, 2017) Attack

Carlini & Wagner propose to use the following loss:

  targeted:   g(x, t) = max{ max_{i≠t} [Z(x)]_i − [Z(x)]_t, 0 }
  untargeted: g(x)    = max{ [Z(x)]_y − max_{i≠y} [Z(x)]_i, 0 }

where Z(x) ∈ R^K is the logit-layer output (or probability output)

Minimize by gradient descent

Almost 100% success rate for CNN

Can be generalized to other problems by defining a loss function
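A hedged sketch of this objective for the targeted case, following the formulation above; the optimizer, step count, and the simple clamp-based box constraint are illustrative choices (the original C&W attack uses a tanh change of variables instead):

    import torch

    def cw_targeted_loss(logits, t):
        """g(x, t) = max{ max_{i != t} [Z(x)]_i - [Z(x)]_t, 0 } for a single example's logits."""
        mask = torch.ones_like(logits, dtype=torch.bool)
        mask[t] = False
        return torch.clamp(logits[mask].max() - logits[t], min=0.0)

    def cw_attack(model, x0, t, c=1.0, steps=200, lr=0.01):
        """Minimize ||x - x0||_2^2 + c * g(x, t) by gradient descent; x0 has a batch dimension of 1."""
        x = x0.clone().detach().requires_grad_(True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((x - x0) ** 2).sum() + c * cw_targeted_loss(model(x)[0], t)
            loss.backward()
            opt.step()
            with torch.no_grad():
                x.clamp_(0.0, 1.0)   # simple box projection (C&W instead optimize in tanh space)
        return x.detach()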

(11)

Example: Attack Image Captioning

Image captioning: usually by CNN + RNN model

(From Show-and-Tell, 2014)

(12)

Example: Attack Image Captioning

Targeted attack: output a specified sentence
Keywords attack: output contains specified keywords

Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning.

By Chen, Zhang, Chen, Yi, Hsieh, 2017.

(13)

Success Rate and Distortion

Github link:

https://github.com/huanzhang12/ImageCaptioningAttack

(14)

Black Box Attacks by Zeroth Order Optimization

(AISec ’17)

(15)

Black Box Attacks

In practice, the deep network structure/parameters are not revealed to attackers

Black-box setting:

Deep network is not revealed to attackers.

Therefore, no back-propagation is available; attackers cannot compute ∇_x f_W(x)

But attackers can make queries and get the logit-layer output (probability output).

(16)

Attack settings: White and Black Box

(17)

Existing method for black-box attack

The only existing method: Substitute Model (Papernot et al., 2016)

Main idea:

Make a lot of queries {x_i}, i = 1, ..., N, and collect the corresponding outputs:
  y_i = f_w(x_i), ∀ i = 1, ..., N

Train a substitute model f̂ to fit the input-output pairs:
  f̂(x_i) ≈ y_i, ∀ i = 1, ..., N

Generate an adversarial example x by a white-box attack on f̂
  (the gradient of f̂ can be computed)

Use x to attack the original model f
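A minimal sketch of the query-and-fit step described above, under the assumption that the black box returns probability vectors; `query_fn`, the KL-based fitting loss, and the training schedule are illustrative placeholders:

    import torch
    import torch.nn.functional as F

    def train_substitute(query_fn, substitute, queries, epochs=10, lr=1e-3):
        """Fit a substitute model f_hat to the (x_i, f(x_i)) pairs collected from the black box.

        `query_fn` takes one image and returns the black box's probability vector."""
        xs = torch.stack(list(queries))                       # x_1, ..., x_N
        ys = torch.stack([query_fn(x) for x in queries])      # y_i = f(x_i)
        opt = torch.optim.Adam(substitute.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            # match the black box's soft outputs (one of several possible fitting losses)
            loss = F.kl_div(F.log_softmax(substitute(xs), dim=1), ys, reduction="batchmean")
            loss.backward()
            opt.step()
        return substitute   # white-box attacks (FGSM, C&W) can now be run against this model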

(18)

Weaknesses of substitute model

Hard to have f̂ = f because:

The network structure is unknown

Even if the network structure is known, deep learning is highly non-convex
  ⇒ SGD never converges to the same solution

The success rate is very low even if the network structure is known.

Table: MNIST and CIFAR-10 Attack Comparison (Success Rate / Avg. L2 / Avg. Time per attack)

MNIST, untargeted:
  White-box (C&W):                     100%   / 1.48066 / 0.48 min
  Black-box (Substitute Model + FGSM): 40.6%  / -       / 0.002 sec (+ 6.16 min)
  Black-box (Substitute Model + C&W):  33.3%  / 3.6111  / 0.76 min (+ 6.16 min)

MNIST, targeted:
  White-box (C&W):                     100%   / 2.00661 / 0.53 min
  Black-box (Substitute Model + FGSM): 7.48%  / -       / 0.002 sec (+ 6.16 min)
  Black-box (Substitute Model + C&W):  26.74% / 5.272   / 0.80 min (+ 6.16 min)

CIFAR-10, untargeted:
  White-box (C&W):                     100%   / 0.17980 / 0.20 min
  Black-box (Substitute Model + FGSM): 76.1%  / -       / 0.005 sec (+ 7.81 min)
  Black-box (Substitute Model + C&W):  25.3%  / 2.9708  / 0.47 min (+ 7.81 min)

CIFAR-10, targeted:
  White-box (C&W):                     100%   / 0.37974 / 0.16 min
  Black-box (Substitute Model + FGSM): 11.48% / -       / 0.005 sec (+ 7.81 min)
  Black-box (Substitute Model + C&W):  5.3%   / 5.7439  / 0.49 min (+ 7.81 min)

(19)

ZOO (Our proposed black box attack)

Input image: x0, adversarial image: x, target class label: t.

Define the following optimization problem:

  minimize_x  ‖x − x0‖₂² + c · f(x, t)
  subject to  x ∈ [0, 1]^p

The same formulation as (C&W, 2017). The loss function:

  f(x, t) = max{ max_{i≠t} log[F(x)]_i − log[F(x)]_t, 0 }

where F(x) ∈ R^K is the black-box output (probabilities)

Cannot run gradient descent since we cannot compute ∇_x F(x)

(20)

Zeroth Order Optimization

Access to F(x) only; ∇F(x) is not available.

Estimate the gradient ĝ_i using the symmetric difference quotient:

  ĝ_i := ∂F(x)/∂x_i ≈ (F(x + h·e_i) − F(x − h·e_i)) / (2h)

The Hessian entry ĥ_i can also be estimated with only one additional query:

  ĥ_i := ∂²F(x)/∂x_i² ≈ (F(x + h·e_i) − 2F(x) + F(x − h·e_i)) / h²
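A small NumPy sketch of these two estimates for a single coordinate; `query` stands for the scalar black-box objective, and the step size `h` is an assumed small constant:

    import numpy as np

    def estimate_coordinate(query, x, i, h=1e-4):
        """Symmetric-difference estimates of the i-th partial derivative and diagonal Hessian entry.

        `query` is the scalar-valued black-box objective; three queries are used in total."""
        e_i = np.zeros_like(x)
        e_i.flat[i] = 1.0
        f_plus, f_minus, f_zero = query(x + h * e_i), query(x - h * e_i), query(x)
        g_i = (f_plus - f_minus) / (2.0 * h)               # gradient estimate (2 queries)
        h_i = (f_plus - 2.0 * f_zero + f_minus) / h ** 2   # Hessian estimate (1 extra query)
        return g_i, h_i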

(21)

Challenges

Large number of variables (pixels) for large images. Very expensive to estimate the full gradient!

For an ImageNet-quality image, the resolution is 299 × 299 × 3, which needs 536,406 evaluations to estimate the full gradient and run one iteration of gradient descent.

(22)

Black-box attack by Coordinate Descent

Algorithm 1: Stochastic Coordinate Descent
1: while not converged do
2:   Randomly pick a batch of coordinates B ⊆ {1, ..., p}
3:   For all i ∈ B, update
       x_i ← x_i − η · ĝ_i / ĥ_i
     where
       ĝ_i = (f(x + h·e_i) − f(x − h·e_i)) / (2h)
       ĥ_i = (f(x + h·e_i) − 2f(x) + f(x − h·e_i)) / h²
4: end while

Update without estimating the gradient of all pixels
In practice we choose |B| = 128 (with one GPU)
ZOO-Adam: replace the inner update by Adam
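A sketch of this coordinate-descent loop under the same assumptions; `attack_objective` stands for the ZOO loss ‖x − x0‖² + c·f(x, t) evaluated through black-box queries, and the Newton-style step with a plain-gradient fallback is an illustrative reading of the update above:

    import numpy as np

    def zoo_coordinate_descent(attack_objective, x0, steps=1000, batch=128, eta=0.01, h=1e-4):
        """Stochastic coordinate descent with zeroth-order estimates, one random batch of pixels per step."""
        x = x0.copy()
        for _ in range(steps):
            coords = np.random.choice(x.size, size=batch, replace=False)   # batch of coordinates B
            for i in coords:
                e_i = np.zeros_like(x)
                e_i.flat[i] = 1.0
                f_p, f_0, f_m = attack_objective(x + h * e_i), attack_objective(x), attack_objective(x - h * e_i)
                g_i = (f_p - f_m) / (2.0 * h)             # estimated gradient for coordinate i
                h_i = (f_p - 2.0 * f_0 + f_m) / h ** 2    # estimated diagonal Hessian entry
                step = g_i / h_i if h_i > 0 else g_i      # Newton-style step; plain gradient if curvature is unhelpful
                x.flat[i] = np.clip(x.flat[i] - eta * step, 0.0, 1.0)   # stay inside [0, 1]^p
        return x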

(23)

Attack-space Dimension Reduction

The attack space is the space in which we search for the adversarial noise.

Instead of searching in R^p, we can search in a smaller space (with fewer pixels) using dimension reduction techniques.

This greatly reduces the number of pixels to optimize and makes the ZOO attack practical for large images.

(24)

Attack-space Dimension Reduction

For images, downscaling is the easiest way.

Only the noise is down-scaled; the input image itself is unchanged.
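A sketch of this idea: the noise is optimized at low resolution and only up-scaled when added to the (unchanged) full-resolution image. The use of SciPy's linear interpolation here is an illustrative choice:

    import numpy as np
    from scipy.ndimage import zoom

    def apply_lowres_noise(x0, delta_small):
        """x0: full-resolution image (H, W, C); delta_small: noise optimized at a lower resolution (h, w, C)."""
        factors = (x0.shape[0] / delta_small.shape[0],
                   x0.shape[1] / delta_small.shape[1],
                   1)
        delta_full = zoom(delta_small, factors, order=1)   # linear up-scaling of the noise only
        return np.clip(x0 + delta_full, 0.0, 1.0)          # the input image itself is never resized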

(25)

Hierarchical Attack

Gradually increase the dimension of the attack space after some iterations:

32 × 32 → 64 × 64 → 128 × 128

(Figure: the adversarial noise visualized in the attack space at 32 × 32, 64 × 64, and 128 × 128 resolution, plotted over x/y pixel coordinates.)

(26)

Targeted Attack on MNIST

(a) Sampled original images

(b) White-box C&W

(c) Black-box ZOO-ADAM

(d) Black-box ZOO-Newton

Figure: Visual comparison of successful adversarial examples on MNIST. Each row displays crafted adversarial examples from the sampled images in (a). Each column in (b) to (d) indexes the targeted class for the attack.

(27)

Targeted Attack on CIFAR-10

(a) Sampled original images

(b) White-box C&W

(c) Black-box ZOO-ADAM

(d) Black-box ZOO-Newton

Figure: Visual comparison of successful adversarial examples on CIFAR-10. Each row displays crafted adversarial examples from the sampled images in (a). Each column in (b) to (d) indexes the targeted class for the attack.

(28)

Targeted Attack on MNIST & CIFAR10

Success rates are close to the white-box (C&W) attack, very close to 100%. L2 distortions are also very similar. Attack time is reasonable.

Table: MNIST and CIFAR-10 Attack Comparison (Success Rate / Avg. L2 / Avg. Time per attack)

MNIST, untargeted:
  White-box (C&W):                     100%   / 1.48066  / 0.48 min
  Black-box (Substitute Model + FGSM): 40.6%  / -        / 0.002 sec (+ 6.16 min)
  Black-box (Substitute Model + C&W):  33.3%  / 3.6111   / 0.76 min (+ 6.16 min)
  Proposed black-box (ZOO-ADAM):       100%   / 1.49550  / 1.38 min
  Proposed black-box (ZOO-Newton):     100%   / 1.51502  / 2.75 min

MNIST, targeted:
  White-box (C&W):                     100%   / 2.00661  / 0.53 min
  Black-box (Substitute Model + FGSM): 7.48%  / -        / 0.002 sec (+ 6.16 min)
  Black-box (Substitute Model + C&W):  26.74% / 5.272    / 0.80 min (+ 6.16 min)
  Proposed black-box (ZOO-ADAM):       98.9%  / 1.987068 / 1.62 min
  Proposed black-box (ZOO-Newton):     98.9%  / 2.057264 / 2.06 min

CIFAR-10, untargeted:
  White-box (C&W):                     100%   / 0.17980  / 0.20 min
  Black-box (Substitute Model + FGSM): 76.1%  / -        / 0.005 sec (+ 7.81 min)
  Black-box (Substitute Model + C&W):  25.3%  / 2.9708   / 0.47 min (+ 7.81 min)
  Proposed black-box (ZOO-ADAM):       100%   / 0.19973  / 3.43 min
  Proposed black-box (ZOO-Newton):     100%   / 0.23554  / 4.41 min

CIFAR-10, targeted:
  White-box (C&W):                     100%   / 0.37974  / 0.16 min
  Black-box (Substitute Model + FGSM): 11.48% / -        / 0.005 sec (+ 7.81 min)
  Black-box (Substitute Model + C&W):  5.3%   / 5.7439   / 0.49 min (+ 7.81 min)
  Proposed black-box (ZOO-ADAM):       96.8%  / 0.39879  / 3.95 min
  Proposed black-box (ZOO-Newton):     97.0%  / 0.54226  / 4.40 min

(29)

Untargeted Attack on Inception-v3

We run black-box attacks on 150 ImageNet test images (size 299 × 299 × 3).

Each attack is set to run only 2,000 iterations (within 20 minutes).

Table: Untargeted ImageNet Attack Comparison

                                 Success Rate   Avg. L2
  White-box (C&W)                100%           0.37310
  Proposed black-box (ZOO-ADAM)  88.9%          1.19916

(30)

Untargeted Attack on Inception-v3

Figure:ImageNet untargeted attack examples

(31)

Targeted Attack on Inception-v3

(32)

Targeted Attack on Inception-v3

Needs 20,000 iterations to perform this hard targeted attack (about 4 hours)

(33)

Defense by Random Self Ensemble

(arXiv ’17)

(34)

Defense from Adversarial Attacks

Adversarial retraining (Goodfellow et al., 2015):
  Adding adversarial examples into the training set

Robust optimization + BReLU (Zantedeschi et al., 2017):
  min_w E_{(x,y)∼D} E_{Δx∼N(0,σ²)} loss(f_w(x + Δx), y)

Security community: detect adversarial examples
  MagNet (Meng and Chen, 2017), (Li and Li, 2017), ...

However, these defenses are still vulnerable if we choose a better way to attack (Carlini and Wagner, 2017, "MagNet and 'Efficient Defenses Against Adversarial Attacks' are Not Robust to Adversarial Examples")

(35)

Adding randomness to the model

All the attacks are based ongradient computation Idea I: Adding randomnessto fool the attackers

f (x ) → f(x ) where  is random Resistant to Black-box attack: gradient cannot be estimated

f1(x + hei) − f2(x − hei)

2h 6≈ ∇if (x )

Idea II: Ensemble can improve the robustness of model: ensemble T models f1(x ), · · · , fT(x ) Our approach: Random + Ensemble
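A toy sketch of why a random model defeats the finite-difference estimate: the two queries see independent noise draws, so their difference is dominated by noise rather than by the gradient (the noise scale and step size are illustrative):

    import numpy as np

    def noisy_query(f, x, sigma=0.1):
        """A randomized model: every query returns f_eps(x) = f(x) + fresh Gaussian noise eps."""
        return f(x) + np.random.normal(0.0, sigma)

    def naive_gradient_estimate(f, x, i, h=1e-4, sigma=0.1):
        """(f_eps1(x + h e_i) - f_eps2(x - h e_i)) / (2h): the two noise draws differ,
        so the estimate is dominated by an error of size ~sigma/h rather than the true gradient."""
        e_i = np.zeros_like(x)
        e_i.flat[i] = 1.0
        return (noisy_query(f, x + h * e_i, sigma) - noisy_query(f, x - h * e_i, sigma)) / (2.0 * h)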


(37)

Adding randomness to the model

Naive approach: adding randomness only in the testing phase
  Prediction accuracy drops from 87% to 20% (CIFAR-10 with VGG)

(38)

Our approach (RSE)

Prediction: ensemble of random models

  p = Σ_{j=1}^T f_{ε_j}(w, x),  and predict ŷ = argmax_k p_k

Training: minimizing an upper bound

  E_{(x,y)∼D} E_{ε∼N} loss(f_ε(x), y) ≥ E_{(x,y)∼D} loss(E_{ε∼N} f_ε(x), y)

Training by SGD: sample (x, y) and ε, then update

  w ← w − η ∇_w loss(f_ε(x), y)

Implicitly regularizes the Lipschitz constant:

  E_{ε∼N(0,σ²)} loss(f_ε(w, x_i), y_i) ≈ loss(f_0(w, x_i), y_i) + (σ²/2) · L_{ℓ·f_0}
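A hedged PyTorch sketch of the RSE recipe: Gaussian noise injected into the forward pass at both training and test time, with test predictions averaged over T noisy passes. The layer placement and the tiny 32×32×3 network are simplified illustrations, not the paper's exact architecture:

    import torch
    import torch.nn as nn

    class NoiseLayer(nn.Module):
        """Adds fresh Gaussian noise to its input on every forward pass (training and testing)."""
        def __init__(self, sigma):
            super().__init__()
            self.sigma = sigma

        def forward(self, x):
            return x + self.sigma * torch.randn_like(x)

    # Illustrative network for 32x32x3 inputs; sigma values follow the slides (0.4 init, 0.1 inner).
    model = nn.Sequential(
        NoiseLayer(0.4),                              # sigma_init: noise added before the first layer
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        NoiseLayer(0.1),                              # sigma_inner: noise added before inner layers
        nn.Flatten(), nn.Linear(16 * 32 * 32, 10),
    )

    def rse_predict(model, x, T=10):
        """Ensemble of T random forward passes: average the probabilities, then take the argmax."""
        with torch.no_grad():
            p = sum(torch.softmax(model(x), dim=1) for _ in range(T)) / T
        return p.argmax(dim=1)

    # Training is ordinary SGD on single noise draws, e.g.:
    #   loss = nn.functional.cross_entropy(model(x), y); loss.backward(); optimizer.step()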


(40)

Experiments

We fix σ_init = 0.4 and σ_inner = 0.1 for all datasets/networks.

(Figures: attack strength C versus accuracy (%) and average distortion on STL-10 + Model A, CIFAR-10 + VGG16, and CIFAR-10 + ResNeXt, comparing No defense, Adversarial retraining, Robust Opt + BReLU, Distillation, and RSE.)

(41)

Targeted attack

(Figure: original image and the corresponding targeted adversarial example.)

(42)

Conclusions and Discussions

Attack is easy in both black-box and white-box settings:
  back-door attacks, one-pixel attacks, ...

Defense is hard:
  Randomness and ensembles are useful
  Robust optimization (under different loss functions and distributions)
  Directly regularizing the Lipschitz constant ⇒ not successful; why?

Rethink which models we should use:
  Simple models or complex models? Or some combination of them?

Thank You!

(43)

References

[1] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, C.-J. Hsieh. ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks without Training Substitute Models. AISec, 2017.

[2] H. Chen, H. Zhang, P.-Y. Chen, J. Yi, C.-J. Hsieh. Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning. arXiv, 2017.

[3] X. Liu, M. Cheng, H. Zhang, C.-J. Hsieh. Towards Robust Neural Networks via Random Self-ensemble. arXiv, 2017.

[4] P.-Y. Chen, Y. Sharma, H. Zhang, J. Yi, C.-J. Hsieh. EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. AAAI, 2018.

[5] T.-W. Weng, H. Zhang, P.-Y. Chen, D. Su, Y. Gao, J. Yi, C.-J. Hsieh, L. Daniel. Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach. 2017.

[6] N. Carlini, D. Wagner. Towards Evaluating the Robustness of Neural Networks. IEEE S&P, 2017.

[7] I. Goodfellow, J. Shlens, C. Szegedy. Explaining and Harnessing Adversarial Examples. ICLR, 2015.

[8] V. Zantedeschi, M. Nicolae, A. Rawat. Efficient Defenses Against Adversarial Attacks. arXiv, 2017.

[9] D. Meng, H. Chen. MagNet: A Two-Pronged Defense against Adversarial Examples. CCS, 2017.

[10] N. Carlini, D. Wagner. MagNet and "Efficient Defenses Against Adversarial Attacks" are Not Robust to Adversarial Examples. arXiv, 2017.
