
Enhancing Security with Redundancy

In this section, we show how to add redundancy to the perturbation to enhance the security of the privacy-preserving outsourcing scheme.

We consider the risk of leaked instances being linked to their perturbed counterparts. Suppose the data owner wants to incrementally add new training instances to the perturbed training dataset that has already been sent to the service provider. To be consistent with the perturbation of the previously sent training dataset, the new training instances must be perturbed by the same random matrix that perturbed the original training dataset. If only a few new perturbed instances are sent to the service provider, and the service provider obtains the actual instances behind those newly added perturbed instances from some external information source, it has a good chance of recognizing their mappings by a brute-force attack. This can happen, for example, when the data owner sends some new perturbed transactions to the service provider and the service provider acquires those new transactions from compromised customers of the data owner.

Although the random linear transformation is resistant to distance/dot-product inference attacks, if there are only a few new transactions, a brute-force attack can focus on this rather small new set and has a higher probability of succeeding, since there are not many possible matchings between a few leaked instances and a few new perturbed instances. If the service provider recognizes the mappings of $n$ or more linearly independent training instances for $n$-dimensional data, it can recover all other perturbed training instances by setting up simultaneous linear equations.

A naïve approach to prevent the mappings of new training instances from being recognized is to simply add the new instances to the existing training dataset, perturb the whole updated training dataset with another random matrix, and then send the newly perturbed training dataset along with the newly perturbed reduced set to the service provider. This keeps the space of possible mappings large enough to resist the brute-force attack. However, it is costly in both computation and communication, especially when the training dataset is very large.

In the following, we introduce a secure but less costly scheme that uses redundant perturbations of the reduced set, ensuring that the brute-force attack cannot derive the mappings of incrementally added perturbed training instances.

Let $p$ denote the number of new instances. When the data owner wants to add new training instances $\{x_{m+1}, \ldots, x_{m+p}\}$ to the existing perturbed training dataset $\{c_1, \ldots, c_m\}$ at the service provider, where $c_i = M x_i$, $i = 1, \ldots, m$, the data owner generates a new random matrix $M_1$ to perturb the new instances as $c_{m+i} = M_1 x_{m+i}$, $i = 1, \ldots, p$, and uses the corresponding matrix to perturb the original reduced set again, generating another perturbed version of the reduced set $s^1_j = (M_1^T)^{-1} r_j$, $j = 1, \ldots, \bar{m}$. Then the data owner sends the perturbed new instances $c_{m+i}$, $i = 1, \ldots, p$, and the newly perturbed version of the reduced set $s^1_j$, $j = 1, \ldots, \bar{m}$, to the service provider.
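The perturbation of a new batch can be sketched in a few lines of NumPy. This is only a minimal illustration, not the authors' implementation; the function name `perturb_new_batch` and the row-vector layout are our own assumptions, and instances are stored as rows so that $c_{m+i} = M_1 x_{m+i}$ becomes a right-multiplication by $M_1^T$.

```python
import numpy as np

def perturb_new_batch(new_X, reduced_R, rng):
    """Perturb a batch of new instances with a fresh random matrix M1 and
    produce the matching redundant perturbation of the reduced set.

    new_X     : (p, n) array, rows are the new instances x_{m+1}, ..., x_{m+p}
    reduced_R : (m_bar, n) array, rows are the reduced-set vectors r_1, ..., r_{m_bar}
    """
    n = new_X.shape[1]
    M1 = rng.standard_normal((n, n))     # fresh random matrix, invertible with probability 1
    new_C = new_X @ M1.T                 # rows are c_{m+i} = M1 x_{m+i}
    S1 = reduced_R @ np.linalg.inv(M1)   # rows are s^1_j = (M1^T)^{-1} r_j
    return new_C, S1
```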

The service provider can derive the kernel evaluations between the new training instances and the reduced set, $k(x_{m+i}, r_j)$, $i = 1, \ldots, p$, $j = 1, \ldots, \bar{m}$, from the dot products of $c_{m+i}$, $i = 1, \ldots, p$, and $s^1_j$, $j = 1, \ldots, \bar{m}$, since $c_{m+i}^T s^1_j = x_{m+i}^T M_1^T (M_1^T)^{-1} r_j = x_{m+i}^T r_j$. In conjunction with the original secure kernel matrix, the service provider can then build a complete secure kernel matrix on the whole training dataset, including the newly added instances.
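As a quick check of the sketch above (with hypothetical dimensions), the dot products computed on the perturbed data coincide with those on the original data, which is all the service provider needs to fill in the kernel entries $k(x_{m+i}, r_j)$:

```python
rng = np.random.default_rng(0)
new_X = rng.standard_normal((5, 8))       # p = 5 new instances, n = 8 dimensions
reduced_R = rng.standard_normal((3, 8))   # m_bar = 3 reduced-set vectors
new_C, S1 = perturb_new_batch(new_X, reduced_R, rng)
# c_{m+i}^T s^1_j = x_{m+i}^T M1^T (M1^T)^{-1} r_j = x_{m+i}^T r_j
assert np.allclose(new_C @ S1.T, new_X @ reduced_R.T)
```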

If $p \ge n$, where $n$ is the dimensionality of the data, the data owner partitions the new instances into $q = \lceil p/(n-1) \rceil$ groups, where each group has at most $n - 1$ instances, and generates $q$ different random matrices $M_i$, $i = 1, \ldots, q$, to perturb each group of new instances along with the corresponding perturbed versions of the reduced set $s^i_j = (M_i^T)^{-1} r_j$, $i = 1, \ldots, q$, $j = 1, \ldots, \bar{m}$. Then all groups of perturbed new instances and the corresponding perturbed versions of the reduced set are sent to the service provider for building the secure kernel matrix. A sketch of this grouping step follows.
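The grouping step could look like the following sketch, reusing `perturb_new_batch` from above; the helper name and the simple contiguous split are illustrative choices.

```python
def perturb_in_groups(new_X, reduced_R, rng):
    """Split the new instances into groups of at most n-1 and perturb each group
    (plus a redundant copy of the reduced set) with its own random matrix."""
    p, n = new_X.shape
    q = -(-p // (n - 1))                      # q = ceil(p / (n - 1)) groups
    batches = []
    for i in range(q):
        group = new_X[i * (n - 1):(i + 1) * (n - 1)]
        batches.append(perturb_new_batch(group, reduced_R, rng))
    return batches                            # q pairs (perturbed group, perturbed reduced set)
```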

In the above scheme for incrementally adding new training instances to the perturbed training dataset at the service provider, each group of new training instances is perturbed by a different random matrix, and the number of instances in each group is smaller than the dimensionality of the data. This ensures that the service provider cannot break the perturbations even if it obtains the actual content of the new instances from external information sources: to break an $n \times n$ random linear transformation by a brute-force attack, at least $n$ linearly independent instances are required to set up the simultaneous equations, but each linear transformation here covers at most $n - 1$ instances. Without enough linearly independent instances, there are infinitely many solutions. Hence none of the random linear transformations used in the incremental addition of perturbed training data can be broken.
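The counting argument can be made explicit (this restates the reasoning above, with $k$ denoting the number of instance/perturbation pairs known to the attacker):

```latex
% Each known pair (x_i, c_i = M x_i) yields n linear equations in the
% n^2 unknown entries of M.  With k <= n-1 known pairs:
\[
  \underbrace{k\,n}_{\text{equations}} \;\le\; (n-1)\,n \;<\; \underbrace{n^2}_{\text{unknowns in } M},
\]
% so the system is underdetermined and infinitely many matrices M are
% consistent with the observed pairs.
```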

This approach provides security guarantees for the incremental addition of new training instances. The redundant communication cost of this scheme is sending the additional perturbed versions of the reduced set. Compared to the naïve approach, which sends the complete newly perturbed training dataset, it saves much communication cost because the size of the reduced set is typically smaller than 10% of the training dataset.

For large datasets, the reduced set can be very small; simply using 1% of the number of training instances is appropriate, provided the size of the reduced set is larger than some tolerance [28].

The communication cost of the secure incremental approach is $(p + \bar{m}\lceil p/(n-1) \rceil) n$ for sending the $p$ perturbed new training instances and the $\lceil p/(n-1) \rceil$ groups of newly perturbed versions of the reduced set. The naïve approach requires sending a new perturbation of the whole training dataset and the reduced set, which costs $(m + p + \bar{m}) n$. If the size of the reduced set $\bar{m}$ is small or the dimensionality of the data $n$ is large, the secure incremental approach saves much communication cost.
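For concreteness, a small comparison with hypothetical sizes (100,000 training instances, a reduced set of 1,000, 1,000 new instances, 50 dimensions); the numbers are ours, chosen only to illustrate the two cost formulas:

```python
import math

def communication_costs(p, n, m, m_bar):
    """Numbers transmitted by the secure incremental scheme vs. the naive re-perturbation."""
    secure = (p + m_bar * math.ceil(p / (n - 1))) * n
    naive = (m + p + m_bar) * n
    return secure, naive

print(communication_costs(p=1000, n=50, m=100000, m_bar=1000))
# -> (1100000, 5100000): the incremental scheme sends roughly 4.6x fewer values here
```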