
8.3 Proposed Model: Virtual Product

Given a security product $P$, our Virtual Product model aims to detect and categorize incidents for customers who have not deployed $P$. We formulate the construction of Virtual Product as a classification problem, training it on machine-day observations collected from machines that have deployed $P$. The training process learns the functional mapping between the event occurrence patterns of other products and the incident class labels reported by $P$. During testing, the Virtual Product model takes as input the observed event occurrence patterns from products other than $P$, and produces incident detection and categorization results.

A main challenge for Virtual Product lies in training and applying it with incomplete event occurrence patterns as input. Events may be missing either because their corresponding products are not deployed, or because of data corruption during the telemetry collection process.

To address this issue, we propose a semi-supervised non-negative matrix factorization method (SSNMF) as the core computational technique of Virtual Product. It extracts a unified discriminative feature representation of the event occurrence records from both the training and testing datasets. We conduct incident detection and categorization in Virtual Product by feeding the learned feature representations as input to any standard supervised classifier. Virtual Product thus denotes the process of conducting SSNMF on event occurrence data, followed by training a supervised classifier on the output of SSNMF.

Another contribution of Virtual Product is the estimation of event occurrence patterns that are missing from the observed data, which helps security analysts understand the relations between event occurrence profiles and reported security incidents. SSNMF matches this requirement well, as it is intrinsically capable of reconstructing missing event values through the inner product of its low-rank factor matrices.

8.3.1 Semi-Supervised Non-negative Matrix Factorization (SSNMF)

We use a non-negative data matrix $X \in \mathbb{R}^{N \times M}$ to denote the aggregation of both training and testing event occurrence data. Each row of $X$, denoted $X_{i,:}$, contains the occurrence counts of different events for one machine-day. Without loss of generality, the first $N_1$ rows of $X$ constitute the training event occurrence data; they are accompanied by the corresponding incident class labels reported by the target product $P$. The remaining $N - N_1$ rows of $X$ are the testing data, i.e., event occurrence data collected from customers' machines without $P$ deployed.
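For concreteness, the short NumPy sketch below shows one way such a data matrix and its mask of observed entries (the matrix $M$ used later in this section) could be assembled from per-machine-day event counts; the helper name `build_event_matrix` and the toy record layout are illustrative assumptions rather than the exact pipeline of this work.

```python
import numpy as np

def build_event_matrix(records, observed, event_vocab):
    """Assemble the machine-day x event count matrix X and observation mask M.

    records:  dict mapping a machine-day key to {event_name: count}
    observed: dict mapping the same key to the set of events whose telemetry
              was present on that machine-day (observed entries may be zero)
    """
    index = {e: j for j, e in enumerate(event_vocab)}
    X = np.zeros((len(records), len(event_vocab)))
    M = np.zeros_like(X)
    for i, (key, counts) in enumerate(records.items()):
        for event in observed[key]:
            M[i, index[event]] = 1.0          # entry is observed
        for event, c in counts.items():
            X[i, index[event]] = float(c)     # non-negative occurrence count
    return X, M

# Toy usage: two machine-days, three event types.
records = {"m1/day1": {"evA": 3, "evB": 1}, "m2/day1": {"evC": 5}}
observed = {"m1/day1": {"evA", "evB"}, "m2/day1": {"evC"}}
X, M = build_event_matrix(records, observed, ["evA", "evB", "evC"])
```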

Non-negative Matrix Factorization reconstructs a non-negative data matrix $X \in \mathbb{R}^{N \times M}$ using the product of two non-negative factors $U \in \mathbb{R}^{N \times k}$ and $V \in \mathbb{R}^{M \times k}$, where $k$ is the number of latent features, often determined by cross-validation. As shown in Equation (8.1), the latent factors are learned by minimizing the reconstruction error on the observed events in our data.

$$U, V = \operatorname*{argmin}_{U, V > 0} \; \|X - UV^T\|_o^2 \qquad (8.1)$$

The norm $\|\cdot\|_o$ indicates the aggregated reconstruction error over the observed entries of $X$.
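As an illustration, this masked reconstruction error $\|X - UV^T\|_o^2$ can be computed as in the minimal NumPy sketch below; the function name and the random toy data are ours.

```python
import numpy as np

def masked_reconstruction_error(X, U, V, M):
    """||X - U V^T||_o^2: squared error restricted to observed entries (M == 1)."""
    residual = X - U @ V.T
    return float(np.sum((M * residual) ** 2))

# Toy check with random non-negative factors and a random observation mask.
rng = np.random.default_rng(0)
X = rng.random((6, 4))
M = (rng.random((6, 4)) > 0.3).astype(float)
U, V = rng.random((6, 2)), rng.random((4, 2))
print(masked_reconstruction_error(X, U, V, M))
```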

Each row of $U$, $U_{i,:}$, represents the linear projection of $X_{i,:}$ and forms a new feature representation of the corresponding machine-day observation in a low-dimensional space. The column vectors of $V$ are the projection bases spanning the projection space.

To integrate supervision information into the matrix factorization process, we introduce a class-sensitive loss into the objective function, forcing machine-day observations of different classes to be separated from each other in the projected space. Equation (8.2) and Equation (8.3) give the formulations of the discriminative loss functions for the binary and multi-class classification scenarios, respectively.

$$F(\hat{Y}, U, W) = -\sum_{i=1}^{N} \left[ \hat{Y}_i \log \frac{1}{1 + \exp(-U_{i,:}W^T)} + (1 - \hat{Y}_i) \log \frac{1}{1 + \exp(U_{i,:}W^T)} \right] \qquad (8.2)$$

$$F(\hat{Y}, U, W) = -\sum_{i=1}^{N} \sum_{j=1}^{C} \hat{Y}_{i,j} \log \frac{\exp(U_{i,:}W_{j,:}^T)}{\sum_{j'} \exp(U_{i,:}W_{j',:}^T)} \qquad (8.3)$$

where $W \in \mathbb{R}^{1 \times k}$ stores the regression coefficients and $\hat{Y}_i$ represents the class label of each machine-day observation. For labeled machine-days, $\hat{Y}_i$ is either 1 or 0, depending on whether $X_{i,:}$ belongs to the positive or negative class. For unlabeled machine-days, $\hat{Y}_i$ represents any plug-in estimate of the probabilistic confidence of $X_{i,:}$ belonging to the positive class. In the multi-class version of the loss function, $C$ denotes the number of classes in the labeled dataset; accordingly, $W$ becomes a $\mathbb{R}^{C \times k}$ matrix, each row of which holds the regression coefficients of one class. For labeled data, $\hat{Y}_{i,j}$ is defined following a one-hot encoding scheme; for unlabeled data, $\hat{Y}_{i,j}$ represents the probabilistic class membership of $X_{i,:}$. $U$ is the common factor shared by the matrix factorization in Equation (8.1) and the class-sensitive loss functions defined in Equations (8.2) and (8.3). This design guarantees that the feature representation $U$ preserves the class-separating structure of the training data.
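For illustration, both discriminative losses can be evaluated as in the NumPy sketch below, using numerically stable log-sigmoid and log-softmax computations; the function names are ours and `Y_hat` holds the (soft) labels $\hat{Y}$.

```python
import numpy as np

def binary_class_loss(Y_hat, U, W):
    """Equation (8.2): logistic loss on the projected features U.
    Y_hat has shape (N,); W has shape (1, k)."""
    z = (U @ W.T).ravel()              # logits U_{i,:} W^T
    log_p = -np.logaddexp(0.0, -z)     # log(1 / (1 + exp(-z)))
    log_q = -np.logaddexp(0.0, z)      # log(1 / (1 + exp(z)))
    return float(-np.sum(Y_hat * log_p + (1.0 - Y_hat) * log_q))

def multiclass_class_loss(Y_hat, U, W):
    """Equation (8.3): soft-label cross-entropy with a softmax over C classes.
    Y_hat has shape (N, C); W has shape (C, k)."""
    logits = U @ W.T                                      # entries U_{i,:} W_{j,:}^T
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize the softmax
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.sum(Y_hat * log_softmax))
```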

$\hat{Y}$ for unlabeled data can be initialized using external oracles with probabilistic output, such as gradient boosting or logistic regression. In this work, we instead treat $\hat{Y}$ as a variable to learn, and estimate it by jointly optimizing the objective function with respect to $U$, $V$, $W$ and $\hat{Y}$. We assume that unlabeled data points with similar profiles are likely to share similar soft class labels $\hat{Y}$. By enforcing this assumption in the objective function design, we explicitly inject supervised information into the projection of both labeled and unlabeled machine-day observations. The complete optimization problem of SSNMF is shown in the following equation.

$$\begin{aligned}
U, V, W, \hat{Y} = \operatorname*{argmin}_{U, V \geq 0,\; W,\; 1 \geq \hat{Y} \geq 0} \quad & \|X - UV^T\|_o^2 + \alpha F(\hat{Y}, U, W) + \beta \,\mathrm{Tr}(\hat{Y}^T L \hat{Y}) + \gamma(\|U\|^2 + \|V\|^2) + \rho\|W\|^2 \\
\text{s.t.} \quad & \hat{Y}_i = Y_i \;\text{ if } X_{i,:} \text{ is labeled}
\end{aligned} \qquad (8.4)$$

The constraint in the objective function requires strict consistency between $\hat{Y}$ and the true class labels on labeled machine-day observations. $L$ is the graph Laplacian matrix defined on the K-nearest neighbor graph of the whole data matrix $X$. Minimizing the trace function $\mathrm{Tr}(\hat{Y}^T L \hat{Y})$ propagates the confidence of class membership from the true class labels of labeled machine-days to unlabeled machine-days, embedding class-separating information into the projection $U$ of unlabeled machine-days. The regularization terms $\gamma(\|U\|^2 + \|V\|^2)$ and $\rho\|W\|^2$ are added to prevent over-fitting.
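A minimal sketch of how these graph quantities could be built is shown below, using scikit-learn's `kneighbors_graph`; the neighbor count and the binary connectivity weights are illustrative assumptions, since the exact graph construction parameters are not fixed here.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def graph_laplacian_terms(X, Y_hat, n_neighbors=10):
    """Build the KNN weight matrix S, degree matrix D, Laplacian L = D - S,
    and the smoothness penalty Tr(Y_hat^T L Y_hat) of Equation (8.4)."""
    S = kneighbors_graph(X, n_neighbors=n_neighbors, mode="connectivity",
                         include_self=False)
    S = 0.5 * (S + S.T).toarray()                    # symmetrize the graph (an assumption)
    D = np.diag(S.sum(axis=1))
    L = D - S
    smoothness = float(np.sum(Y_hat * (L @ Y_hat)))  # equals Tr(Y_hat^T L Y_hat)
    return S, D, L, smoothness
```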

8.3.2 Optimization Algorithm

We use coordinate descent to optimize Equation (8.4). During each iteration, $U$, $V$, $W$ and $\hat{Y}$ are updated alternately: one of the four variables is updated while all the others are held fixed.

Iterations continue until the objective value cannot be further improved. $U$ and $V$ are updated using multiplicative updates [163], a popular optimization technique for solving many variants of NMF. Equation (8.5) gives the formulations of the multiplicative updates of $U$ and $V$:

$$\begin{aligned}
U^{t+1} &= U^t \odot \frac{[(X \odot M)V]^+ + [(UV^T \odot M)V]^- + \alpha[\hat{Y}W]^+ + \alpha[RW]^-}{[(X \odot M)V]^- + [(UV^T \odot M)V]^+ + \alpha[\hat{Y}W]^- + \alpha[RW]^+ + \gamma U} \\[4pt]
V^{t+1} &= V^t \odot \frac{(X \odot M)^T U}{(UV^T \odot M)^T U + \gamma V}
\end{aligned} \qquad (8.5)$$

where $[A]^+ = (|A| + A)/2$ and $[A]^- = (|A| - A)/2$. $R$ is the output of the sigmoid function in the binary classification case, $R_i = \frac{1}{1 + \exp(-U_{i,:}W^T)}$, or of the softmax function in the multi-class case, $R_{i,j} = \frac{\exp(U_{i,:}W_{j,:}^T)}{\sum_{j'=1}^{C}\exp(U_{i,:}W_{j',:}^T)}$. The operator $\odot$ denotes the Hadamard (entry-wise) product between matrices, and the divisions in Equation (8.5) are likewise entry-wise. $M$ is an entry-wise weight matrix: $M_{i,j} = 1$ if the entry $X_{i,j}$ is observed, and $M_{i,j} = 0$ otherwise.
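The following NumPy sketch performs one such multiplicative update of $U$ and $V$, using the $[\cdot]^+$ and $[\cdot]^-$ splits defined above; the small constant added to the denominators is our numerical safeguard rather than part of Equation (8.5).

```python
import numpy as np

def pos(A):   # [A]^+ = (|A| + A) / 2
    return (np.abs(A) + A) / 2.0

def neg(A):   # [A]^- = (|A| - A) / 2
    return (np.abs(A) - A) / 2.0

def update_U_V(X, M, U, V, W, Y_hat, R, alpha, gamma, eps=1e-12):
    """One multiplicative update of U and V following Equation (8.5).
    Y_hat and R have shape (N, C) ((N, 1) in the binary case); W has shape (C, k)."""
    XM = X * M
    UVM = (U @ V.T) * M
    num_U = pos(XM @ V) + neg(UVM @ V) + alpha * pos(Y_hat @ W) + alpha * neg(R @ W)
    den_U = neg(XM @ V) + pos(UVM @ V) + alpha * neg(Y_hat @ W) + alpha * pos(R @ W) + gamma * U
    U = U * num_U / (den_U + eps)

    UVM = (U @ V.T) * M                   # recompute with the updated U
    V = V * (XM.T @ U) / (UVM.T @ U + gamma * V + eps)
    return U, V
```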

Updating $\hat{Y}$ consists of two components. On one hand, the learning of $\hat{Y}$ is driven by supervision information propagation; on the other hand, the estimate of $\hat{Y}$ depends on the output of the sigmoid or softmax function, which reflects the influence of the data reconstruction penalty in the objective function. Equation (8.6) and Equation (8.7) define how to estimate $\hat{Y}$ in the binary and multi-class classification scenarios:

$$\hat{Y}_i = Y_i \;\text{ if } X_{i,:} \text{ is labeled}, \qquad \hat{Y}^{t+1} = \hat{Y}^t \odot \frac{\alpha \log\left(1 + \exp(UW^T)\right) + 2\beta S\hat{Y}}{\alpha \log\left(1 + \exp(-UW^T)\right) + 2\beta D\hat{Y}} \qquad (8.6)$$

$$\hat{Y}_i = Y_i \;\text{ if } X_{i,:} \text{ is labeled}, \qquad \hat{Y}^{t+1} = \hat{Y}^t \odot \frac{\alpha \log \hat{R} + 2\beta S\hat{Y}}{2\beta D\hat{Y}} \qquad (8.7)$$

where $\hat{R}_{i,j} = \frac{\exp(U_{i,:}W_{j,:}^T)}{\sum_{j'=1}^{C}\exp(U_{i,:}W_{j',:}^T)}$. $S$ is the weight matrix of the K-nearest neighbor graph, and $D$ is a diagonal matrix with $D_{i,i}$ defined as $\sum_{j=1}^{N} S_{i,j}$.
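A sketch of the binary-case update of $\hat{Y}$ in Equation (8.6) is given below; resetting the labeled rows implements the constraint $\hat{Y}_i = Y_i$, while the clipping step, which enforces $0 \leq \hat{Y} \leq 1$, and the small denominator guard are our additions.

```python
import numpy as np

def update_Y_hat_binary(Y_hat, Y, labeled_mask, U, W, S, D, alpha, beta, eps=1e-12):
    """One multiplicative update of the soft labels (Equation (8.6), binary case).
    Y_hat and Y have shape (N,); labeled_mask is a boolean array of shape (N,)."""
    z = (U @ W.T).ravel()                                             # logits U W^T
    num = alpha * np.logaddexp(0.0, z) + 2.0 * beta * (S @ Y_hat)     # log(1 + exp(z)) term
    den = alpha * np.logaddexp(0.0, -z) + 2.0 * beta * (D @ Y_hat)    # log(1 + exp(-z)) term
    Y_hat = np.clip(Y_hat * num / (den + eps), 0.0, 1.0)
    Y_hat[labeled_mask] = Y[labeled_mask]            # enforce Y_hat_i = Y_i on labeled rows
    return Y_hat
```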

Dataset                      #Machine Days   #Detected Incidents   #Events   Sparsity   #Incident Types
FW1: Firewall 1                       4506                   770      1011        98%                10
FW2: Firewall 2                       9254                  3093      1927        99%                12
FW3: Firewall 3                       4477                  1274      2019        98%                10
EP1: Endpoint Protection 1           18983                  4128      2409        99%                30
EP2: Endpoint Protection 2            8006                   904       988        97%                 5

Table 8.3: Summary of the training datasets (Jul-Sept) for the top five products that detect the most incidents.

By removing the terms that do not involve $W$ from Equation (8.4), the remaining terms of the objective function form an $L_2$-penalised logistic regression with soft class labels $\hat{Y}$ and training data points in the projected space $U$. Therefore, learning $W$ with $U$ and $\hat{Y}$ fixed can be performed through iterative gradient descent until convergence. We found that the number of iterations can be dramatically reduced by choosing the gradient step size adaptively using AdaGrad [169], as shown below:

$$W^{t+1} = W^t + \lambda \frac{G_{W^t}}{\sqrt{\sum_{t'=1}^{t-1} G_{W^{t'}}^2}} \qquad (8.8)$$

where $G_W$ is the gradient of Equation (8.4) with respect to $W$.
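One AdaGrad step for $W$, following the reconstruction of Equation (8.8), can be sketched as below; the element-wise accumulation of squared gradients and the small $\epsilon$ guard are standard AdaGrad ingredients and should be read as assumptions of this sketch.

```python
import numpy as np

def adagrad_update_W(W, grad_sq_sum, G_W, lam=0.1, eps=1e-8):
    """One AdaGrad step for W (Equation (8.8)).
    grad_sq_sum accumulates the squared gradients of previous iterations."""
    grad_sq_sum = grad_sq_sum + G_W ** 2
    W = W + lam * G_W / (np.sqrt(grad_sq_sum) + eps)
    return W, grad_sq_sum
```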

When $U$, $V$ and $W$ converge, $U_{1:N_1,:}$ and $U_{N_1:N,:}$ are used as the low-dimensional feature representations of the training and testing data, respectively. We then train a logistic regression on the row vector space of $U$ to conduct incident detection and categorization in Virtual Product. Note that we are not restricted to logistic regression; we choose it for its simplicity and probabilistic decision output. Despite this simplicity, it shows superior performance thanks to the learned feature representation $U$, as reported in the experimental study.
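As an example, the final classification stage could look like the scikit-learn sketch below; the function name and hyperparameters are illustrative choices.

```python
from sklearn.linear_model import LogisticRegression

def train_virtual_product_classifier(U, labels, n_train):
    """Fit a logistic regression on the learned SSNMF features.
    Rows U[:n_train] are the labeled training machine-days; the remaining rows
    come from customers' machines without product P deployed."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(U[:n_train], labels)
    # probabilistic incident predictions for the virtual-product machines
    return clf, clf.predict_proba(U[n_train:])
```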