behavior feature vector represents a type of malware behavior over a period of time, and similar interval behavior feature vectors are classified into the same cluster by the GHSOM algorithm. All interval behavior vectors are input to the GHSOM, so the vectors are appended to a malware dataset. Mall is a malware dataset consisting of many malware interval behavior sequences Mi. Each Mi contains the interval behavior vectors, up to bN/n, representing the malware behaviors in time order.
Mall = {M0, M1, M2, ..., Mi} (10)
After clustering with the unsupervised GHSOM algorithm, each behavior feature vector bi is assigned to a behavior cluster ci. The interval behavior feature vectors bi are then replaced with the interval behavior clusters ci to represent the program behaviors.
Mi = {c0, c1, c2, ..., cN/n} (13)
3.5 The Hierarchical SOM Encoding Method
Each behavior vector belongs to a cluster ci in a self-organizing map on a GHSOM layer; in other words, a cluster now represents a behavior feature vector.
Since the clustering result cannot be fed to an RNN directly, the interval behavior clusters ci must be transformed. To transform the GHSOM output into RNN input, this paper proposes three cluster encoding methods: the decimal encoding method, the weighted one-hot encoding method, and the map projection embedding method [32].
Figure 4: Decimal encoding method
3.5.1 Decimal Encoding Method
Because of its hierarchical architecture, the GHSOM produces a three-dimensional distribution of clusters, which is difficult to represent without compressing the data. Adjacent decimal places differ by a factor of ten, which provides a similar hierarchical structure, so the encoding proceeds through the layers of the GHSOM result step by step. As shown in Figure 4, the GHSOM consists of a top layer containing only one self-organizing map with clusters, and a middle layer and a bottom layer each containing several maps. The clusters in each self-organizing map are numbered in numerical order. A vector is classified only into a final cluster, in other words, a cluster without a child layer. For example, in Figure 4, the number with the yellow background in each layer is the cluster to which a vector belongs. We combine the cluster numbers from the three layers into one decimal number, '293', as the cluster representative. If a top- or middle-layer cluster does not contain a child layer, we append zero to the end of the decimal numeral. After all cluster representatives are generated, each decimal numeral is multiplied by 0.001 in order to obtain an accurate result.
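The decimal encoding described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `decimal_encode` is hypothetical, cluster numbers are assumed to start at 1, and each map is assumed to hold at most 9 clusters so that one decimal digit per layer is enough. Division by 1000 is used in place of multiplication by 0.001 to avoid floating-point noise; the two are numerically equivalent in intent.

```python
def decimal_encode(path, depth=3):
    """Encode a GHSOM cluster path as one scaled decimal number.

    path  -- cluster numbers from the top layer downward, e.g. [2, 9, 3]
             (assumption: 1-based, at most 9 clusters per map)
    depth -- number of GHSOM layers (three in the example of Figure 4)
    """
    if not 1 <= len(path) <= depth:
        raise ValueError("path must contain between 1 and `depth` clusters")
    if any(not 1 <= cluster <= 9 for cluster in path):
        raise ValueError("each cluster number must be a single digit 1..9")

    # A cluster without a child layer is padded with zeros at the end.
    digits = list(path) + [0] * (depth - len(path))

    # Each layer occupies one decimal place: [2, 9, 3] -> 293.
    value = 0
    for digit in digits:
        value = value * 10 + digit

    # Scale by 0.001 for three layers (10 ** -depth in general).
    return value / 10 ** depth


print(decimal_encode([2, 9, 3]))  # cluster path from Figure 4
print(decimal_encode([2, 9]))     # vector that stops at the middle layer
```

Under these assumptions, the path 2, 9, 3 from Figure 4 yields the representative 293 before scaling.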
3.5.2 Weighted One-hot Encoding Method
The weighted one-hot encoding method is derived from the decimal encoding method and uses the classification property of one-hot encoding to label the layer and the cluster position. We use the weighted one-hot encoding method to convert interval behavior clusters into interval behavior cluster vectors. Each interval behavior cluster vector consists of several one-hot vectors, one for each hierarchical SOM layer.
As shown in Figure 4, several self-organizing maps are located in each layer. The SOM with the most clusters sets the one-hot vector length for its layer; for example, a map with n clusters gives a vector of length n. If the cluster belongs to the map, the element at the index corresponding to the cluster's order in the map is set to 1 to form a one-hot cluster vector. If the cluster is not in the layer, one binary bit is appended to the vector to represent the nonexistent cluster. We then modify the one-hot cluster vectors with a weighting function, because plain one-hot vectors lose the hierarchical feature of the GHSOM. To recover the hierarchical feature, we set the value 1 to 4 in the first vector and to 2 in the second vector, and keep it as 1 in the third vector. Based on cosine similarity, a measure of the similarity between vectors, the similarity of weighted one-hot vectors is higher if the vectors are in the same layer. In Figure 4, there are 6 clusters in the first GHSOM layer, and the interval behavior cluster vector l1, in the second cluster, is encoded as 040000. The second layer contains 9 clusters, and the interval behavior cluster vector l2, in the last cluster of the SOM, is encoded as 000000002. The third layer contains 8 clusters, and the interval behavior cluster vector l3, in the third cluster, is encoded as 00100000. Sometimes the interval behavior cluster is not classified into the third layer, so we append a binary bit to the end of the encoded cluster vector to represent this case; here, the cluster vector becomes 001000000. Finally, we join all li together, which gives [040000000000002001000000], a complete 24-bit interval behavior cluster vector. li is the one-hot encoding result in layer i, and ci is the corresponding cluster of the GHSOM result. We show the binary encoding result map of the GHSOM with the hierarchical SOM binary encoding method.
Figure 5: The weighted one-hot encoding method
l1 = 040000 (19)
l2 = 000000002 (20)
l3 = 001000000 (21)
ci = {l1, l2, l3} (22)
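The worked example in Equations (19) to (22) can be reproduced with a short sketch. The names `weighted_one_hot` and `cosine` are hypothetical, and the layer sizes (6, 9, 8) and weights (4, 2, 1) are taken from the example in the text. One assumption is made to match the 24-element result: only the third layer carries the extra slot for a nonexistent cluster, and that slot is marked with the layer weight when the vector stops above the third layer.

```python
import math


def weighted_one_hot(path, sizes=(6, 9, 8), weights=(4, 2, 1),
                     absent_slot=(False, False, True)):
    """Concatenate one weighted one-hot block per GHSOM layer.

    path        -- 1-based cluster numbers from the top layer downward
    sizes       -- largest number of clusters per layer (from the example)
    weights     -- value that replaces the 1 in each layer's block
    absent_slot -- assumption: which layers get an extra trailing slot
                   marking "no cluster in this layer"
    """
    vector = []
    for layer, (size, weight, has_slot) in enumerate(
            zip(sizes, weights, absent_slot)):
        block = [0] * (size + (1 if has_slot else 0))
        if layer < len(path):
            index = path[layer]
            if not 1 <= index <= size:
                raise ValueError("cluster number out of range for layer")
            block[index - 1] = weight
        elif has_slot:
            block[-1] = weight  # vector was not classified into this layer
        vector.extend(block)
    return vector


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


example = weighted_one_hot([2, 9, 3])
print("".join(str(v) for v in example))

# Sharing the top-layer cluster counts for more than sharing a lower one.
same_top = weighted_one_hot([2, 1, 3])
same_mid = weighted_one_hot([3, 9, 3])
print(cosine(example, same_top), cosine(example, same_mid))
```

With these weights, two vectors that agree on the first-layer cluster obtain a higher cosine similarity than two vectors that agree only on the second-layer cluster, which is one way to read the hierarchical effect the weighting is meant to recover.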
We use the labels assigned by different antivirus software vendors, obtained from VirusTotal [33].
Because there is no naming standard for malware families, we use the tool from "AVClass: A Tool for Massive Malware Labeling" [34], which outputs the most suitable family name for each malware sample. First, AVClass lists all the labels assigned by the antivirus vendors. Then it removes duplicate malware labels, removes suffix characters, and tokenizes the labels. Finally, AVClass keeps the most representative tokens and selects the token with the highest number of occurrences as the family label of the malware. The AVClass paper reports that the clustering accuracy of the tool can be as high as 0.939, and that it fluctuates with the dataset.
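The labeling steps above amount to a plurality vote over label tokens. The sketch below illustrates only that idea and is not the AVClass implementation: the function name `family_label`, the `GENERIC_TOKENS` set, the minimum token length, and the sample labels are all hypothetical, and the real tool additionally handles aliases and vendor-specific rules.

```python
import re
from collections import Counter

# Hypothetical list of generic tokens that carry no family information.
GENERIC_TOKENS = {"trojan", "win32", "malware", "generic", "virus",
                  "agent", "variant", "backdoor", "worm"}


def family_label(av_labels, min_length=4):
    """Pick the most frequent informative token across vendor labels.

    av_labels -- mapping of vendor name to the label that vendor reported
    """
    # Step 1: drop duplicate labels (sorted for deterministic order).
    unique_labels = sorted({label.lower() for label in av_labels.values()})

    counts = Counter()
    for label in unique_labels:
        # Step 2: split each label into alphanumeric tokens.
        tokens = re.split(r"[^a-z0-9]+", label)
        # Step 3: keep tokens that look like family names; short suffixes
        # such as "gen" or "abc" and purely numeric tokens are dropped.
        informative = {t for t in tokens
                       if len(t) >= min_length
                       and not t.isdigit()
                       and t not in GENERIC_TOKENS}
        counts.update(informative)  # each token counts once per label

    if not counts:
        return None
    # Step 4: highest count wins; ties are broken alphabetically.
    return min(counts, key=lambda token: (-counts[token], token))


labels = {
    "VendorA": "Trojan.Win32.Zbot.abc",
    "VendorB": "Win32/Zbot.gen",
    "VendorC": "Trojan:Win32/Zbot",
    "VendorD": "Generic.Malware",
}
print(family_label(labels))
```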
3.7 Recurrent Neural Network
The input data is time-sequential, and on such data recurrent neural networks outperform many related statistical models, such as hidden Markov models [35]. Therefore, this paper uses a recurrent neural network to analyze the sequential interval behavior vectors.
We use a long short-term memory (LSTM) model to analyze the sequential interval behavior vectors, to judge whether a program is malicious [36], and to determine the family the malware belongs to. Because the sequence of system-call behavior vectors is long, a general recurrent neural network suffers from the vanishing gradient problem [14], so long short-term memory with a forget gate is chosen [17]. To facilitate input to the recurrent neural network, we use the hierarchical SOM binary encoding method from the previous section to fit the LSTM input requirement. We append all complete interval behavior cluster vectors of each malware sample to build the malware behavior dataset as a two-dimensional matrix Mall. Each Mi has a fixed length and comprises the interval behavior cluster vectors c^i_0, ..., c^i_{N/n}, whose superscript and subscript are the order of the malware in Mall and the order of the interval behavior cluster vector, respectively. The subscript N/n in c^i_{N/n} refers to the number of system calls N and the n of the n-gram from the previous section.
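A minimal sketch of assembling Mall from encoded sequences follows. The text states that each Mi has a fixed length but does not say how shorter or longer sequences are handled, so zero-padding and truncation are assumptions here, and the names `to_fixed_length` and `build_dataset` are hypothetical.

```python
def to_fixed_length(sequence, length, pad_vector):
    """Truncate or pad one malware's cluster-vector sequence.

    Assumption: sequences longer than `length` are cut at the end and
    shorter ones are padded with copies of `pad_vector`.
    """
    fixed = [list(vector) for vector in sequence[:length]]
    fixed += [list(pad_vector) for _ in range(length - len(fixed))]
    return fixed


def build_dataset(samples, length, dim):
    """Stack all sequences into M_all: one row per malware sample,
    one column per interval, each entry an encoded cluster vector."""
    zero = [0] * dim
    for sample in samples:
        for vector in sample:
            if len(vector) != dim:
                raise ValueError("every cluster vector must have length dim")
    return [to_fixed_length(sample, length, zero) for sample in samples]


# Two toy samples with 2-element cluster vectors (real ones have 24).
samples = [
    [[4, 0]],                  # one interval only
    [[4, 0], [0, 2], [0, 1]],  # three intervals
]
m_all = build_dataset(samples, length=2, dim=2)
print(m_all)
```

The resulting nested list has the shape (number of samples, sequence length, vector length), which is the layout commonly expected by LSTM implementations.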