Approach - Detecting Malicious Applications on the Android Platform Using Behavioral Similarity

This chapter details our system call sequence detection mechanism. Section 4.1 gives an overview of the system call sequence processing that we propose. Sections 4.2 and 4.3 describe the details of our approach. Finally, Section 4.4 discusses implementation issues.

4.1 Approach Overview

Figure 1. Flowchart of System Call Sequence Processing

Figure 1 shows an overview of our approach. To extract common behavior patterns, the approach proceeds in three phases: a recording phase, an extraction phase, and an evaluation phase. We describe each phase in turn.

Recording Phase

In this phase, we select each application in the set of repackaged application samples in turn and execute it individually, recording the system call requests issued while the Android system is running. We then extract the system call sequences Si of the i-th application from the full system call record, using the process ID of the application and the IDs of its child threads. Once every application has been processed, the recording phase yields the system call sequences of all applications.

Extraction Phase

The extraction phase extracts a set of common system call subsequences from all of the recorded sequences. Since the same malicious code triggers the same system call sequences, we apply the Longest Common Substring (LCSs) algorithm to the system call sequences of different repackaged applications. To improve efficiency, we also present a mechanism that reduces the time cost of extraction over multi-thread system call sequences. The phase yields a set of common system call subsequences, each identified by its index in the set. Note that what we extract are substrings, that is, consecutive parts of the system call sequences. We nevertheless call them subsequences, since every substring is also a subsequence and most related work uses the term subsequence for system call combinations.

Evaluation Phase

Not every extracted subsequence can serve as a detection pattern, because some system call subsequences appear in benign applications as well as in repackaged ones. For example, some extracted subsequences are too short to be specific to malicious code. In the evaluation phase, we therefore evaluate the appearance probability of each subsequence and filter out the non-discriminating ones. We use the Bayes probabilistic model, counting how often each subsequence appears in a set of benign applications and in the set of repackaged applications. Subsequences with a low probability are non-discriminating and are filtered out. What remains at the end of this phase is a set of subsequences that have a high probability of appearing in malicious applications and a low probability of appearing in benign applications. If an application matches one of these system call subsequence patterns at execution time, we can claim that the application is repackaged with the same malicious code.
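The matching step at the end of the evaluation phase can be illustrated with a minimal Python sketch. It assumes that a pattern matches when its system calls occur consecutively in a recorded sequence, in line with the substring extraction described above. The function name `contains_pattern` and the sample system call names are hypothetical and serve only as illustration.

```python
def contains_pattern(sequence, pattern):
    """Return True if `pattern` occurs as a consecutive run inside `sequence`."""
    sequence, pattern = list(sequence), list(pattern)
    m = len(pattern)
    if m == 0:
        return False  # an empty pattern carries no evidence
    # Slide a window of length m over the sequence and compare element-wise.
    return any(sequence[i:i + m] == pattern
               for i in range(len(sequence) - m + 1))


# Hypothetical runtime trace and detection patterns.
runtime_trace = ["open", "socket", "connect", "write", "close"]
print(contains_pattern(runtime_trace, ["socket", "connect"]))  # consecutive run
print(contains_pattern(runtime_trace, ["socket", "write"]))    # not consecutive
```

A linear scan like this is sufficient for short patterns. A larger pattern set would probably call for a multi-pattern matcher, which is outside the scope of this sketch.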

4.2 Extraction of System Call Subsequence

Since the common malicious behavior is the part that repackaged applications share most closely, the system call sequences collected from repackaged applications should also contain common subsequences. We aim to extract these common subsequences to identify the malicious behavior.

In the extraction phase, each system call name is treated as one element, so the LCSs algorithm finds the longest consecutive run of system calls that two different system call sequences have in common. Figure 2 illustrates how subsequences are extracted from multiple applications in the extraction phase, a procedure that consists of two parts.
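As an illustration of the longest common substring step, the following Python sketch uses the standard dynamic-programming formulation with system call names as elements. It is a minimal sketch rather than the implementation used in this work, and the function name and sample sequences are hypothetical.

```python
def longest_common_substring(a, b):
    """Return the longest run of consecutive elements shared by lists a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common run ending at a[i-2] and b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the run ending one step earlier
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Two hypothetical system call sequences from different applications.
seq_a = ["open", "read", "socket", "connect", "write", "close"]
seq_b = ["stat", "socket", "connect", "write", "open", "read"]
print(longest_common_substring(seq_a, seq_b))
```

The table costs time proportional to the product of the two sequence lengths, which is why the number of sequence pairs compared matters for the overall extraction cost.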

When a new sequence set is added to the procedure, the first part extracts subsequences between the new set and every processed sequence set in the queue. The second part extracts subsequences between the new set and every extracted subsequence in the storage, which makes the common subsequences more precise. After extraction, the procedure adds the application to the queue, saves all of the extracted subsequences to the storage, and moves on to the next target.
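The two-part procedure can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not part of the original description: each application contributes a single sequence, only the single longest common run per comparison is kept, and runs shorter than a hypothetical `min_len` are discarded. The longest common run is obtained from `difflib.SequenceMatcher` in the standard library.

```python
from difflib import SequenceMatcher


def common_run(a, b):
    """Longest run of consecutive system calls shared by a and b, as a tuple."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return tuple(a[m.a:m.a + m.size])


def add_application(new_seq, queue, storage, min_len=2):
    """Process one new sequence against the queue and the storage."""
    found = set()
    # Part 1: compare with every previously processed sequence in the queue.
    for old_seq in queue:
        run = common_run(new_seq, old_seq)
        if len(run) >= min_len:
            found.add(run)
    # Part 2: compare with every subsequence already extracted into storage.
    for pattern in list(storage):
        run = common_run(new_seq, pattern)
        if len(run) >= min_len:
            found.add(run)
    queue.append(new_seq)   # the application joins the processed queue
    storage.update(found)   # new subsequences are saved for later targets
    return found


queue, storage = [], set()
for seq in (["open", "read", "socket", "connect", "write"],
            ["stat", "socket", "connect", "write", "close"],
            ["socket", "connect", "read"]):
    add_application(seq, queue, storage)
print(sorted(storage))
```

In this toy run the third application shortens the stored pattern to the part it shares with the earlier ones, which is the refining effect that the second part is meant to provide.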

Layering Multi-Thread Comparison

Android applications usually have multiple threads, so a single application yields several system call sequences, one per thread. The simplest way to apply the LCSs algorithm to multiple sequences is to exhaust all combinations of two sequences. However, this incurs heavy overhead if an application has many child threads, because the number of comparisons grows with both the number of applications and the number of threads per application. To overcome this drawback, we present a Layering Multi-Thread Comparison (LMTC) mechanism that reduces the processing overhead on multi-thread system call sequences in the extraction phase. Since every child thread has a parent, we can build a thread tree for each application from the parent-child relationships among its threads, which separates the threads into several layers. We observe that the same behaviors usually appear at the same layer, even when they belong to different applications. Every application follows a regular order of work, and this order is reflected in the parent-child relationships of its threads. We therefore assume that the malicious code pattern appears at the same layer of the thread trees of different repackaged applications. The number of comparisons can then be reduced by extracting common subsequences only between system call sequences at the same layer, so the comparison cost is divided across the layers of the thread trees.
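The layering idea can be sketched in Python. The sketch assumes that the parent-child relationships of each application are available as a mapping from a thread ID to the IDs of its children. The function names, thread IDs, and the pairwise pairing scheme are hypothetical and are only one way to restrict comparisons to the same layer.

```python
from collections import defaultdict
from itertools import combinations


def layer_threads(root, children):
    """Group thread IDs by depth in the thread tree (breadth-first)."""
    layers = defaultdict(list)
    frontier, depth = [root], 0
    while frontier:
        layers[depth].extend(frontier)
        # The next layer holds the children of every thread in this layer.
        frontier = [c for t in frontier for c in children.get(t, [])]
        depth += 1
    return dict(layers)


def same_layer_pairs(app_layers):
    """List thread pairs from different applications that share a layer."""
    pairs = []
    for (app_a, la), (app_b, lb) in combinations(app_layers.items(), 2):
        for depth in sorted(la.keys() & lb.keys()):
            for ta in la[depth]:
                for tb in lb[depth]:
                    pairs.append((depth, (app_a, ta), (app_b, tb)))
    return pairs


# Two hypothetical thread trees: app "a" has 4 threads, app "b" has 2.
layers_a = layer_threads(1, {1: [2, 3], 2: [4]})
layers_b = layer_threads(10, {10: [11]})
pairs = same_layer_pairs({"a": layers_a, "b": layers_b})
print(len(pairs), "same-layer pairs instead of", 4 * 2, "exhaustive pairs")
```

Only threads at matching depths are paired, so a thread deep in one tree is never compared against the root of another.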

Figure 2. Example of Thread Trees

Figure 2 shows a concrete example: the thread trees of the malware “DroidDream Light”. The five thread trees are extracted from five different applications that are repackaged with the same malicious code. They have different tree structures and different numbers of child threads. Each node denotes an individual thread, and each arrow indicates a parent-child relationship between two threads. To extract common system call subsequences from these five thread trees, each extraction compares a chosen thread of one tree with threads from the other four trees. The total number of extraction combinations is 98784. With LMTC, only 5437 extraction combinations remain in this example: 1 combination in the first layer, 5400 combinations in the second layer, and 36 combinations in the third layer. The fourth layer contributes no combinations, since some thread trees have no threads in the fourth layer.
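The arithmetic of this example can be reproduced with a short Python sketch. It assumes that one extraction combination picks one thread from each of the five trees, so the number of combinations is a product of thread counts. The per-layer thread counts below are hypothetical: they were chosen only because they are consistent with the totals quoted above, and they are not read from Figure 2.

```python
from math import prod

# Hypothetical thread counts per layer (layers 1 to 4) for the five trees.
trees = [
    [1, 5, 1, 0],
    [1, 5, 1, 0],
    [1, 6, 4, 3],
    [1, 6, 3, 2],
    [1, 6, 3, 2],
]


def exhaustive_combinations(trees):
    """One thread from each tree, ignoring layers."""
    return prod(sum(layer_counts) for layer_counts in trees)


def layered_combinations(trees):
    """One thread from each tree, restricted to the same layer."""
    depth = max(len(t) for t in trees)
    return [prod(t[d] if d < len(t) else 0 for t in trees)
            for d in range(depth)]


per_layer = layered_combinations(trees)
print(exhaustive_combinations(trees), per_layer, sum(per_layer))
```

A layer in which any tree has no thread yields a product of zero, which matches the remark that the fourth layer contributes no combinations.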

4.3 Sequence Evaluation with Probability of Appearing

The extraction phase yields a set of system call subsequences, but not every extracted subsequence is guaranteed to be truly malicious, that is, absent from benign applications. We therefore evaluate the extracted subsequences by their probability of appearance under the Bayes probability model. The probability is calculated as

$$P(M \mid s) = \frac{P(s \mid M)\,P(M)}{P(s \mid M)\,P(M) + P(s \mid B)\,P(B)} \tag{1}$$

where $P(B)$ denotes the probability that a given application is a benign application and $P(M)$ denotes the probability that a given application is a repackaged application. $P(s \mid B)$ denotes the probability that the subsequence $s$ appears in benign applications, and $P(s \mid M)$ denotes the probability that the subsequence appears in repackaged applications.

This gives the probability $P(M \mid s)$ that an application is a repackaged application, given that it is detected by the specific system call subsequence. In the first step of the evaluation phase, we count the occurrences of each subsequence in benign applications and in malicious applications. In the second step, we calculate $P(M \mid s)$ by formula (1). After the calculation, we obtain the subsequences whose probability of appearing in malicious applications is higher than in benign applications. This information tells us which subsequences achieve higher accuracy in detecting repackaged applications.
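The two evaluation steps can be illustrated with a minimal Python sketch of Bayes' rule applied to occurrence counts. It assumes that the conditional probabilities are estimated as the fraction of applications in each set that contain the subsequence, and that the prior probability of an application being repackaged is supplied by the caller. The function name, the default prior of 0.5, and the sample counts are hypothetical.

```python
def repackaged_posterior(hits_malicious, n_malicious,
                         hits_benign, n_benign, prior_malicious=0.5):
    """Probability that an application is repackaged, given a subsequence match."""
    # Step 1: turn occurrence counts into conditional appearance probabilities.
    p_s_given_m = hits_malicious / n_malicious
    p_s_given_b = hits_benign / n_benign
    p_m = prior_malicious
    p_b = 1.0 - p_m
    # Step 2: apply Bayes' rule.
    evidence = p_s_given_m * p_m + p_s_given_b * p_b
    if evidence == 0:
        return 0.0  # the subsequence never appears, so it gives no evidence
    return p_s_given_m * p_m / evidence


# Hypothetical counts: seen in 8 of 10 repackaged and 1 of 100 benign samples.
print(round(repackaged_posterior(8, 10, 1, 100), 4))
# A subsequence that is equally common in both sets is non-discriminating.
print(repackaged_posterior(5, 10, 50, 100))
```

Subsequences whose result stays near the prior carry little information and are natural candidates for filtering. Where to set the cut-off is a design choice that this sketch does not settle.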

4.4 Environment Implementation

To execute a large number of applications, we deploy the Android emulator, which makes the execution environment easier to manage. Each time a new application is executed, we create a fresh environment and install only that application, which preserves the integrity and correctness of the environment. This prevents the behavior of an application from being distorted by other, unrelated applications that were executed earlier.

In the testing environment, the emulator runs Android version 2.1. According to our heuristic experiments, Android 2.1 is the most suitable environment for executing the samples.

System Call Recorder

Figure 3. Zygote Mode

To record system calls at system runtime, we use the tool “strace” [18], which traces system calls and signals on the Linux kernel. We change the boot configuration of the Android emulator so that strace attaches to Zygote. As illustrated in Figure 3, Zygote is a monitor process of the Android system whose main task is to handle requests to execute new Android applications. New applications are forked from Zygote and then executed. This procedure is called Zygote Mode. Since all Android applications are forked from Zygote, strace can monitor every executed Android application by tracing the child processes of Zygote. strace records the system calls of different threads into separate files, so each thread produces its own system call sequence. After a time quantum long enough to record the system calls has elapsed, we extract the system call sequence of the application that we want to monitor by its process ID. Child sequences are traced by checking whether a thread has forked any child threads, which shows up as clone and fork system calls in its system call sequence. The system call sequences of child threads are also extracted, so that the complete behavior is preserved.
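The extraction of an application's sequences from the per-thread records can be sketched in Python. The sketch assumes that every recorded line has the plain form `name(arguments) = result` and that a successful clone or fork returns the ID of the new thread or process. The per-thread records are passed in as an in-memory dictionary with hypothetical contents, whereas a real run would read them from the separate files that strace writes. Interrupted or resumed calls and other output formats are not handled.

```python
import re
from collections import deque

CALL_RE = re.compile(r"^(\w+)\(")             # system call name before "("
RESULT_RE = re.compile(r"\)\s*=\s*(-?\d+)")   # numeric result after ") ="
SPAWN_CALLS = {"clone", "fork"}


def parse_thread_trace(lines):
    """Return (system call names, child IDs) for one thread's record."""
    names, children = [], []
    for line in lines:
        line = line.strip()
        call = CALL_RE.match(line)
        if call is None:
            continue  # skip signals, exit notices, and malformed lines
        name = call.group(1)
        names.append(name)
        if name in SPAWN_CALLS:
            result = RESULT_RE.search(line)
            if result and int(result.group(1)) > 0:
                children.append(int(result.group(1)))
    return names, children


def collect_application_sequences(traces, root_id):
    """Follow clone/fork results from root_id and gather every sequence."""
    sequences = {}
    pending = deque([root_id])
    while pending:
        tid = pending.popleft()
        if tid in sequences or tid not in traces:
            continue
        names, children = parse_thread_trace(traces[tid])
        sequences[tid] = names
        pending.extend(children)
    return sequences


# Hypothetical records keyed by process or thread ID.
traces = {
    100: ['open("/data/app.apk", O_RDONLY) = 3',
          'clone(child_stack=0x7f00, flags=CLONE_VM|CLONE_THREAD) = 101',
          'read(3, "PK", 2) = 2'],
    101: ['socket(AF_INET, SOCK_STREAM, 0) = 4',
          'connect(4, {sa_family=AF_INET}, 16) = -1 ECONNREFUSED (Connection refused)'],
    200: ['getpid() = 200'],
}
print(collect_application_sequences(traces, 100))
```

Records of unrelated processes, such as ID 200 here, are ignored because they are never reached from the monitored process ID.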
