This chapter describes the key technologies used in the proposed scheme.
It also discusses and compares related work.
2.1 System Call and Sandbox
System Call
System calls are functions implemented by the OS kernel that provide the essential interface between a process and the OS. Processes in a system run in different modes, and a process running in user mode has no access to privileged instructions. When such a process needs a kernel service, it must request that service from the OS kernel through a system call.
Attackers usually embed malicious code in benign programs and then spread the compromised programs over the Internet. Although malware can disguise itself in many ways, it cannot avoid invoking system calls when it performs its critical actions. The system call sequences issued by malware of the same kind closely resemble one another, so we can trace a program's system call sequence to identify malware.
Sandbox
A sandbox is a virtualized platform that provides a tightly controlled set of resources and an isolated environment for executing a suspicious program. The sandbox can handle interactive procedures with the malware, so the malware can do whatever it wants, such as modifying or deleting files, duplicating itself to compromise the system, connecting to malicious servers, and downloading files. It is worth noting that these actions will affect neither the host system nor the machines in
whether the values of a certain system call argument are drawn from a limited set of possible alternatives, i.e., elements of an enumeration. These two anomaly-based detection methods must accumulate normal behaviors through large-scale experiments; otherwise, they incur a high false positive rate. Moreover, attackers might still evade these defenses.
Recently, Mehdi et al. [10] proposed a solution named IMAD that obtains fixed-length system call sequences using an N-gram segmentation model. The solution discriminates malicious behaviors from benign ones by checking whether a system call sequence is invoked only by malware, and it assigns each system call sequence a goodness value. If a sequence is invoked by both malware and benign programs, its goodness is evaluated by a genetic algorithm. These goodness values are used to calculate an overall impression value for a process: the higher the impression value, the more likely the process is declared benign. Later, Rozenberg et al. [6] processed system call sequences with the SPADE sequence mining methodology and a genetic algorithm. The system call sequences invoked only by malware are retained for later detection. Both methods share an Achilles' heel: they treat system call sequences as fixed-length, although different malicious behaviors may invoke different numbers of system calls.
Lin et al. [4] extracted the longest common system call substrings among malware of the same type to represent behaviors. They then employed a Bayes probability model to assign a value to each behavior. This value indicates maliciousness and can be used to detect obfuscated programs. Although this method handles variable-length system call sequences, it tends to miss many sequences, because its extraction procedure yields only one longest common subsequence (LCS) between two system call sequences.
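The fixed-length segmentation described above can be illustrated with a minimal Python sketch. It is not the IMAD implementation: the function names `syscall_ngrams` and `malware_only_ngrams` and the system call names in the usage note are hypothetical, and the goodness and impression values of [10] are not modeled. The sketch only shows how a trace is cut into windows of N calls and how the windows seen exclusively in malware traces can be collected.

```python
from typing import Iterable, List, Set, Tuple

Ngram = Tuple[str, ...]


def syscall_ngrams(trace: List[str], n: int) -> List[Ngram]:
    """Return every contiguous window of n system calls in the trace."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A trace shorter than n produces no window at all.
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def malware_only_ngrams(
    malware_traces: Iterable[List[str]],
    benign_traces: Iterable[List[str]],
    n: int,
) -> Set[Ngram]:
    """Collect the n-grams that occur in malware traces but in no benign trace."""
    seen_in_malware = {g for t in malware_traces for g in syscall_ngrams(t, n)}
    seen_in_benign = {g for t in benign_traces for g in syscall_ngrams(t, n)}
    return seen_in_malware - seen_in_benign
```

For example, with N = 2, the malware trace `["open", "write", "connect"]` and the benign trace `["open", "write", "close"]` share the window `("open", "write")`, so only `("write", "connect")` is kept. The sketch also makes the fixed-length limitation visible: a behavior that spans three calls can never be captured as one unit when N = 2.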
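To make the LCS idea concrete, the following Python sketch computes one longest common subsequence of two system call traces with the classical dynamic programming table. It is a textbook routine offered as an illustration, not the extraction procedure of [4]; the function name and the trace contents in the usage note are assumptions.

```python
from typing import List, Sequence


def longest_common_subsequence(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return one longest common subsequence of traces a and b."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the front to recover one optimal subsequence.
    result: List[str] = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return result
```

Called on `["open", "read", "write", "close"]` and `["open", "write", "connect", "close"]`, the function returns `["open", "write", "close"]`. It returns exactly one subsequence per pair of traces, which is the limitation noted above: any other common sequence shared by the two traces is never reported.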
As for sandbox-based efforts, Liu et al. [11] extracted the malicious behavior feature (MBF) of malware by observing processes on Windows systems and then used the MBF to detect malware. An MBF is a three-tuple of Feature_id, Mal_level, and Bool_expression. Feature_id is a string identifier that uniquely represents an MBF; Mal_level assigns an MBF one of three malicious levels: high, warning, or low; Bool_expression is a Boolean expression that defines the behavior of an MBF. The MBFs can be used to calculate the malicious degree of a suspicious program: the more MBFs the program conforms to, the more probable it is that the program is malware. Tsai and Wang [12] utilized three sandboxes, i.e., GFI sandbox [14], Norman sandbox [15], and Anubis sandbox [16], to obtain representative behaviors of a program. They selected 13 behaviors that malware frequently carries out but benign programs rarely do. Finally, this method constructed an MD expression using an ANN for malware detection. However, these two methods lack precise inspection: one declares only three malicious levels, and the other considers only behaviors related to malware.
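The MBF three-tuple can be pictured as a small record type. The Python sketch below is only an illustration of the structure described in the text: the field names follow Feature_id, Mal_level, and Bool_expression, but the representation of observed behavior as a dictionary of Boolean flags, the helper `matched_features`, and the flag names in the usage note are assumptions. How [11] turns matched MBFs into a malicious degree is not modeled here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class MalLevel(Enum):
    """The three malicious levels named in the text."""
    HIGH = "high"
    WARNING = "warning"
    LOW = "low"


@dataclass(frozen=True)
class MBF:
    feature_id: str                                   # unique string identifier
    mal_level: MalLevel                               # high, warning, or low
    bool_expression: Callable[[Dict[str, bool]], bool]  # defines the behavior


def matched_features(observed: Dict[str, bool], features: List[MBF]) -> List[str]:
    """Return the identifiers of the MBFs whose Boolean expression holds."""
    return [f.feature_id for f in features if f.bool_expression(observed)]
```

As a usage example with hypothetical flags, an MBF whose expression is `lambda e: e.get("writes_system_dir", False) and e.get("modifies_autorun", False)` matches an observation in which both flags are true, while an MBF that tests only `"opens_socket"` does not match that observation.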
The sandbox-based approach analyzes a program faster, whereas the system call-based approach analyzes a program at a finer granularity. To combine the strengths of both, we propose a hybrid system. In the faster 1st phase, we rely on a sandbox to obtain suspicious behaviors and then adopt an ANN to compute the MD value. In the slower 2nd phase, all common system call sequences are mined by the recursive LCS sequence mining method, and the common system call sequences that are more likely to be malicious are identified by the Bayes probability model. During the detection stage, only the to-be-detected programs that are not caught by the 1st phase are processed by the 2nd phase. In the 3rd phase, we define type vectors to recognize malware of known types or of an unknown type. We also differentiate between intrusive and non-intrusive malware by their exhibited behaviors.
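The control flow of the first two phases during detection can be sketched as a simple cascade. This Python sketch is only a schematic under stated assumptions: `md_value` and `sequence_score` are hypothetical stand-ins for the ANN-based MD computation and the Bayes-based sequence scoring, and the thresholds and the convention that a larger score means more suspicious are illustrative choices, not values taken from the scheme. The 3rd phase (type vectors) is not modeled.

```python
from typing import Callable, List, Set


def detect(
    behaviors: Set[str],
    syscall_trace: List[str],
    md_value: Callable[[Set[str]], float],
    sequence_score: Callable[[List[str]], float],
    md_threshold: float,
    seq_threshold: float,
) -> str:
    """Run the fast sandbox phase first; fall back to the slower phase only if needed."""
    # 1st phase: score the behaviors reported by the sandbox.
    if md_value(behaviors) >= md_threshold:
        return "malware (phase 1)"
    # 2nd phase: reached only by programs that the 1st phase did not catch.
    if sequence_score(syscall_trace) >= seq_threshold:
        return "malware (phase 2)"
    return "not flagged"
```

The point of the ordering is cost: the system call analysis is evaluated only for programs that pass the first check, so the slower phase runs on a smaller set of programs.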