3. SESSION EXTRACTION SYSTEM
3.2 Extract attack sessions from recorded traffic
One goal of this work is to extract a complete attack episode from a large volume of traffic. The session extraction algorithm is a three-pass algorithm designed for this goal: it associates packets with connections, and connections with sessions, to extract attack sessions.
Before describing the session extraction algorithm, Table 2 defines its components. The algorithm consists of the following five steps. Steps (i), (ii), (iii) and (v) are straightforward, while step (iv) is the essence of this work.
Table 2. Definitions of the components in the session extraction algorithm
Names                             Descriptions
Sip                               Source IP address
Sport                             Source port number
Dip                               Destination IP address
Dport                             Destination port number
Tcp/Udp                           The TCP or UDP packet flag
Payload                           The content of the packet
P                                 A TCP or UDP packet in the IP network
Tuple(Pi)                         The five-tuple of a packet
A                                 The anchor packet of the attack
PDA (Possible DoS Attacks)        The data structure that stores the packets that could be DoS attacks
PNDA (Possible Not DoS Attacks)   The data structure that stores the packets that might not be DoS attacks
(i) Replay real traffic to IDP products by Tcpreplay.
This algorithm uses the domain knowledge of IDP products, including the well-known open-source tool Snort [5]. An IDP product reports in its logs which attacks have occurred.
(ii) Find out anchor packets by the first-pass scan.
This step finds the anchor packets, the critical packets that trigger an alarm when an IDP product receives them. Two tables are used here. One is the alarm log table, which records the attack alarms raised during the replay of the attack traffic. The other is the replay log table, which records the time at which Tcpreplay replays each packet. The timestamps in the replay log table are matched against the alarm log table to label packets with attack types, and the two tables are then compared to identify the attack packets.
Time synchronization between the replay system and the IDP products could be a problem. Even if the clocks have been synchronized, IDP products may not log the times accurately. Therefore, the five-tuple information is also used. Many IDP products log the five-tuple information of an attack (some may record fewer than five fields).
The five-tuple information and the timestamp from the alarm log table and the replay log table can locate the anchor packets in the real traffic.
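The matching of the two log tables can be illustrated with a minimal Python sketch. The record layouts (`ReplayRecord`, `Alarm`), the use of `None` for fields an IDP product did not log, and the `max_delay` tolerance are assumptions made for illustration; they are not taken from any particular IDP product or from Tcpreplay.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# (Sip, Sport, Dip, Dport, protocol); None marks a field the IDP did not log.
PartialTuple = Tuple[Optional[str], Optional[int], Optional[str],
                     Optional[int], Optional[str]]


@dataclass
class ReplayRecord:
    index: int              # position of the packet in the recorded traffic
    time: float             # time at which the packet was replayed
    five_tuple: PartialTuple


@dataclass
class Alarm:
    time: float             # time logged by the IDP product
    attack: str             # attack name reported in the alarm
    five_tuple: PartialTuple


def fields_match(logged: PartialTuple, actual: PartialTuple) -> bool:
    """Compare only the fields that the alarm actually recorded."""
    return all(a is None or a == b for a, b in zip(logged, actual))


def find_anchor_packets(alarms: List[Alarm],
                        replay_log: List[ReplayRecord],
                        max_delay: float = 1.0) -> Dict[int, str]:
    """Map packet index -> attack name for each alarm that can be located.

    max_delay (seconds) is a hypothetical tolerance for clock skew and
    inaccurate IDP timestamps.
    """
    anchors: Dict[int, str] = {}
    for alarm in alarms:
        candidates = [r for r in replay_log
                      if fields_match(alarm.five_tuple, r.five_tuple)
                      and abs(alarm.time - r.time) <= max_delay]
        if candidates:
            # Among tuple matches, prefer the packet closest in time.
            best = min(candidates, key=lambda r: abs(alarm.time - r.time))
            anchors[best.index] = alarm.attack
    return anchors
```

In this sketch the five-tuple narrows the candidates first and the timestamp only breaks ties, which reflects the concern above that logged times may be inaccurate.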
(iii) Find out the association among attack packets within the same connection by the second-pass scan.
This step discovers the anchor connection by relating the recorded packets to the anchor packets. Packets that share the same five-tuple as an anchor packet belong to the same connection.
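A minimal Python sketch of this second pass follows. The `Packet` type mirrors the names in Table 2. Treating both directions of a flow as the same connection (by ordering the two endpoints) is an assumption of this sketch; the text only states that packets with the same five-tuple belong to the same connection.

```python
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    sip: str        # Sip
    sport: int      # Sport
    dip: str        # Dip
    dport: int      # Dport
    proto: str      # Tcp/Udp
    payload: bytes  # Payload


def connection_key(p: Packet) -> Tuple:
    """Direction-independent five-tuple (assumption: replies count too)."""
    a = (p.sip, p.sport)
    b = (p.dip, p.dport)
    return (p.proto,) + (a + b if a <= b else b + a)


def anchor_connection(packets: List[Packet], anchor: Packet) -> List[Packet]:
    """Return every recorded packet in the same connection as the anchor."""
    key = connection_key(anchor)
    return [p for p in packets if connection_key(p) == key]
```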
(iv) Find out the association among attack connections within the same session by the third-pass scan.
The attack connections can then be associated with their session. This association may be difficult because the relation among the connections is obscure. When an attack involves more than one connection, the five-tuple and timestamp alone are insufficient to find the other connections. The most obscure relation arises when multiple attackers each open a single connection, because the five-tuples of the packets from these attackers differ. A common attack of this type is the DDoS or DoS attack, which overwhelms a server to deny its capability of providing services. From our observation, such an attack often carries only TCP ACK or SYN messages, as well as a number of packets with the same data payload. The session extraction algorithm is designed based on this observation.
The algorithm parses the recorded traffic packet by packet and extracts an attack session by analyzing the attack types.
After the anchor packets of an attack have been found, the algorithm checks each following packet to see whether its source IP address or destination IP address is identical to the target IP address of the anchor packet. If not, the packet is classified as belonging to another type of attack. If the packet belongs to this attack, the algorithm compares its payload with that of the anchor packet for similarity. If the similarity is high, the algorithm places a copy of the packet in the possible DDoS attack buffer and increases the packet count by one. The similarity is defined according to the longest common subsequence (LCS) of two packet payloads [6]. Formally, given a sequence $X = \langle x_1, x_2, \ldots, x_m \rangle$, another sequence $Z = \langle x_{i_1}, x_{i_2}, \ldots, x_{i_k} \rangle$ is a subsequence of $X$ if there exists a strictly increasing sequence $\langle i_1, i_2, \ldots, i_k \rangle$ of indices of $X$. Given two sequences $X$ and $Y$, we say that a sequence $Z$ is a common subsequence of $X$ and $Y$ if $Z$ is a subsequence of both $X$ and $Y$. The longest common subsequence is the longest of all common subsequences.
Consider the payloads of two packets as two sequences of bytes, $S_1$ and $S_2$. The LCS of $S_1$ and $S_2$, $LCS(S_1, S_2)$, is the longest sequence of bytes that is a subsequence of both $S_1$ and $S_2$. The similarity is defined by the equation
$$\mathrm{Similarity}(S_1, S_2) = \frac{2\,|LCS(S_1, S_2)|}{|S_1| + |S_2|} \times 100\%$$
The similarity threshold is 80% in the proposed algorithm because the packets we collected in the DDoS or DoS attacks are often minimum-size Ethernet packets of 64 bytes. Excluding the 14-byte MAC header, 20-byte IP header, 20-byte TCP header and 4-byte checksum, the payload is only 6 bytes long. From our observation, the packet payloads of the DDoS or DoS attacks we collected are often the same, and when they differ the difference is only one byte. The similarity in this case is 2 × 5 / (6 + 6) = 83.33%, so the similarity threshold is set to 80%.
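The similarity measure can be sketched in Python with the standard dynamic-programming computation of the LCS length. The function names are illustrative, and returning 100% for two empty payloads is an assumption of this sketch, since the text does not define that case.

```python
def lcs_length(s1: bytes, s2: bytes) -> int:
    """Length of the longest common subsequence of two byte strings."""
    prev = [0] * (len(s2) + 1)          # DP row for the previous byte of s1
    for b1 in s1:
        curr = [0]
        for j, b2 in enumerate(s2, start=1):
            if b1 == b2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(s1: bytes, s2: bytes) -> float:
    """2 * |LCS| / (|S1| + |S2|), expressed as a percentage."""
    total = len(s1) + len(s2)
    if total == 0:
        return 100.0  # assumption: two empty payloads count as identical
    return 2 * lcs_length(s1, s2) / total * 100.0


# Two 6-byte payloads that differ in a single byte.
print(round(similarity(b"ABCDEF", b"ABCXEF"), 2))  # 83.33
```

The example reproduces the 6-byte case discussed above: the LCS has length 5, so the similarity is 83.33% and exceeds the 80% threshold.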
After identifying similar packets, the session extraction algorithm watches the source IP address and the destination IP address at the same time. The step keeps only the packets that come from the attacker and go to the target and those in the opposite direction. The others are simply dropped. This step intends to distinguish the attacks that possibly have one attacker from those that are possibly DDoS attacks.
The algorithm continues to watch the next packet until the end. The algorithm returns the packet count in the possible DDoS attack buffer. The attack might be a DDoS attack if the count is larger than 200, and might be a 1-1 attack otherwise. Figure 2 shows the flowchart of the algorithm.
The algorithm can be written as formulas and pseudo code as follows. We define a packet P as the set of its five-tuple and payload, P = {Sip, Sport, Dip, Dport, Tcp/Udp, Payload}. Tuple(Pi) = {Sip, Sport, Dip, Dport, Tcp/Udp} is the five-tuple of packet i, i ≥ 1. The anchor packet A is the set of the five-tuple and payload of a packet that causes the IDP products to raise an alarm when they receive it.
Therefore, the session extraction problem becomes the problem of finding the set of packets that have a highly similar payload to the anchor packet A, or the same source IP address and destination IP address as the anchor packet A. Let x be the sequence number of the anchor packet among all packets. The session extraction algorithm can be described as follows.
The pseudo code of the session extraction algorithm
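As an illustration, the third-pass scan described in step (iv) might be sketched in Python as below. This is a reading of the prose description under stated assumptions, not a definitive implementation: the `Packet` type, the function names, the use of `>=` for the 80% threshold, and the choice of which buffer is returned as the session are all assumptions of this sketch. The thresholds of 80% and 200 packets are the values given in the text.

```python
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    sip: str
    sport: int
    dip: str
    dport: int
    proto: str
    payload: bytes


SIMILARITY_THRESHOLD = 80.0   # percent, from the text
DDOS_COUNT_THRESHOLD = 200    # packets, from the text


def lcs_length(s1: bytes, s2: bytes) -> int:
    prev = [0] * (len(s2) + 1)
    for b1 in s1:
        curr = [0]
        for j, b2 in enumerate(s2, start=1):
            curr.append(prev[j - 1] + 1 if b1 == b2
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(s1: bytes, s2: bytes) -> float:
    total = len(s1) + len(s2)
    # Assumption: two empty payloads are treated as identical.
    return 100.0 if total == 0 else 2 * lcs_length(s1, s2) / total * 100.0


def extract_session(packets: List[Packet], x: int) -> Tuple[str, List[Packet]]:
    """Scan the packets after anchor index x and classify the attack."""
    anchor = packets[x]
    attacker, target = anchor.sip, anchor.dip
    pda: List[Packet] = []    # Possible DoS Attacks buffer
    pnda: List[Packet] = []   # Possible Not DoS Attacks buffer
    for p in packets[x + 1:]:
        if target not in (p.sip, p.dip):
            continue          # unrelated to this target; left to other attacks
        if similarity(p.payload, anchor.payload) >= SIMILARITY_THRESHOLD:
            pda.append(p)     # similar payload: candidate DDoS packet
        if {p.sip, p.dip} == {attacker, target}:
            pnda.append(p)    # attacker <-> target traffic in either direction
    if len(pda) > DDOS_COUNT_THRESHOLD:
        return "ddos", [anchor] + pda
    return "one-to-one", [anchor] + pnda
```

A caller would pass the recorded traffic and the index of an anchor packet found in the first pass, then replay the returned packet list for verification.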
(v) Replay the extracted attack session to IDP products to verify whether the same logs are generated. If it is true, the extraction is valid.
Finally, we replay the extracted attack sessions to the IDP products to verify the correctness of the extraction. An extracted session must cause the same alarms as when the whole traffic is replayed to the same IDP product. If the IDP product cannot find the attack, the extraction is invalid.
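The verification check can be sketched as a comparison of alarm records from the two replays. The alarm record shape (attack name, source IP, destination IP) and the decision to ignore timestamps, which necessarily differ between replays, are assumptions of this sketch rather than the log format of any specific IDP product.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical alarm record: (attack name, source IP, destination IP).
AlarmRecord = Tuple[str, str, str]


def extraction_is_valid(full_alarms: Iterable[AlarmRecord],
                        session_alarms: Iterable[AlarmRecord],
                        attack: str) -> bool:
    """True if the extracted session reproduces the alarms of `attack`.

    Alarms are compared as multisets, so a repeated alarm must be
    reproduced the same number of times.
    """
    expected = Counter(a for a in full_alarms if a[0] == attack)
    observed = Counter(a for a in session_alarms if a[0] == attack)
    # An attack that raised no alarm in the full replay cannot be verified.
    return bool(expected) and expected == observed
```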