After acquiring the behavior profiles, we run the data analysis framework, which is shown in Figure 3.1. Because we have obtained the domain knowledge of network behaviors from experts, some enhancements can be made to improve the framework.
5.1: Feature Vector Integration
Because studies adopt different network traffic data formats, they use different methods and obtain different analysis results. To take advantage of these different analytical methods, integrating multiple data sources is an important step. The data sources come in different formats, so they must be preprocessed and transformed into integrated data with a common format.
Two algorithms proposed in [27] are used to integrate different data sources:
• Multi-Source Data Format Integration (MSDFI) Algorithm: First, the concept of a new data source is generalized to the connection level; if it is already at the connection level, the generalization is omitted. Second, the features of the new data source, which may be of different types, are added into the integrated data source. Finally, the integrated data format can be used to merge multiple network traffic data sources.
• Data Source Transformation (DST) Algorithm: It integrates multiple network traffic data sources according to the integrated data format that the MSDFI algorithm generates.
After the DST process, we get a feature vector. Some features can also be identified in the behavior profiles; they are integrated with it to generate a new feature vector. A simple process of feature vector integration is described below, after the following notation is defined.
• F: a temporary feature set
• f: a feature
• FV: the feature vector generated by DST
First, features are identified from the behavior profiles. If an attribute has a specific numerical value, a feature f named “behavior_name count” is added to the temporary feature set F, and the attribute having that value is the condition of the corresponding feature. Second, the temporary feature set F is integrated with the feature vector FV: for each feature f in F, if f is not in FV, then f is added to FV.
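The two steps above can be sketched in Python as follows. This is only a minimal illustration: the dictionary layout of a behavior profile, the function names `identify_features` and `integrate`, and the column names in `FV` are assumptions made for the example and are not taken from [27].

```python
def identify_features(behavior_profiles):
    """Step 1: build the temporary feature set F from behavior profiles.

    The profile layout (a dict with "name" and "attributes") is hypothetical.
    Returns a dict mapping feature name -> condition string.
    """
    features = {}
    for profile in behavior_profiles:
        conditions = []
        for attribute, value in profile["attributes"].items():
            # Only attributes with a specific numerical value yield a condition.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                conditions.append(f"{attribute}={value}")
        if conditions:
            features[f"{profile['name']} count"] = " and ".join(conditions)
    return features


def integrate(feature_vector, features):
    """Step 2: add every feature of F that is not already in FV."""
    integrated = list(feature_vector)
    for name in features:
        if name not in integrated:
            integrated.append(name)
    return integrated


# Hypothetical inputs: one profile and a small DST feature vector.
profiles = [{"name": "Ping flood", "attributes": {"protocol": "ICMP", "type": 8}}]
FV = ["SrcIP", "DstIP", "Packet Number", "Packet Size"]

F = identify_features(profiles)
integrated_fv = integrate(FV, F)
```

Because a feature is added only when it is missing from FV, applying `integrate` a second time leaves the vector unchanged.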
Take “Ping flood” as an example, as shown in Figure 4.10. The condition “type=8” can be found. After the features are identified, they are integrated into the original feature vector generated by DST.
In the data warehouse, the fact table is where the integrated network traffic data are stored. Network raw data are transformed into connection feature vectors in Data Transformation, and then integrated with the features identified from behavior profiles. The integrated feature vectors are stored in the fact table without generalization or aggregation. The format of the fact table is the same as the feature vector mentioned in the Data Preprocessing Phase. Some fields are related to dimension tables and the others are measures. In particular, features identified from behavior profiles are measures in the fact table. The format of the Network Traffic Fact Table is illustrated in Table 5.1.
5.2: Concept Hierarchy and Data Warehouse Construction
If network traffic concept hierarchies for the integrated data source are constructed through the Knowledge Acquisition process, the integrated network traffic data can be analyzed at different concept levels. For example, the behavior of a host can be evaluated by analyzing the IP dimension at the IP-address concept level, while the behavior of a subnet can be evaluated by analyzing network traffic at the class-c concept level once the concept hierarchies are constructed. With concept hierarchies and a data cube, evaluating behaviors at every concept level of the IP dimension is natural because of the roll-up and drill-down operations that the OLAP server offers. Without them, administrators have to search a flat data source manually for the network traffic data of a subnet in order to evaluate its behavior.
Analyzing network behaviors at every concept level of every dimension therefore becomes easier with the assistance of the constructed concept hierarchies and data cube.
5.2.1: Concept Hierarchy Construction
Here, the Dimension Concept Hierarchy Knowledge Acquisition (DCHKA) algorithm proposed in [27] is used to construct concept hierarchies. The input of DCHKA is the feature vector format generated in the Data Preprocessing Phase. A concept hierarchy is constructed from bottom to top because the original data collected in the previous phase are at the lowest concept level. Experts are guided to generalize concepts from lower concept levels to higher ones and to define the mapping relations between values appearing at the lower and higher concept levels. The steps of the DCHKA algorithm are repeated for each dimension in the feature vector format until a concept hierarchy has been constructed for every dimension. An example of constructed concept levels is shown in Figure 5.1. With the help of expertise in the form of concept hierarchies, behaviors at different concept levels can be evaluated and analyzed.
Table 5.1 Fact Table
Figure 5.1 Concept levels of each dimension of network traffic data
5.2.2: Data Warehouse Construction
After the dimension concept hierarchies are constructed, a data cube schema needs to be selected in order to build the network traffic data cube. The most common modeling paradigm is the star schema, in which the data warehouse contains (i) a large central table (fact table) containing the bulk of the data, with no redundancy, and (ii) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table. For network traffic data, the star schema is the most suitable schema for the relation between raw data and concept hierarchies.
The steps after selecting the data cube schema are the selection of dimensions and measures from the fact table. An example of dimensions is shown in Figure 5.1. Features that are used to evaluate behaviors can be chosen as measures. In network traffic data, features such as Packet Number, Packet Size, Connection Number, Number of Ping packets, and Number of SYN packets are chosen as measures.
Measures are aggregated when the concept level is generalized from a lower concept level to a higher one. Therefore, network behaviors can be evaluated by measures from different dimensions. For example, total packet size can be used to evaluate the behavior of a host or a subnet.
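The aggregation of a measure during roll-up can be illustrated with a small Python sketch. It assumes that the class-c concept level of an IPv4 address is represented by its first three octets and that packet size is the measure; the helper names `class_c` and `roll_up` and the sample records are hypothetical.

```python
from collections import defaultdict


def class_c(ip):
    """Map an IPv4 address to an assumed class-c value (first three octets)."""
    return ".".join(ip.split(".")[:3])


def roll_up(records, to_level):
    """Sum the packet-size measure per value of the chosen concept level."""
    totals = defaultdict(int)
    for ip, packet_size in records:
        totals[to_level(ip)] += packet_size
    return dict(totals)


# Hypothetical fact records: (source IP, packet size in bytes).
records = [
    ("10.113.87.175", 1200),
    ("10.113.87.20", 300),
    ("10.113.88.4", 500),
    ("10.113.87.175", 800),
]

per_host = roll_up(records, lambda ip: ip)   # IP-address concept level
per_subnet = roll_up(records, class_c)       # class-c concept level
```

Here `per_host` holds the total packet size of each host, and `per_subnet` holds the same measure aggregated one concept level higher, which is the kind of result a roll-up operation on the IP dimension would return.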
Following the steps mentioned above, a star schema for the network traffic data cube can be constructed, as shown in Figure 5.2.
Figure 5.2 Cube schema for constructing data warehouse
5.2.3: Dimension Table Maintenance
Dimensions in network traffic data have the characteristic that the number of values at each concept level is large. For example, the number of all possible IP addresses is 256*256*256*256, but only a tiny fraction, ranging from thousands to tens of thousands, appears in our network traffic data. It is wasteful and impractical to maintain all IP addresses in the Source IP or Destination IP dimension table. Only IP addresses that communicate with monitored hosts are maintained in the dimension table. When a new IP address appears, it is added to the dimension table and its corresponding higher concept level value is checked. If that higher concept level value does not exist in the dimension table either, it is added as well. As time goes on, the dimension table becomes very large, so a method to decrease its size, such as deleting IP addresses that have not appeared for a long time, is needed.
Other dimensions such as Alert and Port have a similar characteristic. It is unnecessary to store in dimension tables values that never appear, or have not appeared for a long time, in the network traffic data. Doing so lowers the performance of the OLAP server because of unnecessary join time and query time. Dimension tables with this characteristic should therefore be adjusted dynamically for higher system performance.
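One possible way to maintain such a dimension table dynamically is sketched below in Python. The class `IPDimensionTable`, the idle-time limit, and the use of the first three octets as the higher concept level value are all assumptions made for illustration; the thesis does not prescribe a particular pruning policy.

```python
class IPDimensionTable:
    """In-memory sketch of an IP dimension table that grows and shrinks."""

    def __init__(self, max_idle_seconds):
        self.max_idle_seconds = max_idle_seconds
        self.last_seen = {}  # IP address -> time it last appeared
        self.parent = {}     # IP address -> higher concept level value

    def observe(self, ip, now):
        """Add a newly appearing IP address and record when it was seen."""
        if ip not in self.last_seen:
            # Assumed mapping to the class-c level: first three octets.
            self.parent[ip] = ".".join(ip.split(".")[:3])
        self.last_seen[ip] = now

    def prune(self, now):
        """Delete IP addresses that have not appeared for too long."""
        stale = [ip for ip, seen in self.last_seen.items()
                 if now - seen > self.max_idle_seconds]
        for ip in stale:
            del self.last_seen[ip]
            del self.parent[ip]
        return stale

    def subnets(self):
        """Higher concept level values still referenced by some IP address."""
        return set(self.parent.values())


table = IPDimensionTable(max_idle_seconds=3600)
table.observe("10.113.87.175", now=0)
table.observe("10.113.88.4", now=3000)
removed = table.prune(now=4000)
```

In this sketch a higher-level value disappears from `subnets()` as soon as no remaining IP address maps to it, so both levels of the dimension stay limited to values that actually occur in recent traffic.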
5.3: Data Analysis
In the original framework, administrators need to construct the meta-data and then use it to find the interesting data set. Constructing the meta-data requires administrators to manipulate the Guided Monitoring Interface (GMI) manually, and the resulting interesting data set is highly dependent on how much domain knowledge the administrators have. In order to reduce the administrators’ effort, we use behavior profiles to enhance the original GMI. Behavior profiles, which represent the domain knowledge, can be used to generate part of the meta-data.
Figure 5.3 Comparison between original and enhanced workflow
The comparison between the original framework and the enhanced one is shown in Figure 5.3. As mentioned above, in the original workflow, cube meta-data was constructed according to the requirements of administrators, guided by the Cube Meta-data Construction Algorithm (CMCA). In the enhanced workflow, the acquired behavior profiles can be used to select behavior data from the original huge cube.
The general process of CMCA is shown in Figure 5.4. CMCA is only a general guide for manipulating the GMI; the detailed decisions made during CMCA are highly dependent on the experience of administrators. Therefore, a junior administrator may not be able to identify suspicious network behaviors as quickly as possible.
The domain knowledge has been acquired as behavior profiles, which record which dimension should be chosen and which specific value should be monitored for the corresponding behavior. Hence, we can use the behavior profiles to reduce the analysis effort of administrators, especially junior ones.
Figure 5.4 General process step of CMCA
As mentioned above, the first and the third steps in Figure 5.4 can be carried out with the support of domain knowledge. Therefore, the administrators’ effort can be reduced.
5.3.1: Behavior Model Transformation
In order to select the data of network behaviors directly from a database or data warehouse, data queries need to be generated. Since we have the knowledge of network behavior models, it can be used to generate the data queries. Here we propose a process that transforms behavior profiles into the corresponding data queries, as shown in Algorithm 3.
Input: Behavior Profiles
Step 1: Generate the basic query according to the connection mode
    Switch (connection mode) {
    Case “Many_to_1 [X]” :
        Query = Select DstIP , count(DISTINCT SrcIP)
                From traffic record
                Group by DstIP
                Having count(DISTINCT SrcIP) > X
    Case “1_to_Many [X]” :
        Query = Select SrcIP , count(DISTINCT DstIP)
                From traffic record
                Group by SrcIP
                Having count(DISTINCT DstIP) > X
    }
Step 2: Choose the data source according to the protocol of the behavior
    traffic record = protocol traffic data
Step 3: Add the constraints into the query
    • Flag constraint
        ‧ Add the condition to the “where” clause
    • Threshold constraint
        ‧ If the word “DISTINCT” does not appear in the constraint
            ‧ If connection mode is many_to_1
                ‧ Add count(SrcIP) as behavior count to the “select” clause
                ‧ Add the threshold of count(SrcIP) to the “Having” clause
            ‧ If connection mode is 1_to_many
                ‧ Add count(DstIP) instead of count(SrcIP)
        ‧ Else
            ‧ Add count(DISTINCT Flag_name) as behavior count to the “select” clause
            ‧ Add the threshold of count(DISTINCT Flag_name) to the “Having” clause
Algorithm 3: Behavior Model Transformation Algorithm
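The three steps of Algorithm 3 can be sketched as a small Python function that assembles the query text. This is a hedged illustration rather than the implementation used in this work: the dictionary layout of a behavior profile, the key names (`connection_mode`, `x`, `flags`, `threshold`), the table naming pattern `<protocol>_traffic_record`, and the alias format are assumptions introduced for the example.

```python
def build_query(profile):
    """Assemble an SQL string from one behavior profile (hypothetical layout).

    Values are formatted directly into the text, so this sketch assumes the
    profiles come from a trusted source.
    """
    # Step 1: basic query according to the connection mode.
    mode = profile["connection_mode"]
    if mode == "many_to_1":
        group_key, counted = "DstIP", "SrcIP"
    elif mode == "1_to_many":
        group_key, counted = "SrcIP", "DstIP"
    else:
        raise ValueError(f"unknown connection mode: {mode}")
    select = [group_key, f"count(DISTINCT {counted})"]
    having = [f"count(DISTINCT {counted}) > {profile['x']}"]

    # Step 2: data source according to the protocol.
    table = f"{profile['protocol']}_traffic_record"

    # Step 3a: flag constraints go into the WHERE clause.
    where = [f"{flag} = {value}"
             for flag, value in profile.get("flags", {}).items()]

    # Step 3b: threshold constraint adds a behavior count and a HAVING term.
    threshold = profile.get("threshold")
    if threshold is not None:
        distinct_flag = threshold.get("distinct")
        if distinct_flag:
            expr = f"count(DISTINCT {distinct_flag})"
        else:
            expr = f"count({counted})"
        alias = profile["name"].replace(" ", "_") + "_count"
        select.append(f"{expr} AS {alias}")
        having.append(f"{expr} > {threshold['value']}")

    sql = f"SELECT {', '.join(select)} FROM {table}"
    if where:
        sql += f" WHERE {' AND '.join(where)}"
    sql += f" GROUP BY {group_key} HAVING {' AND '.join(having)}"
    return sql


# Hypothetical profile for ping flood with X = 50 and a packet threshold.
ping_flood = {
    "name": "Ping flood",
    "connection_mode": "many_to_1",
    "protocol": "ICMP",
    "x": 50,
    "flags": {"type": 8},
    "threshold": {"value": 1000000},
}
query = build_query(ping_flood)
```

Keeping the select, where, and having parts in separate lists mirrors the way Step 3 adds constraints clause by clause before the final query text is joined.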
Take the behavior profile shown in Figure 4.10 as an example. The connection type of ping flood is “many_to_1”, and suppose the default threshold of “many” is X, that is, the number of distinct source IPs connecting to one destination IP above which the behavior is treated as suspicious. The protocol used in ping flood is “ICMP”, and the constraint of ping flood is that, for every packet in the ping flood, the value of the flag “Type” is 8. Moreover, administrators can set a threshold Y on the number of ping packets to be treated as a ping flood. Hence, the corresponding data query is as follows:
Select DstIP , count(DISTINCT SrcIP), count(SrcIP) as Ping Flood Count
From ICMP traffic record
Where type=8
Group by DstIP
Having count(DISTINCT SrcIP) > X and count(SrcIP) > Y
A sample result of the above query, for X=50 and Y=1000000, is shown in Table 5.2.
Table 5.2 The result of data query
DstIP | # of SrcIP | Ping flood count
10.113.87.175 | 58 | 1000670
… | … | …
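To see how such a query behaves, the following Python sketch runs it against a tiny in-memory SQLite table. The table name `ICMP_traffic_record`, its three columns, the sample packets, and the small thresholds X=2 and Y=5 are assumptions chosen so the example stays short; they are not the data behind Table 5.2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ICMP_traffic_record (SrcIP TEXT, DstIP TEXT, type INTEGER)"
)

rows = []
# Three distinct sources each send two echo requests (type 8) to one host.
for src in ("192.0.2.1", "192.0.2.2", "192.0.2.3"):
    rows += [(src, "10.113.87.175", 8)] * 2
# One source pings another host once; an echo reply (type 0) is filtered out.
rows.append(("192.0.2.1", "10.113.87.20", 8))
rows.append(("10.113.87.175", "192.0.2.1", 0))
conn.executemany("INSERT INTO ICMP_traffic_record VALUES (?, ?, ?)", rows)

X, Y = 2, 5  # illustrative thresholds, far smaller than realistic ones
result = conn.execute(
    """
    SELECT DstIP, count(DISTINCT SrcIP), count(SrcIP) AS Ping_Flood_Count
    FROM ICMP_traffic_record
    WHERE type = 8
    GROUP BY DstIP
    HAVING count(DISTINCT SrcIP) > ? AND count(SrcIP) > ?
    """,
    (X, Y),
).fetchall()
```

Only the destination that receives echo requests from more than X distinct sources and more than Y packets in total remains in `result`; the host that was pinged once is dropped by the Having clause.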
5.3.2: Hierarchical Relation
After network behaviors are acquired, hierarchical relations can be identified from the behavior profiles. A hierarchical relation is identified by checking whether one behavior has more detailed constraints than another behavior with the same connection type and protocol values in its behavior profile, as mentioned in Section 4.3. The relation can be used to reduce the effort of data queries.
Figure 5.5 Original data query and enhanced one
A hierarchical relation between two network behaviors can be used to simplify the data query of the behavior that is the subset in the relation. Suppose that two monitored behaviors A and B have a hierarchical relation in which B is a subset of A. The data set of A can be looked up by the corresponding data query generated by Algorithm 3. Afterward, the data query of behavior B can be reduced by looking up data from the data set of A. The constraints in the data query of B are then just the additional constraints that do not appear in the behavior profile of A. As shown in Figure 5.5, without the hierarchical relation, data set B is queried from the huge data cube by the original data query. Once the hierarchical relation is known, data set B can be looked up from data set A. The time to look up data set B with the reduced query is therefore shorter, because data set A is much smaller than the original data cube.
For example, Smurf flood is a subset of ping flood. Without knowing this hierarchical relation, the corresponding data queries of the two behaviors are shown below:
I. Data query of ping flood:
Select DstIP , count(DISTINCT SrcIP), count(SrcIP) as Ping Flood Count
From ICMP traffic record
Where type=8
Group by DstIP
Having count(DISTINCT SrcIP) > X and count(SrcIP) > Y
II. Data query of Smurf flood:
Select DstIP , count(DISTINCT SrcIP), count(SrcIP) as Smurf Flood Count
From ICMP traffic record
Where type=8 and DstIP like ‘%.255’
Group by DstIP
Having count(DISTINCT SrcIP) > X and count(SrcIP) > Y
Once it is known that Smurf flood is a subset of ping flood, the data query of Smurf flood can be simplified after the data set of ping flood has been queried out. The modified data queries are as follows:
I. Data query of ping flood:
Create view Ping Flood record as
Select DstIP , count(DISTINCT SrcIP), count(SrcIP) as Ping Flood Count
From ICMP traffic record
Where type=8
Group by DstIP
Having count(DISTINCT SrcIP) > X and count(SrcIP) > Y
II. Data query of Smurf flood:
Select *
From Ping Flood record
Where DstIP like ‘%.255’
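The reduced query can be tried out with the following Python sketch on an in-memory SQLite database. The table and view names, the column alias `SrcIP_Count`, the sample packets, and the thresholds X=2 and Y=2 are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ICMP_traffic_record (SrcIP TEXT, DstIP TEXT, type INTEGER)"
)
rows = []
for src in ("192.0.2.1", "192.0.2.2", "192.0.2.3"):
    rows.append((src, "10.113.87.175", 8))  # echo requests to a single host
    rows.append((src, "10.113.87.255", 8))  # echo requests to a broadcast address
conn.executemany("INSERT INTO ICMP_traffic_record VALUES (?, ?, ?)", rows)

X, Y = 2, 2  # illustrative thresholds
# SQLite does not accept bound parameters inside CREATE VIEW, so the
# thresholds are formatted into the statement as plain integers.
conn.execute(
    f"""
    CREATE VIEW Ping_Flood_record AS
    SELECT DstIP,
           count(DISTINCT SrcIP) AS SrcIP_Count,
           count(SrcIP) AS Ping_Flood_Count
    FROM ICMP_traffic_record
    WHERE type = 8
    GROUP BY DstIP
    HAVING count(DISTINCT SrcIP) > {int(X)} AND count(SrcIP) > {int(Y)}
    """
)

# Behavior A (ping flood): the whole view.
ping = conn.execute(
    "SELECT * FROM Ping_Flood_record ORDER BY DstIP"
).fetchall()
# Behavior B (Smurf flood): only the additional constraint is needed.
smurf = conn.execute(
    "SELECT * FROM Ping_Flood_record WHERE DstIP LIKE '%.255'"
).fetchall()
```

Note that an ordinary view is re-evaluated each time it is queried. If the ping flood data set should really be computed once and then reused, storing it first, for example with `CREATE TABLE ... AS SELECT`, is one option.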
The above example shows that the data query can be simplified by the hierarchical relation. After the data set is generated by the data query, it can be input into a further analysis mechanism such as data mining. Discovering deeper knowledge through the data mining process is not the focus of this thesis. Applying data mining for analysis, such as DMAS [3] for mining association rules, is discussed in the previous research [27].