
Input: The case clusters from the previous stage.

Output: A relation table among the k knowledge concepts, and the meta-knowledge of each relation.

Step 1. Resolve or eliminate the redundancies within knowledge concepts, and identify the meaning of each knowledge concept.

Step 1.1. Check each knowledge concept, and eliminate redundant cases.

Step 1.2. Explain the meaning of each cluster, and name each cluster after its corresponding concept.

Step 2. Define the interface cases for each knowledge concept:

Step 2.1. Construct an empty relationship table.

Step 2.2. Ask experts to fill in the knowledge relations according to the NORM knowledge relations.

Step 2.3. For each relation between knowledge concepts, ask experts to design the meta-rules that link the concepts and govern how they interact.

Step 3. Construct the ontology of knowledge concepts.

The procedure in this stage helps construct the relationships among knowledge concepts; in other words, these relations constitute the ontology of the domain. As mentioned, NORM is used here to represent the ontology; hence we use the Rule Class in NORM to represent concepts, and its four kinds of relationships to build up the concept ontology.
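Steps 2.1 to 2.3 can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions: the function names, the concept names, and the relation and meta-rule strings are placeholders invented for illustration, and the four NORM relationship kinds are not modeled here.

```python
from itertools import permutations

def empty_relation_table(concepts):
    """Step 2.1: one empty cell per ordered pair of distinct concepts."""
    return {(a, b): None for a, b in permutations(concepts, 2)}

def fill_relation(table, source, target, relation, meta_rule):
    """Steps 2.2-2.3: store an expert-supplied relation and its meta-rule."""
    if (source, target) not in table:
        raise KeyError((source, target))
    table[(source, target)] = {"relation": relation, "meta_rule": meta_rule}

# Placeholder concept names and labels, for illustration only.
concepts = ["concept A", "concept B", "concept C"]
table = empty_relation_table(concepts)
fill_relation(table, "concept A", "concept B",
              relation="relation kind chosen by the expert",
              meta_rule="meta-rule text written by the expert")
```

Cells that the experts leave unfilled stay `None`, which makes it easy to see which concept pairs still need review.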

5.3 Knowledge Extracting Step

So far, both the ontology among knowledge concepts and the cases of each knowledge concept have been defined. In this stage, the knowledge engineers design a grid for extracting knowledge from experts; once the grid is designed, the experts are asked to fill in the appropriate values.

The column headers of the grid are the cases to be identified in a concept, and the row headers are the union of the keywords (features) of those cases. For example, a grid for extracting knowledge about several types of intrusions may look like Table 1.

Table 1. An example grid used for knowledge acquisition

| Feature | Ping of Death | ICMP flood | IIS memory leakage |
|---|---|---|---|
| ICMP | YES | YES | NO |
| TCP | NO | NO | YES |
| Heavy Load | NO | YES | YES |
| Corrupted Packet | YES | NO | NO |
| Crashed | YES | NO | YES |

From the filled grid, rules can be obtained as knowledge. For example, one of the rules generated from the grid above is:

IF “ICMP” and “Corrupted Packet” and “Crashed”

then “Ping of Death”
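The way a rule is read off a filled grid can be sketched as follows. This is a simplified illustration, assuming that a rule for a case is the conjunction of the features marked YES in its column; it does not model the embedded meanings or certainty factors handled by EMCUD, and the names `GRID`, `rules_from_grid`, and `format_rule` are hypothetical.

```python
# Grid from Table 1: feature -> {case: True for YES, False for NO}
GRID = {
    "ICMP":             {"Ping of Death": True,  "ICMP flood": True,  "IIS memory leakage": False},
    "TCP":              {"Ping of Death": False, "ICMP flood": False, "IIS memory leakage": True},
    "Heavy Load":       {"Ping of Death": False, "ICMP flood": True,  "IIS memory leakage": True},
    "Corrupted Packet": {"Ping of Death": True,  "ICMP flood": False, "IIS memory leakage": False},
    "Crashed":          {"Ping of Death": True,  "ICMP flood": False, "IIS memory leakage": True},
}

def rules_from_grid(grid):
    """Collect, for every case (column), the features (rows) marked YES."""
    rules = {}
    for feature, row in grid.items():
        for case, is_yes in row.items():
            rules.setdefault(case, [])
            if is_yes:
                rules[case].append(feature)
    return rules

def format_rule(case, features):
    """Render one rule in the IF ... THEN ... form used in the text."""
    condition = " and ".join(f'"{f}"' for f in features)
    return f'IF {condition} THEN "{case}"'

rules = rules_from_grid(GRID)
for case, features in rules.items():
    print(format_rule(case, features))
```

For the "Ping of Death" column this yields the same condition as the rule quoted above.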

To extract rules with embedded meaning, the Embedded Meaning Capturing and Uncertainty Deciding (EMCUD) knowledge acquisition method [HT90], which is based on Personal Construct Theory, is used in this stage. Since the ontology was discovered in the previous phase, information about the relations and hierarchy among knowledge concepts is available in the knowledge extraction stage. Hence, the Two Phase Knowledge Acquisition (TpKA) mechanism [TT02] is used to extract knowledge under the given concept relations and to find more meaningful and accurate knowledge content. With TpKA, the embedded meaning and certainty factor of the knowledge are reviewed according to the knowledge hierarchy built in the previous phase.

Chapter 6 Knowledge Discovery

A rule base system is commonly used in designing a knowledge-based system, which provides suggestions for decision making in the way a domain expert would. However, since the knowledge in a rule base is usually acquired from one or a few experts, much of it reflects their own experience, and some knowledge may be missing where that experience is lacking. To make the rule base system more complete and smarter, the knowledge of general users should also be discovered and used to refine the rule base.

In modern computer systems, user activities are usually recorded in system logs, so information about user behavior is hidden in the log data. In our knowledge discovery mechanism, the logs of computer systems are used to find patterns of user behavior, which can serve as user knowledge about system operation and problem solving.

The input to our method is the set of user activity records, or logs, sorted by time.

As shown in Figure 6.1, our method consists of three phases: the Preprocessing Phase, the Two-Layer Pattern Discovering Phase, and the Pattern Explanation Phase. First, the Preprocessing Phase selects activity logs stored in the data storage and aggregates them into feature vectors, each of which represents the behavior during a short period, for further analysis. Each user's behavior can then be represented as a sequence of feature vectors. In the Two-Layer Pattern Discovering Phase, there may be millions of distinct feature vectors, which are first grouped into several clusters. In this phase, two heuristics are proposed to detect outliers, which differ markedly from normal behaviors, and these outlier clusters can be explained in the Pattern Explanation Phase. Feature vectors that are similar and represent the same behavior may thus be grouped into one cluster. In other words, each feature vector can be mapped to a cluster label by a mapping function, and each user's behavior can be transformed into a sequence of cluster labels. Next, we mine both the patterns of a single user's behavior and the common patterns of all users' behaviors. Since each pattern is represented as a sequence of clusters and each cluster has its own property set, a pattern discovered in the previous phase can be represented as a sequence of property sets, judged to be normal or abnormal, and fed back into the knowledge base in the Pattern Explanation Phase.

[Figure 6.1 labels: Activities Log Database, Packets, Preprocessing, Feature Vectors, Two-Layer Pattern Discovering, Patterns, Pattern Explanation, Known Patterns, Knowledge Base]

Figure 6.1: The Concept Diagram of Our Method.
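The step that maps each feature vector to a cluster label, and a user's behavior to a sequence of labels, can be sketched as below. The clustering method and the two outlier heuristics are not specified in this section, so this sketch assumes, purely for illustration, that cluster centroids are already known and that the mapping function picks the nearest centroid by Euclidean distance. All names and numbers are hypothetical.

```python
import math

def nearest_cluster(vector, centroids):
    """Return the label of the centroid closest to the feature vector."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))

def to_label_sequence(vectors, centroids):
    """Transform a sequence of feature vectors into a sequence of cluster labels."""
    return [nearest_cluster(v, centroids) for v in vectors]

# Assumed centroids and one user's behavior as a sequence of feature vectors.
centroids = {"c1": (0.0, 0.0), "c2": (10.0, 10.0)}
behavior = [(1.0, 0.5), (9.0, 11.0), (0.2, 0.1)]
labels = to_label_sequence(behavior, centroids)
```

Pattern mining in the later phases would then operate on sequences such as `labels` rather than on the raw feature vectors.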

6.1 Preprocessing Phase

Before presenting our method, we define the notation used in this paper. To transform the original activity logs into feature vectors, which contain more useful information, the RENUMBER SORT ALGORITHM and the PREPROCESS ALGORITHM are also discussed in this section.

6.1.1 Definitions of Original Activity Log Database

Assume there are $n$ users $u_1, u_2, \ldots, u_n$. Each $u_q$ can be represented by a unique ID, e.g., the user ID of a web system or the customer ID of a shopping center, and let $U = \{u_1, u_2, \ldots, u_n\}$.

$T = [t_0, t_0 + wc]$ is the time interval over which activity logs are collected, where $c$ is a constant, $t_0 = 0$, $t_i = t_{i-1} + c$, and $T_i = (t_{i-1}, t_i]$ for $1 \le i \le w$.

$E_i = \langle e^i_1, e^i_2, \ldots, e^i_{\alpha_i} \rangle$ is the sequence of activity logs during $T_i$, sorted in time order, and we assume $|E_i| = \alpha_i \le \alpha$ for each $i$.

'•' is the concatenation operator, i.e., $E_1 \bullet E_2 = \langle e^1_1, e^1_2, \ldots, e^1_{\alpha_1}, e^2_1, e^2_2, \ldots, e^2_{\alpha_2} \rangle$.

$E = E_1 \bullet E_2 \bullet \cdots \bullet E_w$ is the whole set of activity logs considered in $T$.

The e-id is the event identifier, defined by the triple <unique ID, action target, action>, where unique ID $\in U$, the action target is the target of the user activity, e.g., the item for sale, and the action is the action taken by the user, e.g., POST or GET.

$ID(e^i_j)$ is a function that extracts the e-id of $e^i_j$.
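The definitions above can be made concrete with a short sketch. It assumes that each log record carries a numeric timestamp measured from t0 = 0, and it drops records that fall outside (0, wc]; the record layout and the names `Activity`, `e_id`, and `split_into_intervals` are illustrative choices, not part of the original method.

```python
import math
from collections import namedtuple

# One activity log record; the field layout is an assumption for this sketch.
Activity = namedtuple("Activity", "time uid target action")

def e_id(activity):
    """ID(e): the triple <unique ID, action target, action>."""
    return (activity.uid, activity.target, activity.action)

def split_into_intervals(logs, c, w):
    """Return [E_1, ..., E_w], where E_i holds the logs with
    (i-1)*c < time <= i*c, sorted in time order."""
    intervals = [[] for _ in range(w)]
    for a in sorted(logs, key=lambda a: a.time):
        if 0 < a.time <= w * c:
            i = math.ceil(a.time / c)  # index of T_i = ((i-1)c, ic]
            intervals[i - 1].append(a)
    return intervals

logs = [
    Activity(11, "u2", "X", "GET"),
    Activity(3, "u1", "X", "GET"),
    Activity(10, "u1", "Y", "POST"),
    Activity(20, "u2", "Y", "GET"),
    Activity(25, "u1", "X", "GET"),  # outside T when c = 10 and w = 2
]
E_1, E_2 = split_into_intervals(logs, c=10, w=2)
E = E_1 + E_2  # list concatenation plays the role of the '•' operator
```

Because each interval is closed on the right, a record stamped exactly at a boundary such as time 10 belongs to the earlier interval.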

6.1.2 Renumber Sort Algorithm

Since a single activity does not carry enough information to represent user behavior, the activities with the same e-id selected from $E_i$ are first aggregated over $T_i$ and then transformed into a feature vector. However, when we do aggregation in this phase, if user A performs a GET action on web page X, and the other action user A

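Notations:

$f^i_j = \mathrm{ReNumSort}(E_i)$ is the $j$th distinct e-id during $T_i$.

$S_i = \langle f^i_1, f^i_2, \ldots, f^i_{\beta_i} \rangle$ is a sequence of feature vectors during $T_i$, where $\beta_i \le \alpha_i$.

$F_i = \{ f^i_j \mid 1 \le j \le \beta_i \}$, and $F = \bigcup_{i=1}^{w} F_i$.

$S^i_q$ is a subsequence of $S_i$ for $u_q \in U$.

$v_q = \langle S^1_q, S^2_q, \ldots, S^w_q \rangle$ is the behavior vector of $u_q$, and $V = \{ v_q \mid u_q \in U \}$.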
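A behavior vector can be assembled from the per-interval sequences as in the following sketch. It assumes that each feature vector records its e-id and that the subsequence for a user consists of the feature vectors in that interval whose unique ID belongs to the user. The dictionary layout and the name `behavior_vectors` are hypothetical.

```python
def behavior_vectors(S, users):
    """Build V as {user: v_q}, where v_q lists the user's subsequence of
    feature vectors for each interval in order.

    S is [S_1, ..., S_w]; each feature vector is a dict with an 'e_id'
    triple (unique ID, action target, action).
    """
    return {
        q: [[f for f in S_i if f["e_id"][0] == q] for S_i in S]
        for q in users
    }

S = [
    [{"e_id": ("u1", "X", "GET")}, {"e_id": ("u2", "X", "GET")}],  # S_1
    [{"e_id": ("u1", "Y", "POST")}],                               # S_2
]
V = behavior_vectors(S, ["u1", "u2"])
```

A user with no activity in an interval simply gets an empty subsequence for it, so every behavior vector has exactly w components.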

Table 6.1 presents the format of a general activity log. The Time field indicates when the logged activity occurred. The UID and TARGET fields indicate the unique ID of each user and the target item on which the action is performed, respectively. The ACTION field indicates the action taken in the activity; for example, in network traffic, the ACTION may be the destination port, which implies the service requested by the user, e.g., port 21 for FTP, port 23 for Telnet, and port 80 for HTTP. The information contained in an activity log differs from application to application; for example, logs of consumer activities may also include the sale amount and quantity, and this information can also be used in our algorithm.

Table 6.1: The Format of Standard Log Information.

Time UID TARGET ACTION … … … … …

To aggregate the activities into feature vectors, we first sort the original activity database with the RENUMBER SORT ALGORITHM to get the distinct e-ids during $T_i$, denoted $f^i_j$. For each activity during $T_i$, if a previously defined feature vector entry has the same e-id as the activity, the entry is updated by aggregating the information for that e-id. Otherwise, a new feature vector entry is created and defined.
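The update-or-create rule just described can be sketched as follows. This is not the RENUMBER SORT ALGORITHM itself: as a stand-in, distinct e-ids are kept in order of first appearance, and the aggregated information is assumed to be an occurrence count plus the first and last timestamps. The field names and the function name `aggregate_interval` are illustrative.

```python
def aggregate_interval(E_i):
    """Return one feature vector entry per distinct e-id in E_i.

    E_i is a time-ordered list of (time, uid, target, action) records.
    """
    entries = {}
    for time, uid, target, action in E_i:
        key = (uid, target, action)  # the e-id of this activity
        if key in entries:
            # An entry with this e-id exists: aggregate into it.
            entries[key]["count"] += 1
            entries[key]["last_time"] = time
        else:
            # No entry yet: create and define a new one.
            entries[key] = {"e_id": key, "count": 1,
                            "first_time": time, "last_time": time}
    return list(entries.values())

E_i = [
    (1, "A", "X", "GET"),
    (2, "B", "X", "GET"),
    (3, "A", "X", "GET"),
    (4, "A", "X", "POST"),
]
S_i = aggregate_interval(E_i)
```

Other fields mentioned in the text, such as sale amount or quantity, could be summed in the same branch that increments the count.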

The RENUMBER SORT ALGORITHM is shown as follows.

Algorithm 6.1: ReNumberSort algorithm, ReNumSort(Ei)