VODKA: Variant objects discovering knowledge acquisition

(1)

VODKA: Variant objects discovering knowledge acquisition

Shian-Shyong Tseng

a,b,*

, Shun-Chieh Lin

a

a_{Department of Computer and Information Science, National Chiao Tung University, No. 1001, Ta Hsueh Road, Hsinchu 300, Taiwan, ROC} b_{Department of Information Science and Applications, Asia University, No. 500, Lioufong Road, Wufong Shiang, Taichung 413, Taiwan, ROC}

Abstract

Many knowledge acquisition methodologies have been proposed to elicit rules systematically with embedded meaning from domain experts. But, none of these methods discusses the issue of discovering new modified objects in a traditional classification knowledge based system. For experts to sense the occurrence of new variants and revise the original rule base, to collect sufficient relevant information becomes increasingly important in the knowledge acquisition field. In this paper, the method variant objects discovering knowledge acquisition (VODKA) we proposed includes three stages (log collecting stage, knowledge learning stage, and knowledge polishing stage) to facilitate the acquisition of new inference rules for a classification knowledge based system. The originality of VODKA is to identify these new modified objects, the variants, from the way that the existing knowledge based system fails in applying some rules with low certainty degree. In this method, we try to classify the current new evolving object identified according to its attributes and their corre-sponding values. According to the analysis of the collected inference logs, one of the three recommendations (including adding a new attribute-value of an attribute, modifying the data type of an attribute, or adding a new attribute) will be suggested to help experts observe and characterize the new confirmed variants. VODKA requires E-EMCUD (extended embedded meaning capturing and uncertainty deciding). EMCUD is a knowledge acquisition system which relies on the repertory grids technique to manage object/attri-bute-values tables and to produce inferences rules from these tables. The E-EMCUD we used here is a new version of EMCUD to update existing tables by adding new objects or new attributes and to adapt the original embedded rules. Here, a computer worm detection prototype is implemented to evaluate the effectiveness of VODKA. The experimental results show that new worm variants could be discovered from inference logs to customize the corresponding detection rules for computer worms. Moreover, VODKA can be applied to the e-learning area to learn the variant learning behaviors of students and to reconstruct the teaching materials in improving the performance of e-learners.

Keywords: Knowledge acquisition; Variant discovering; EMCUD; VODKA; Computer worm

1. Introduction

As time goes on, new objects in many domains are incre-mentally evolving due to the environment changes and the knowledge explosion. Besides unchangeable objects in the course of time, some previously created objects may

become obsolete and disappear after a long period of time, and some new objects might be added or modified for adapting to the dynamic environmental changes with the time, which is the phenomenon of object evolution. In this paper, knowledge can be classified as static knowledge and dynamic knowledge according to the stability of knowledge in a dynamic environment. Static knowledge remains the same while dynamic knowledge will be updated when environment changes. For example, due to the nature of evolution, some well-known virus could evolve into a new virus which, suggested by the experts, should be sin-gled out as a new variant to classify all kinds of viruses definitely.

*

Corresponding author. Address: Department of Computer and Infor-mation Science, National Chiao Tung University, No. 1001, Ta Hsueh Road, Hsinchu 300, Taiwan, ROC. Tel.: +886 3 5712121x56658; fax: +886 3 5721490.

E-mail addresses: sstseng@cis.nctu.edu.tw (S.-S. Tseng), jielin@cis. nctu.edu.tw(S.-C. Lin).

www.elsevier.com/locate/eswa Expert Systems with Applications 36 (2009) 2433–2450

Expert Systems with Applications

(2)

As we know, a knowledge based system (KBS) is an intelligent computer program using knowledge and infer-ence procedures to solve problems which are difficult enough to require significant human expertise for the solu-tions, such as disease diagnosis, investment prediction, or computer science problems. Since the phase of knowledge acquisition is a bottleneck when constructing a KBS, many methodologies and related tools, e.g., NeoETS (Boose & Bradshaw, 1986), AQUINAS (Boose & Bradshaw, 1987), KITTEN (Shaw & Gaines, 1987), EMCUD (Hwang & Tseng, 1990), KADS (Wielinga, Schreiber, & Breuker, 1992), KANAL (Kim & Gil, 2001), OMCS-2 (Singh et al., 2002), INDUS (Caragea et al., 2005) have been pro-posed to rapidly build prototypes in the past twenty years. Most of the existing knowledge acquisition systems employ the repertory grid test, originally developed by per-sonal construct theory (PCT) for eliciting static knowledge, to identify different objects and distinguishing these objects in a selected domain. PCT can help domain experts formu-late the quality of static knowledge with/without knowl-edge engineers. But only the static knowlknowl-edge used for classifying and distinguishing well-known objects by a finite set of training cases in most the systems, such as ID3 (Quinlan, 1986), KAISER (Tsujino, Takegaki, & Nishida, 1990; Tsujino, Dabija, & Nishida, 1992), C-KAT (Zacklad & Fontaine, 1995), version space (Hong & Tseng, 1997), AVT-DTL (Zhang & Honavar, 2003). And few of them attack the topic of acquiring dynamic knowledge, in fact, the incremental learning of dynamic knowledge by continuously being aware of the variant objects as time goes on is important

Consequently, the lack of suﬃcient information about variants may result in a problem of observing the occur-rence of evolving objects for human experts. Now, how to collect suﬃcient relevant information to help experts sense the occurrence of variants and reuse the original rule base becomes one of the important issues.

Embedded meaning capturing and uncertainty deciding (EMCUD), one of a repertory grid based knowledge acqui-sition tools, was proposed to elicit the embedded meanings of knowledge (embedded rules bearing on objects and object attributes) to classify objects. Based upon an attri-bute ordering table (AOT) to record the relative impor-tance of each attribute to each object, EMCUD guides the experts to determine the certainty degree of each embedded rule, in order to extend the coverage of original rules generated by traditional repertory grid.

With different certainty degree, some variant objects modified from original objects may not be classified by the original rules but may be classified by the other embed-ded rules; even these embedembed-ded rules may have margin val-ues of certainty factor (CF) due to the weak suggestions of domain experts. Since the higher the CF value, the more reliable the results are, this kind of objects could be singled out as a variant object class because the new characteristics of these objects are emerging. Therefore, if the variants could be detected and recommended as the new objects

by experts, the related ambiguous attributes (minor attri-butes) could be reﬁned or new attributes could be added to improve the classiﬁcation ability in a KBS. These attri-butes might result in the marginally acceptable CF values of weak embedded rules.

In this paper, a new iterative methodology, variant objects discovering knowledge acquisition (VODKA) is proposed to learn new variant objects based upon monitor-ing the inference behaviors of weak embedded rules. We focus our attention on incrementally collecting new cases of modified objects and learning the dynamic knowledge to modify the original classification KBS. In order to clearly single a new variant object out, the previous meth-odologies need to be applied recurrently, and therefore the originally elicited knowledge might not be easily reused. The goal of VODKA is to facilitate the acquisition of new inference rules of the new modified objects for a clas-sification KBS, from which new modified objects will be identified from the attribute values. The new rules, gener-ated by VODKA, will be able to cope with the new variant objects. They are similar to previously known embedded rules in the KBS.

Since the relative importance of each attribute can be represented by using AOT, some minor attributes will be relaxed or ignored in order to capture the embedded mean-ings of embedded rules with acceptable CF values in EMCUD. In other words, these kinds of attributes are not used to classify the object. New variants classiﬁed by weak embedded rules should be singled out of original objects. Obtained from the collection of minor attribute-value pairs, the ﬁred frequency of weak embedded rules is a big help for the experts to discover new information of variant objects. These attribute-value pairs are then to be provided to help experts determine whether to single variant objects out or not.

According to the analysis of the collected inference logs, one of the three different recommendations: adding a new attribute-value of a minor attribute, modifying the data type of a minor attribute, or adding a new attribute, will be suggested to help experts characterize the corresponding rules to refine the KBS if a new variant object is confirmed to be singled out.

To evaluate the performance of VODKA, a simple com-puter worm detection prototype system with VODKA and worm classification embedded rule base based upon DRAMA (Lin, Tseng, & Tsai, 2003) has been implemented to discover the new variant worms generated by an attack-ing traffic generator. Experts can be helped by VODKA to quickly and easily single out the new variant objects and customize the detection rules for computer worms. Also, VODKA has been applied on an e-learning domain to learn the variant learning behaviors of e-learners, which shows that VODKA is a big help for teachers to prepare or to reconstruct the teaching materials, the variant course materials, for these e-learners to easily understand the learning materials. In a word, VODKA can enhance the classification ability for new objects of a KBS since the

(3)

new evolving objects can be incrementally learned and dis-covered by collecting the suﬃcient information.

2. Related work

Several knowledge acquisition methodologies and related systems are introduced in this section ﬁrst. Then repertory grid, one of the popular indirect knowledge acquisition techniques, is also discussed. Finally, the elicitation of embedded meaning and some problems of traditional knowledge acquisition methodologies are discussed.

2.1. Knowledge acquisition systems

As we know, knowledge in many domains and the expe-rience of domain experts is continuously growing. Many knowledge acquisition methodologies have been proposed to help knowledge engineers acquire the useful knowledge and then to transfer this knowledge into a knowledge base or other computerized representation forms. In general, there are three approaches for knowledge acquisition (Crowther & Hartnett, 1996; Hwang & Tseng, 1990; Mcgraw & Harbison-Briggs, 1989):

(1) Interviewing experts by experienced knowledge engineers: formalizing the elicited knowledge after interviewing experts. However, it is usually time-con-suming if the communication between domain experts and knowledge engineers is insuﬃcient. (2) Machine learning: learning the knowledge by

collect-ing many useful cases and instances with/without the involvement of domain experts. However, the quality of the results usually relies on the selected training cases.

(3) Knowledge acquisition systems: assisting domain experts in generating useful rules using knowledge acquisition systems with/without the help of knowl-edge engineers. These tools could reduce the eﬀort of communication between knowledge engineers and domain experts and could reduce the risk and diﬃculty of selecting the suitable training cases. In the past decades, many knowledge acquisition sys-tems, e.g., NeoETS (Boose & Bradshaw, 1986), AQUI-NAS (Boose & Bradshaw, 1987), KITTEN (Shaw & Gaines, 1987), EMCUD (Hwang & Tseng, 1990), KADS (Wielinga et al., 1992), MCRDR (Kang, 1996), KAMET (Cairo, 1998), MedFrame/CADIAG-IV (Boegl, 1997; Kolousek, 1997; Leitich et al., 2001), KANAL (Kim & Gil, 2001), OMCS-2 (Singh et al., 2002), INDUS (Caragea et al., 2005) have been developed to build prototypes and to iteratively elicit the knowledge from domain experts. However, most of them can not be used to construct the dynamic knowledge to classify the variant objects in a dynamic environment using the previous obtained attri-butes. For acquiring the dynamic knowledge, the

tradi-tional approach needs to rerun their knowledge acquisition process.

Ontology, which defines concepts and the relationships between concepts, was indicated to be useful in a knowl-edge acquisition process in recent years. In computer sci-ence area, the ontology is a conceptualized data structure to be used in knowledge systems. Based on the same ontol-ogy, different systems can communicate with each other, or the knowledge inside computer systems may be structured and presented more accurately. ‘‘An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms” (Uschold, King, Moralee, & Zorgios, 1998).

To build up the domain ontology, identifying the domain concepts is the ﬁrst step. Moreover, a systematical approach to acquire the relations of concepts from exper-tise is also a key factor for ontology construction. In recent years, due to the increasing requirement for inducing domain knowledge into computer systems (Kitamura & Mizoguchi, 2003; Noy & Sintek, 2001), many researches (Alani & Kim, 2003; Eriksson, 2003; Fensel & Angele, 1988; Frank & Farquhar, 1999; Leibbink, Witteman, & Mayer, 2002; Maedche & Staab, 2001) were proposed to discover, represent, and use of ontologies. Especially in KBSs, ontology becomes a key factor to build a successful knowledge base with more meaningful knowledge content for users. Moreover, the software agents technique is used in knowledge acquisition systems to acquire knowledge from autonomous, distributed, and semantically heteroge-neous data sources with a domain speciﬁc ontology (Cara-gea, Silvescu, & Honavar, 2001; Caragea et al., 2005; Rosaci, 2005; Rosaci, Terracina, & Ursino, 2004).

However, ontology used in these researches was usually assumed to be existent or be able to be obtained directly from experts. However, in real cases, when a knowledge engineer starts at designing a knowledge acquisition pro-cess, the ontology of the domain concepts is usually not available yet. It is usually a time-consuming process to build an ontology only by interviewing a domain expert. 2.2. Repertory grid methodology and relevant systems

Repertory grid, based on Kelly’s personal construct theory (Kelly, 1955) which reports how people make sense of the world, could be used as an eﬃcient knowledge acquisition technique in identifying diﬀerent objects and distinguishing these objects in a domain. It is the basis of several computer assisted knowledge acquisition tools, such as ETS (Boose, 1984, 1985), AQUINAS (Boose & Bradshaw, 1987) and KSSO (Gaines, 1987).

A single repertory grid represented as a matrix whose columns have element objects (labels) and whose rows have construct’s attributes (labels) can classify a class of objects, or individuals. The value assigned to an element-construct

(4)

pair need not be Boolean. Grid values have numeric rat-ings, probabilities, and other characteristics, where each value reﬂects a degree. Then, the expert is asked to ﬁll the grid with 5-scale ratings, where ‘‘1” represents the most relevant attribute to the object; ‘‘2” represents that the attribute may be relevant to the object; ‘‘3” represents ‘‘unknown” or ‘‘no relevance”; ‘‘4” represents that the object may have the opposite characteristic; ‘‘5” represents the most relevant opposite characteristic to the object. The whole concept of repertory grid technique can be described with the following steps:

(1) Elicit all of the element objects, e.g., E1, E2, E3, E4, E5

from the expert.

(2) Elicit the construct attributes (and their opposites), e.g., C1; C2; C3; C4ðC10; C02; C03; C04Þ, from the expert.

Each time three elements are chosen to ask for a con-struct to distinguish one element from the other two. (3) Rate all of the [element, construct] entries of the grid with value range from 1 to 5. An illustrative example is given inTable 1.

As repertory grid technique has been widely used by researchers, some extensions have been made to enrich its representative ability for covering more knowledge, the value assigned to an element-construct pair may be Bool-ean, numeric ratings, probabilities, etc. For example,Dixit and Pindyck (1994), andHwang (1995)extended the reper-tory grid technique to the fuzzy table, in which constructs were fuzzy attributes that could be rated by means of fuzzy linguistic terms from a finite set. Castro-Schez, Jennings, Luo, and Shadbolt (2004) developed a technique using a fuzzy repertory grid for acquiring the finite set of attributes or variables that the expert uses in a classification problem to characterize and discriminate a set of elements.

Moreover, several models have been proposed for han-dling uncertainties in expert systems through generating more meaningful rules from the repertory grid oriented approaches. EMYCIN certainty factor (CF) model was first used to determine the degree of the belief of a rule for uncertain reasoning (Shortliffe & Buchanan, 1975). Embedded meaning capturing and uncertainty deciding (EMCUD) knowledge acquisition system was proposed to extract rules with embedded meaning from hierarchical repertory girds by defining the impacts of the constructs of each element (Hwang & Tseng, 1990) and was success-fully applied in a medical diagnostic system of acute exan-thema in Taiwan (Hwang & Tseng, 1991). WebGrid (Shaw

& Gaines, 1996), Calgary’s web-based knowledge modeling and inference tool, is based on repertory grid elicitation and analysis. Blythe, Kim, Ramachandran, and Gil (2001) proposed an acquisition interface that integrates previously developed techniques to guide users to set con-straints in diﬀerent aspects of knowledge acquisition.

Although these methodologies are proposed to extend the ability of uncertain reasoning to classify the well-known objects, none of them discusses the issue of discov-ering and classifying new variant objects. It is also diﬃcult for experts to sense the occurrence of variant knowledge, which is modiﬁed in the dynamic environment as time goes on. Therefore, a new knowledge acquisition system based upon EMCUD is proposed in this paper to guide domain experts to create additional attributes for classifying new variant objects through the observations of the interested inference results.

2.3. Elicitation of embedded meanings

The embedded meanings referred to here represent the information that domain experts take for granted but which is implicit to the people who are not familiar with the application domain. The lack of embedded meaning will probably make an expert system fail to infer some cases being trivial to experts. SEEK (Politakis & Weiss, 1984) and SEEK2 (Ginsberg, Weiss, & Politakis, 1988) have been proposed to obtain embedded meanings by some efficient refinement processes. However, the major problem of SEEK and SEEK2 is that the case database is assumed to be available although it is difficult to collect sufficient cases in some applications. Moreover, it would be also time-consuming and boring for experts to offer a conclu-sion for each case in the database before starting the refine-ment procedure. Thus, EMCUD (Hwang & Tseng, 1990) is proposed to elicit the embedded meanings of knowledge from the existing hierarchical repertory grids given by experts. Additionally, it will also guide experts to determine the certainty degree of each rule with embedded meaning for extending the coverage of generated original rules.

To capture the embedded meanings of the resulting grids, the Attribute Ordering Table (AOT), which is used to record the relative importance of each attribute to each object, is employed. The values in each AOT entry, a pair of attribute and object, may be labeled ‘‘X”, ‘‘D” or an integer number. ‘‘X” means no relationship existing between the attribute and the object. ‘‘D” means that the attribute dominates the object, i.e., if the attribute is not equal to the entry value, it is impossible for the object to be implied. Integer numbers are used to represent the rela-tive importance degree of the attribute of the object instead of dominating the corresponding object. If the attribute does not equal the attribute-value, it is still for the object to be implied. A larger integer number implies that attri-butes must be more important to the object.

Using AOT, the original rules generate some rules with embedded meaning, and the CF value of each rule, which is

Table 1

The illustrative example of a repertory grid with ratings

Element construct E1 E2 E3 E4 E5

C1 5 1 5 1 1 C01

C2 4 4 4 1 4 C02

C3 1 4 5 1 4 C03

(5)

between 1 and 1, could be determined to indicate the degree of supporting the inference result. The higher the CF value, the more reliable the results are. The EMCUD algorithm is listed as follows.

Algorithm 1. EMCUD algorithm Input: The hierarchical grids.

Output: The guiding rules with embedded meaning. Step 1: Build the corresponding AOT with each grid of the hierarchical multiple grids.

Step 2: Generate the possible rules with embedded meaning.

Step 3: Select the accepted rules with embedded meaning through the interaction with experts.

Step 4: Generate automatically the CF of each rule with embedded meaning.

All rules generated by EMCUD can be categorized into two classes: original and embedded rules with acceptable CF value, and discarded rules with unacceptable CF value, according to the conﬁdence degree of domain experts. To determine the CF value of each embedded rule, we have to ﬁrstly determine the upper and the lower bounds of CF values of accepted embedded rules. CF value of each rule can be automatically determined by a fuzzy mapping function. Thus, the useful embedded rules with acceptable CF values could be used to cover more uncertainty cases.

Since embedded rules with weak acceptable CF values (the CF values below a user defined threshold) usually mean that domain experts might lack strong confidence, objects matching weak embedded rules derived from origi-nal objects may be the candidates for new variants. For example, the object satisfying the conditions (attribute-value pairs) of the embedded rules with CF = 0.5 means the expert might suggest that it would be marginally classi-fied into the object class and the minor attributes of the embedded rule might be not clearly defined. Therefore, the fired frequencies of this kind of weak embedded rules should be used to discover the occurrence of new variant objects.

2.4. Problems with conventional knowledge acquisition methods

With the changing environment, the adaptation of the acquired rules should be required to cope with new vari-ants. However, experts may not be aware of the occurrence of new variant candidates and may have insuﬃcient evi-dence to construct the knowledge of the variants using con-ventional repertory grid approaches. Although EMCUD could be used to generate more useful embedded rules for covering more similar objects in extended object classes, it still lacks the ability of grid evolution for singling these new variants out; e.g., EMCUD should manually regener-ate the original and embedded rules to classify these variant objects by interacting with domain experts after collecting

suﬃcient information about these variants. Therefore, enhancing the adaptation ability of embedded rules becomes increasingly important to achieve the ability of grid evolution in a classiﬁcation KBS.

In this paper, the embedded rules from EMCUD are categorized into three classes: the original rules with strong CF values, the embedded rules with marginally acceptable CF values, and the discarded rules with low CF values. Hence, a new knowledge acquisition methodology is pro-posed to discover the occurrence of new variant objects using the ﬁred frequency of embedded rules with margin-ally acceptable CF values. A simple computer worm detec-tion prototype in Example 1 is used to illustrate the inability for discovering variants using EMCUD.

Example 1. The example of classifying four computer worms.

In recent years, the number of computer worms is dramatically increasing to threaten the reliability of Inter-net.Table 2 shows the acquisition table of four computer worms (Kienzle & Elder, 2003; Mirkovic, Martin, & Reiher, 2002; Moore, Shannon, Voelker, & Savage, 2003; Weaver, Paxson, Staniford, Cunningham, & Maulik, 2003) including Nimda, CodeRed, Blaster, and Welchia using five attributes including 100-thread, System reboot, DoS type, Email attached file, and TCP port. The 100-thread means 100-threads with Boolean are simultaneously exe-cuted by one program. The system reboot Boolean attribute will be set to True if the system has been automatically rebooted. The attacking methodologies of worms could be classified into one kind of DoS type with String attribute (Mirkovic et al., 2002). The email attached file attribute with Set data type is also a useful attribute to classify these worms. Most of the worms could communi-cate with each other using different TCP port with Set data type.

An example of constructing an AOT table from the acquisition table shown inTable 2is given as follows:

EMCUD: If DoS type is not equal to Email ﬂood, is it possible for Nimda to be implied?

EXPERT: No.

Table 2

The acquisition table of four computer worms

Attribute Object

Nimda CodeRed Blaster Welchia

100-thread (A1) X True X X

System reboot (A2)

X True True True

DoS type (A3) Email ﬂood TCP ﬂood Windows

Update flood ICMP flood Email attached file (A4) {sample.exe; puta!!scr} X X X TCP port (A5) X {80} {135; 4444} {80; 135}

(6)

The answer means the DoS type dominate Nimda, and hence AOT [Nimda, DoS type] = ‘‘D”.

EMCUD: If Email attached ﬁle is not equal to any ele-ment of {sample.exe, puta!!scr}, is it possible for Nimda to be implied?

EXPERT: YES.

The answer means that Email attached file does not dominate Nimda. The questions for 100-thread and Nimda will not be asked, since the entry [Nimda, 100-thread] is labeled ‘‘X”. Therefore, the entry AOT [Nimda, 100-thread] is labeled ‘‘X”, too. This is the same for AOT [Nimda, System reboot] and AOT [Nimda, TCP port] entries. The entry AOT [Nimda, Email attached file] is set to be 1, since the Email attached file is the only attribute that does not dominate Nimda. If there are more than one attributes do not dominate the object, e.g., the System reboot, the DoS type, and the TCP port do not dominate Blaster, the following questions will be asked by EMCUD.

(1) Is System reboot more important than DoS type? (2) Is System reboot less important than DoS type? (3) Is System reboot as important as DoS type?

The expert indicates that System reboot is as important as DoS type to Blaster. Moreover, the expert also indicates

that System reboot is more important than TCP port to Blaster; and hence the entries AOT [Blaster, System reboot] = AOT [Blaster, DoS type] = 2 and AOT [Blaster, TCP port] = 1. After each entry value of AOT is deter-mined as shown inTable 3, the embedded meaning implied by the AOT can be extracted.

Now we use the ﬁrst column of Table 2 to show the information implied by an AOT. The column expresses the following meanings:

(1) A3 dominates Nimda: If A3 is not equal to Email

ﬂood, it is impossible for Nimda to be implied. (2) A4does not dominate Nimda: If A4is neither equal to

sample.exe nor puta!!scr, it is still possible for Nimda to be implied.

In practice, the hierarchy rules can be generated while hierarchical grids are given. To simplify the discussion, Table 4 shows partial detection rules (simple rules) of a classiﬁcation KBS based upon the Tables 2 and 3 to classify these worms using ﬁve attributes in single grids. Ri,j

represents the jth highest rank of CF in object i, and the highest rank is 0. The R1,0is the original rule of Nimda to

classify the original Nimda objects and R1,1is the

embed-ded rule of Nimda to classify the extenembed-ded Nimda objects. The Mask Table of minor attributes shown in Table 5 indicates the minor attributes for all embedded rules. Each row in Mask Table is a bit vector of attributes, where the ith bit is set to 1 representing the ith minor attribute that is negated or ignored. For example, the M2,3 (0, 1, 1, 0, 0)

means the 2nd and 3rd minor attributes in R2,3are ignored.

On the Internet, each worm can be represented as a set of attribute-value pairs. We can automatically collect such attribute-value pairs and feed them into the classiﬁcation KBS to classify them in the suitable category. Since new worms might have been derived from old worms, the diﬀerence between their attribute values seems to be slight. As mentioned above, EMCUD could generate lots of

Table 3

The AOT table of four computer worms

Attributes Object

Nimda CodeRed Blaster Welchia

A1 X 2 X X A2 X 1 2 2 A3 D 1 2 1 A4 1 X X X A5 X X 1 2 Table 4

Partial detection rules generated by EMCUD

Rule # Conditions Conclusion CF

A1 A2 A3 A4 A5 Object

R1,0 – – Email ﬂood (sample.exe; puta!!scr) – Nimda 0.8

R1,1 – – Email ﬂood : (sample.exe; puta!!scr) – Nimda 0.4

R2,0 True True TCP ﬂood – – CodeRed 0.8

R2,1 True False TCP ﬂood – – CodeRed 0.6

R2,2 False True TCP ﬂood – – CodeRed 0.4

R2,3 True False : (TCP ﬂood) – – CodeRed 0.4

R3,0 – True Windows update ﬂood – {135; 4444} Blaster 0.7

R3,1 – True Windows update ﬂood – :{135; 4444} Blaster 0.57

R3,2 – False Windows update ﬂood – {135; 4444} Blaster 0.43

R3,3 – True : (Windows update ﬂood) – {135; 4444} Blaster 0.43

R3,4 – False Windows update ﬂood – :{135; 4444} Blaster 0.3

R4,0 – True ICMP ﬂood – {80; 135} Welchia 0.8

R4,1 – True : (ICMP ﬂood) – {80; 135} Welchia 0.67

R4,2 – True ICMP ﬂood – :{80; 135} Welchia 0.53

(7)

embedded rules with diﬀerent CF values for accommodat-ing the knowledge of the changed worms due to the property of minor attributes; e.g., R1,1‘‘IF (DoS type =

E-mail flood) AND : (Email attached file = (sample.exe; puta!!scr)) THEN Nimda”, a marginally acceptable embed-ded rule with CF = 0.4, may be fired by a new Nimda variant which is treated as a member of original Nimda class. If this rule has been fired frequently due to a specific value of the attribute ‘‘email attached file”: ‘‘readme.exe” (more evidence of the occurrence of the candidates of Nimda variants have been gathered), a new original rule and an embedded rule could be generated: ‘‘IF (DoS type = Email flood) AND (Email attached file = read-me.exe) THEN Nimda.B”, a subset of extended Nimda object class namely Nimda.B, with CF = 0.8 and ‘‘IF (DoS

type = Email ﬂood) AND : (Email attached ﬁle = read-me.exe) THEN Nimda.B” with CF = 0.5. These rules help to single the Nimda.B class out of the extended Nimda object class.

3. Knowledge acquisition by discovering variant objects Although EMCUD and other similar approaches could be manually rerun to acquire variant knowledge from domain experts to classify new variant objects, it might be costly and hard to obtain the knowledge due to the insuﬃcient information about variants. To simplify our discussion, let’s assume some objects in O1 class belong

to the original object class (OO1) of O1, which can be

clas-siﬁed by original rules of O1. The other objects in O1class

classiﬁed by embedded rules of O1belong to the extended

object class (EO1) of O1, where OO1 EO1. In the EO1,

some modiﬁed objects can be classiﬁed by the embedded rules of O1 with weak CF values, which are singled out

to be a variant object class (VO1) of O1with the signiﬁcant

attributes emerged from minor attributes. That is, VO1 EO1and VO1\ Oi= /, where 1 6 i 6 m, and m is

the number of distinct object classes. Because the embed-ded rules with diverse CF values represent diﬀerent sup-ports to classify objects, the ones with marginally acceptable CF values might be triggered by some candi-dates of new variant classes. Therefore, our idea is to ana-lyze the behaviors of weak embedded rules (the weak suggestions by experts) to construct the new variants acqui-sition table for extracting new variant knowledge.

In the following, we will propose a new iterative knowl-edge acquisition methodology, variant objects discovering

Table 5

The Mask Table of ignored attributes

Mask # A1 A2 A3 A4 A5 M1,0 0 0 0 0 0 M1,1 0 0 0 1 0 M2,0 0 0 0 0 0 M2,1 0 1 0 0 0 M2,2 1 0 0 0 0 M2,3 0 1 1 0 0 M3,0 0 0 0 0 0 M3,1 0 0 0 0 1 M3,2 0 1 0 0 0 M3,3 0 0 1 0 0 M3,4 0 1 0 0 1 M4,0 0 0 0 0 0 M4,1 0 1 0 0 0 M4,2 0 0 0 0 1 M4,3 0 1 0 0 1 Embedded Rule Base Real Instances Inference Engine (DRAMA) Inference Log Experts New Variants Acquisition Stage III: Knowledge Polishing Stage I: Log Collecting Stage II: Knowledge Learning Meta Knowledge New Variants Acquisition Table User Inter-face User Adjust Extended EMCUD Main Acquisition Table Update

(8)

knowledge acquisition (VODKA) which is an iterative pro-cess as shown inFig. 1, to provide the ability of grid evo-lution, where each iteration consists of three stages, log collecting stage, knowledge learning stage, and knowledge polishing stage, to analyze the inference behaviors of the embedded rules in a KBS.

Firstly, the embedded rule base will be created according to the original main acquisition table using EMCUD or VODKA. Then the inference behaviors (facts/attribute-value pairs) will be collected iteratively to discover the can-didates of the variants during Stage I according to the meta knowledge. The ignored attribute-value pair of the minor attribute will be treated as an item and a set of ignored attribute-value pairs will be treated as a transaction to dis-cover the association between interesting attribute-value pairs in Stage II. Consequently, the new variants acquisi-tion table based on the discovered knowledge could be gen-erated by interacting with domain experts through the new variants acquisition procedure. Finally, the rules applicable to new variants will be incrementally generated and the main acquisition table will be iteratively adjusted using E-EMCUD in Stage III. The algorithm of VODKA is shown as follows.

Algorithm 2. The algorithm of VODKA

Input: The original main acquisition table T and embed-ded rule base RB.

Output: The rules with embedded meaning about variants.

Stage I: Collect all facts of the weak embedded rules as real inference log of the RB.

Stage II: Generate the new variants acquisition table T0. Step 1: Discover large itemsets L using the inference log.

Step 2: Generate T0_{using L and additional attributes}

provided by experts.

Stage III: Use E-EMCUD to generate rules of new variants.

Step 1: Generate rules according to T0_.

Step 2: Merge T0into original main acquisition table T.

3.1. Meta knowledge in log collecting stage

Without loss of generality, let’s assume that there are k attributes to classify m objects in the main acquisition table. Thus, the total number of embedded rules is limited. In order to assist domain experts in noticing and analyzing the occurrence of the candidates of variant objects, the fol-lowing four meta rules are used to collect the inference log (fact/raw data) of weak embedded rules to help experts sense the occurrence of new variants.

MR1: IF Ri,j is ﬁred THEN Increase Ci,jby one.

MR2: IF CF(Ri,j) 6 THCF, THEN Log Ri,j.

MR3: IF Ci,jPTHcnt AND CF(Ri,j) 6 THCF THEN

Run New Variants Acquisition Algorithm to acquire the new variants acquisition table and Reset TimeOut. MR4: IF TimeOut = THPeriodTHEN Run New Variants

Acquisition Algorithm and Reset TimeOut.

The meta rule MR1is used to count the ﬁred frequency

of each embedded rule (Ci,j). The meta rule MR2 means

that all facts (attribute-value pairs) of the embedded rules with marginally acceptable CF lower than strong CF bound threshold (THCF) are logged as a record, (Ri,j,

A1, A2, . . . , Ak, CF(Ri,j)). The meta rule MR3 means that

if there exists one weak embedded rule with ﬁred frequency exceeding the warning line threshold (THCNT), new

vari-ants may be discovered iteratively using VODKA. The meta rule MR4means that VODKA will be executed

peri-odically to refresh the new variants acquisition table. The TimeOut will be reset when MR3or MR4is triggered.

3.2. New variants discovering methodology at the knowledge learning stage

At the knowledge learning stage, the new variants acqui-sition table will be generated through interacting with domain experts based upon the observation of inference log. An ignored attribute-value pair, i.e., (DoS type = TCP ﬂood), is treated as an item and the transaction is repre-sented as a set of ignored attribute-value pairs, i.e., {(DoS type = TCP ﬂood), (TCP port = {135; 4444})}. The inference log could be automatically transformed into the transaction database (D) and the item set (I) using the Mask Table of ignored attributes. In order to obtain the candidates of new variants, we apply Apriori algorithm (Agrawal, Imielinksi, & Swami, 1993) to discover large itemsets (L) that will provide more useful information.

After generating the large itemsets, new variants acqui-sition table might be elicited following the new variants acquisition algorithm. The new objects using unclear attri-butes would be singled out accordingly, if the experts reconfirm the addition of the new variant object. Thus, one of three recommendations including adding a new attribute value to a minor attribute, modifying the data type of a minor attribute, adding a new attribute, will be further given to adjust the main acquisition table. If a new changing object is singled out, the new value of the minor attribute could be added to characterize of new objects. If the initial data type of a certain attribute is too rough to describe the object, a superior data type is rec-ommended and the values of the attribute in both original object and variant should be modified. For example, the BOOLEAN data type may be refined to SINGLE VALUE data type (Hwang & Tseng, 1990). If changing the data type still can not discriminate the new variants from origi-nal objects, acquiring a new attribute from domain experts will be suggested in VODKA. Thus, the new variants acquisition table will be created iteratively using the discov-ered large itemsets inAlgorithm 3. However, adding a new

(9)

attribute, which is very time-consuming because it results in creating a new row in new variant acquisition table, is the last choice for classifying variant objects.

Algorithm 3. New variants acquisition algorithm

Input: Inference log and the main acquisition table T, the minimal support d.

Output: The new variant object class VO, new attribute set AN, and new variants acquisition table T0_.

Step 1: Transform inference log into the transaction data set D.

Step 2: Discover large itemsets L by d using D.

Step 3: For each large itemset, ask experts to determine whether it belongs to a new variant or not.

Step 4: If a new variant is conﬁrmed, ask experts to acquire the related information about this new variant.

Store VOnewin VO.

Add a new column to represent the new variant VOnewin T0.

Ask experts to conﬁrm whether changing the data type of attribute Aiis needed or not, where

1 6 i 6 k, and k is the number of attributes.

Step 4.1: If no data type needs to be changed, Suggest Recommendation I.

Add a new value of Aiof the VOnew.

Else, ask experts to conﬁrm whether adding a new attribute is needed or not.

Step 4.2: If no new attribute needs to be added, Sug-gest Recommendation II to modify the data type of Ai.

Ask experts to acquire the mapping function of values between original and new data types. Add a column in T0_{to represent the original object}

with new mapping values. Else, Suggest Recommendation III.

Ask experts to acquire the values of the new attri-bute Anew.

Add a new row in T0_{to represent the new variant}

Anew.

Store Anewin AN.

Add a column in T0_{to represent the original object}

with new attribute value if needed. Step 5: Return VO, AN, T0.

Besides interaction with domain experts, the computa-tional cost of the algorithm is dependent on Step 2. The size of collecting inference log and minimal support threshold setting will affect the computational cost. For different types of inference logs, different learning algorithms can be selected to learn and discover the candidate behaviors of variants.

Example 2. The variant learning example of a Blaster worm

In this example, let’s assume that the ﬁred sequence of some embedded rules of Blaster worms with marginally acceptable CF values is given inTable 6.

Let’s assume minimal support is set to 30%; the large itemsets L including L1= {(A2= False); (A5= X)} and

L2= {(A2= False, A5= X)} will be provided to experts for

further recommendation; i.e., L will be used to generate the new variants acquisition table T0 according to the recom-mendations suggested by VODKA.

VODKA: Does the attribute-value pair (A2= False)

belong to any new variant object?

EXPERT: Yes. /*It means that a new variant contains the selected attribute-value pair (A2= False). Otherwise,

the large itemset is discarded and another large itemset is chosen to be examined. */

VODKA: What is the name of the new variant object? EXPERT: VOnew.

A new column will be added in T0 to represent the variant, VOnew, separated from the original object.

VODKA: Is the data type of A2required to be changed?

EXPERT: No. /* It means the data type does not need to change after adding new variant (Recommenda-tion I). Otherwise, VODKA will ask the following ques-tions. */

The row representing A2with new attribute value and one

column representing the variant object will be created in T0_. VODKA: Is any new attribute required to be added?

Table 6

The partial inference logs of Blaster

Rule # A1 A2 A3 A4 A5 Object CF

R3,4 17 False Windows Update ﬂood – X Blaster 0.3

R3,2 100 False Windows Update ﬂood – {135} Blaster 0.43

(10)

EXPERT: No. /* It means no need to add new attribute and Recommendation II is then suggested. Otherwise, VODKA will suggest Recommendation III. */

Recommendation II.

VODKA: What is the new name and new value set of the attribute A2?

EXPERT: NDTnew, VDTnew. /* For each old value in A2,

VODKA will ask experts to deﬁne the mapping between old and new value sets. */

The row representing A2with new mapping values and two

columns representing the original and variant objects will be created in T0_.

Recommendation III

VODKA: What is the name and value set of the new attribute-value pair?

EXPERT: Anew, AVnew. /* VODKA will ask experts to

provide a set of values (AVnew) of the new attribute

Anew. */

A new row representing the useful attribute namely Anew

with a set of value (AVnew) and two columns representing

original object and new variant will be added in T0_. _If

all large itemsets are conﬁrmed, the new variant acquisition table T0_{can be used to generate embedded rules of}

discov-ered variants in the knowledge polishing stage. 3.3. Knowledge polishing using extended EMCUD

Based upon the new variants acquisition table, we pro-pose extended EMCUD (E-EMCUD) algorithm as shown inAlgorithm 4to generate new embedded rules and adjust the original embedded rule base. Therefore, we can grace-fully update the embedded rule base using the small new variants acquisition table instead of using the whole large main acquisition table.

Algorithm 4. The E-EMCUD algorithm

Input: The original main acquisition table T, rule base RB, the new variants acquisition table T0_{, variant}

objects VO, and new attributes AN, and THCF.

Output: The adjusted RB including rules with embedded meaning of new variants.

Step 1: Build the corresponding AOT of T0_.

Step 2: Generate the embedded rules with embedded meaning of T0_.

Step 3: Generate the embedded rules with CF values including the original rules with strong CF and embed-ded rules with marginally acceptable CF.

Step 4: Merge the new variants acquisition table T0into main acquisition table T.

Step 4.1: Append VO and AN as new columns and rows in T, respectively.

Step 4.2: Ask experts to ﬁll the values of the modiﬁed attributes of other objects in T if necessary.

Step 4.3: Ask experts to examine the values of the new attributes of other objects in T if necessary. Step 5: Reset the new variants acquisition table T0.

Based upon the AOT table generated in Step 1, the embedded rules generated in Step 2 will be classiﬁed into original and embedded rules in Step 3. Then, the new vari-ants acquisition table will be merged with the main acqui-sition table in Step 4. Finally, the new variants acquiacqui-sition table will be reset in Step 5 to learn other variants. 3.4. The analysis of VODKA

The cost of running VODKA can be divided into two categories: computational cost and interaction cost. Assume there are k attributes to classify m objects in the original main acquisition table, where the grid size is k*_{m. For simplifying our discussion, we use ER}

k,mto

rep-resent the total number of embedded rules in the classiﬁca-tion KBS, where ERk,m< m*2

k

.

During the log collection stage, assume n instances are matched by the classiﬁcation KBS. For each instance, it has Peprobability to be classiﬁed by weak embedded rules;

hence the size of interesting inference log database in VODKA is n*_P

e.

During the knowledge learning stage, the computational cost is dominated by the learning algorithm we selected. In this paper, Apriori algorithm is used to learn the candi-dates of variant worms. Hence, the computational cost is O(Aproori). For example, if the size of database has n transactions (each transaction has k attributes) and the maximal length of large itemsets is len, then the time com-plexity of traditional Apriori algorithm is O(n*_k*_len).

Assume L large itemsets are discovered and used to notify experts to determine the existence of the variants. For each embedded rule, assume it has Pv probability to

evolve a variant; hence P

vERk;m, denoted V, variants might

be discovered. Therefore, the order of interaction with experts is V, where V < L.

During the knowledge polishing stage, the E-EMCUD integrates new acquisition table into original acquisition table. The computational cost of E-EMCUD for generat-ing rules is dependent on the size of new acquisition table (GRID), denoted O(GRID). For example, using our E-EMCUD to generate embedded rules, it costs 0.05 ms 0.15 ms to generate one rule.Fig. 2 shows that the time for generating rules using different grid sizes. The computa-tional time is approximately linear growing when setting a fixed attribute number with different numbers of objects in Fig. 2a, and the growth rate is exponential when setting a fixed object number with different numbers of attributes inFig. 2b.

(11)

In short conclusion, the cost of VODKA consists two parts: computational cost and interaction cost. The size of interesting inference log database: n*_P

e.

Computational cost: O(Apriori) + O(GRID). Interaction cost: V.

4. VODKA implementation and case studies

Two case studies including computer worms and e-learning domain are used to evaluate the performance of VODKA.

4.1. VODKA implementation

VODKA is implemented by DRAMA (Lin et al., 2003), a new object-oriented rule based system platform imple-mented using pure Java language, to refine the embedded rule base by observing the inference behaviors of weak embedded rules. It includes DRAMA server, console, knowledge extractor, and rule editor. Also, it provides an application programming interface (API) to access DRAMA server in DRAMA integrated systems. There are four basic relations between knowledge concepts defined in DRAMA: Reference, Extension-of, Trigger and Acquire. The Reference relation represents the associ-ation of two different knowledge classes (KCs) if the KCs have common piece of knowledge, which is useful for using original knowledge to construct new knowledge. Exten-sion-of relation is used to extend or modify the KC con-structed by other people, which is useful for knowledge sharing and exchanging. The Trigger and Acquire relations are used to represent the interaction of different KCs. The log collecting stage is encoded by four meta-rules described in Section 3.1 in DRAMA; the knowledge learning stage and E-EMCUD are implemented using the JSP to make a communication channel using the API provided by DRAMA.

4.2. The case study of computer worms

With the rapid development of network technology, the network security becomes one of the most important issues

today. To prevent network environment from intrusions, lots of researches and different systems are proposed to detect, filter, or prevent intrusions properly. Recently, the computer worms become more complicated owing to com-bining several signatures of previous known worms. The number of computer worms has grown dramatically to influence the wide computer networks due to the property of easily modifying the source codes of original computer worms to create new variants for escaping the detection of related systems, e.g., Symantec Norton (Symantec, 2005), Network Viruswall (Trend Micro, 2005), etc. They are very difficult for experts to get and analyze the signa-tures because they have incredibly sophisticated character-istics (Kienzle & Elder, 2003; Moore et al., 2003; Weaver et al., 2003). Some researchers have proposed machine learning, data mining, and clustering approaches to dis-cover and learn outliers or abnormalities (the new types of objects) (Lin, Tseng, & Lin, 2002).

Generally speaking, a computer worm usually self-prop-agates via a network to acutely generate a crisis in our sys-tems through the following four stages: target selection, exploitation, infection, and propagation (Weaver et al., 2003). At the Target Selection Stage, a worm performs reconnaissance and simply probes potential victims to see if it’s running a service on a particular port. If the service is running, the worm goes to exploitation stage, in which it compromises the target by exploiting a particular vulner-ability and published exploits. If it succeeds, the worm goes to infection stage, in which it sets up on the newly infected machine. Finally, in propagation stage, the worm starts to spread by choosing new targets. And another victim will enter the next four stages cycle.

In our worm detection prototype system, the knowledge of computer worms can be divided into several KCs, including the service provided by the host may be infected by certain worms and then produced some symptoms in host or network. The related attributes of various com-puter worms can be collected by some probe tools and used to evaluate the ability of VODKA, which deployed in the prototype system.

In order to evaluate VODKA system, an experimental environment shown in Fig. 3 for detecting various com-puter worms is built. In this environment, the victim could receive both the normal traﬃc and the attacking traﬃc

(a) Various number of objects (b) Various number of attributes

2000 1500 1000 500 0 2000 2500 3000 1500 1000 500 0 5 10 15 20 Objects Time (ms) Time (ms) Attributes = 10 Objects = 10 Attributes 25 30 5 7 9 10 11 12

(12)

(various worm behaviors). All received traffic can be trea-ted as normal or attacking behavior, which can be trans-formed as attribute-value pairs. The network traffic collected from Internet is assumed as normal traffic since most attacking behaviors with significant signatures will be filtered by a firewall. The attacking traffic generator is designed to randomly generate various worms attacking traffic to infect the victim. Besides the attacking traffic, some signatures, e.g., the system status, host vulnerability information, and large e-mailing behavior, of the victim infected by worms can be also collected. The probe, such as Nessus (Tenable, 2005), is also used to automatically collect these worm related attributes (symptoms). The clas-sification rules of 15 kinds of worm families, including ori-ginal worms and some variant worms, are extracted using EMCUD and then these worm classification rules are stored into the worm KB. Then, these attributes are used to trigger the corresponding classification rules in the worm KB. If variant worms occurred frequently in a period of time, some candidate worm variants may be discovered by VODKA. Finally, the corresponding embedded rules of variant worms confirmed by experts will be generated to update the worm KB.

To evaluate the effectiveness of VODKA, we generated 20 kinds of test samples including the behaviors of 15 kinds of original worm families and five kinds of new worm fam-ilies to randomly attack the victims. The experimental result shows that VODKA can detect 100% of original worms since the classification rules are stored in the worm KB. For detecting the occurrence of variant worms, VODKA can learn the 85% of variant worms. However, VODKA can detect 80% of new worm families in our experiments because the significant difference of new family can not be discovered easily. Thus, deploying the most complementary configurations in a collaborative framework could be effi-cient in discovering the new created worms.

The following example shows that the variant objects in this domain can be discovered by VODKA, where THCNT

is set to 4, THCFis set to 0.7, and the minimal support is set

to 30%.

Recommendation I. Elicitation of new variants by adding a new attribute value of a minor attribute.

Nimda worm, a famous Email ﬂooding worm, can be propagated to victims through the attached ﬁles in email.

By monitoring the attached ﬁlename in email, we can dis-cover the large itemset L = (A4= readme.exe) shown in

Table 7according to the embedded rule R1,1 inTable 4.

Based upon the large itemsets, VODKA will ask the fol-lowing questions.

VODKA: Does the attribute-value pair (A4=

read-me.exe) belong to any new variant object? EXPERT: Yes.

VODKA: What is the name of the new variant object? EXPERT: Nimda.B.

EXPERT: No.

Consequently, the new variant acquisition table of Nim-da.B shown inTable 8will be generated after interviewing the experts in this iteration.

Hence, an original rule ‘‘IF (DoS type = Email flood) AND (Email attached file = (readme.exe)) THEN Nim-da.B, CF = 0.8” and an embedded rule ‘‘IF (DoS type = Email flood) AND: (Email attached file = (readme.exe)) THEN Nimda.B, CF = 0.5” of the Nimda.B will be gener-ated using E-EMCUD based upon the Nimda.B acquisi-tion table.

Recommendation II. Elicitation of new variants by chang-ing the data types of attributes.

In the priori generation of CodeRed worm, generating numerous threads to attack the victim through launching TCP ﬂooding is one of the famous characteristics. Hence, it is useful to detect the CodeRed by analyzing the gener-ated anomaly threads in the protected system. For the par-tial detection rules of CodeRed shown in Table 9, the following shows how the values in Boolean data type will be logged as the integer value of the attribute A1 instead

of true/false value.

Using the above inference log, the large itemset L = (A1= 600) can be discovered based upon the

embed-ded rule R2,2 since the minimal support 30% is satisﬁed.

Then, VODKA will ask the following questions.

VODKA: Does the attribute-value pair (A1= 600)

belong to any new variant object? EXPERT: Yes.

VODKA: What is the name of the new variant object?

(13)

EXPERT: CodeRed.II.

VODIKA: Is the data type of A1required to be changed?

EXPERT: Yes.

VODKA: Is any new attribute required to be added? EXPERT: No.

VODKA: Can the Single Value data type be used to change the original Boolean data type of A1?

EXPERT: Yes.

VODKA: What is the new name and new value set of the attribute A1?

Expert: Threads, (100, inﬁnite).

VODKA: What is the new value of the original True value of the attribute A1in CodeRed?

EXPERT: 100.

Therefore, the new variant acquisition table of CodeRe-d.II shown inTable 10 will be generated.

Consequently, an original rule ‘‘IF (Threads = 600) AND (System reboot = True) AND (DoS type = TCP

ﬂood) THEN CodeRed.II, CF=0.9” and an embedded rule ‘‘IF : (Threads = 600) AND (System reboot = True) AND (DoS type = TCP ﬂood) THEN CodeRed.II, CF = 0.3” will be generated to classify the CodeRed.II. Recommendation III. Elicitation of new object by adding new attributes.

As mentioned in Example 2, we can obtain the large itemsets L = {(A2= False); (A5= X); (A2= False,

A5= X)}, which will be used to elicit the embedded rules

of the new variant. The symbol ‘‘X” means no attribute value of A5is logged, similar to ‘‘Do not care” attribute,

and (A2= False, A5= X) will also be pruned too.

There-fore, VODKA system will ask the following questions. VODKA: Does the attribute-value pair (A2= False)

belong to the new variant object? EXPERT: Yes.

VODKA: What is the name of the new variant object?

Table 7

The partial inference logs of Nimda

R1,1 6 True Email ﬂood readme.exe {137} Nimda 0.4

R1,1 100 False Email ﬂood sample1.exe {25} Nimda 0.4

R1,1 14 False Email ﬂood readme.exe {25} Nimda 0.4

R1,1 44 False Email ﬂood hash.exe {25} Nimda 0.4

R1,1 100 False Email ﬂood inter.exe {25} Nimda 0.4

Table 8

The new variant acquisition table of Nimda.B

Attributes Object

Nimda.B

Threads X

System reboot X

DoS type Email ﬂood

Email attached ﬁle {readme.exe}

TCP port X

Table 9

The partial inference logs of CodeRed

R2,2 150 True TCP ﬂood – {80} CodeRed 0.4

R2,1 100 False TCP ﬂood – {80} CodeRed 0.6

Table 10

The new variant acquisition table of CodeRed.II

Attributes Objects

CodeRed CodeRed.II

Threads 100 600

System reboot True True

DoS type TCP ﬂood TCP ﬂood

Email attached ﬁle X X

(14)

EXPERT: Blaster.B.

EXPERT: Yes.

VODKA: Is any new attribute required to be added? EXPERT: Yes.

VODKA: What is the name and value set of the new attribute?

EXPERT: UDP port, (0, 65535).

VODKA: What is the value of the UDP port attribute in Blaster.B and Blaster?

EXPERT: 69, X (means Do not care).

Hence, the Blaster.B acquisition table shown inTable 11 is generated.

Consequently, a new original rule ‘‘IF (System reboot = Flase) AND (DoS type = Windows update ﬂood) AND (UDP port = {69}) THEN Blaster.B, CF = 0.8” and an embedded rule ‘‘IF (System reboot = Flase) AND (DoS

type = Windows update ﬂood) AND: (UDP port = {69}) THEN Blaster.B, CF = 0.5” of new variant Blaster.B will be generated to classify Blaster.B.

As shown inTable 12, four variants (Nimda.B, CodeR-ed.II, Blaster.B, and Welchia.II) have been successfully sin-gled out using VODKA after several iterations. Table 13 shows the AOT table after interacting with domain experts using E-EMCUD, andTable 14shows the new embedded rule base of the discovered variants and original worms. If more real instances can be used, the embedded rule base will evolve and become more precise for classifying the computer worms.

4.3. The case study of e-learning

With the vigorous development of the Internet, e-learn-ing systems include-learn-ing online learne-learn-ing, employee traine-learn-ing courses, and e-book, have become more and more popular over the world in the past ten years (Beishuizen & Stout-jesdijk, 1999; Hwang, 1999; Yoshikawa, Shintani, & Ohba, 2000). As we know, if the same teaching materials (the knowledge of teachers) are provided to all e-learners based on the predefined strategies or the predefined learning maps, the learning efficiency will be diminished. Therefore, teachers want to apply appropriate teaching strategy to provide personalized learning content and learning sequence for e-learners to improve their learning efficiency. Thus, adaptive learning environments (Shang, Shi, & Chen, 2001; Sheremetov & Arenas, 2002; Triantafllou, Poportsis, & Demetriadis, 2003; Tsai & Tseng, 2002) have been proposed to offer different teaching materials for

dif-Table 11

The new variant acquisition table of Blaster.B

Attributes Objects

Blaster Blaster.B

100-thread (A1) X X

System reboot (A2) True False

DoS type (A3) Windows Update

ﬂood

Windows Update ﬂood

Email attached ﬁle (A4)

X X

TCP port (A5) {135; 4444} {135; 4444}

UDP port (A6) X 69

Table 12

The adjusted main acquisition table of simple computer worms

Attributes Objects

Nimda Nimda.B CodeRed CodeRed.II Blaster Blaster.B Welchia Welchia.II

Threads X X 100 600 X X X X

System reboot X X True True True False True True

DoS type Email flood Email flood TCP flood TCP flood Windows

Update flood Windows Update flood ICMP flood ICMP flood Email attached file {samle.exe; puta!!scr} {readme, exe} X X X X X X TCP port X X X X {135; 4444} {135; 4444} {80;135} {80; 135; 445; 3127} UDP port X X X X X 69 X X Table 13

AOT table of simple computer worms

Attributes Objects

Nimda Nimda.B CodeRed CodeRed.II Blaster Blaster.B Welchia Welchia.II

Threads X X 2 2 X X X X

System reboot X X 1 1 2 2 2 2

DoS type D D 1 1 2 2 1 1

Email attached ﬁle 1 1 X X X X X X

TCP port X X X X 1 1 2 2

(15)

ferent learners in accordance with their aptitudes and eval-uation results. After learners learned the teaching materials through the adaptive learning environment, the teachers can further analyze the historical learning records and then reﬁne or reorganize the teaching materials and tests if needed. Therefore, more and more attention has been paid to the research of personalized instruction in computer education environment. However, it is diﬃcult to monitor the change of learner’s behaviors for teachers quickly.

As we know, the quiz for learners is useful to evaluate their learning achievement. For example, teachers should provide easier learning content or learning sequence for the learners with lower learning achievement. Hence, VODKA provides a good idea to assist teachers in observ-ing the occurrence of variant learnobserv-ing behaviors of e-learn-ers through a sequence of grades of online quiz.

In this case, the objects to be classified are defined as learning behaviors of learners, where each behavior con-sists of profiles, learning sequence, and a quiz grade of the learner, where learning sequence can be used to gener-ate approximgener-ate teaching mgener-aterials for matching the lear-ner’s needs. The learners can be firstly clustered into several groups according to the similarity of the learning behaviors, and teachers can provide appropriate learning content for each group in advance. However, e-learners might change their learning sequences due to the different learning situation, learning equipments (desktop, PDA, etc.), course content (text, video, etc.), and learning time (day or night). This causes the evolution of learning behav-iors of e-learners and results in various learning achievements.

Assume that the on-line testing system is implemented in an intelligent e-learning system. VODKA can monitor the

variant learning behaviors to evaluate the learning perfor-mance of each e-learner, where each learner has different learning sequences. The teachers will be then notified to generate a suitable learning sequence and apply these mate-rials on those e-learners with variant learning behaviors. Here, each learning sequence deviated from one of prede-fined learning sequences will be treated as a variant learn-ing behavior. In e-learnlearn-ing, it is important for e-learners to gain a good grade after learning some materials with a specific learning sequence. Hence, the grade of quiz is trea-ted as a CF value for collecting these good variant learning sequences. If many learners gained grades higher than a threshold with similar or same learning sequence (high fre-quency), then some good variant learning sequences will be discovered to notify teachers to determine these new learn-ing sequences. Therefore, the log is collected as the pair of hLSi, CFii, where LSiis the learning sequence of the

e-lear-ner i; and the CFiis the grade of this learner. Example 3

illustrates the concept of e-learning using VODKA.

Table 14

The rules generated fromTables 12 and 13

R1,0: IF (DoS type = Email ﬂood) AND (Email attached ﬁle = (sample.exe; puta!!scr)) THEN Nimda, CF = 0.8

R1,1: IF (DoS type = Email ﬂood) AND (Email attached ﬁle =: (sample.exe; puta!!scr)) THEN Nimda, CF = 0.4

R2,0: IF (Threads = 100) AND (System reboot = True) AND (DoS type = TCP ﬂood) THEN CodeRed, CF = 0.8

R2,1: IF (Threads = 100) AND: (System reboot = True) AND (DoS type = TCP ﬂood) THEN CodeRed, CF = 0.6

R2,2: IF: (Threads = 100) AND (System reboot = True) AND (DoS type = TCP ﬂood) THEN CodeRed, CF = 0.4

R2,3: IF (Threads = 100) AND: (System reboot = True) AND : (DoS type = TCP ﬂood) THEN CodeRed, CF = 0.4

R3,0: IF (System reboot = True) AND (DoS type = Windows update ﬂood) AND (TCP port = {135; 4444}) THEN Blaster, CF = 0.7

R3,1: IF (System reboot = True) AND (DoS type = Windows update ﬂood) AND: (TCP port = {135; 4444}) THEN Blaster, CF = 0.57

R3,2: IF (System reboot = False) AND (DoS type = Windows update ﬂood) AND (TCP port = {135; 4444}) THEN Blaster, CF = 0.43

R3,3: IF (System reboot = True) AND: (DoS type = Windows update ﬂood) AND (TCP port = {135; 4444}) THEN Blaster, CF = 0.43

R3,2: IF (System reboot = False) AND (DoS type = Windows update ﬂood) AND: (TCP port = {135; 4444}) THEN Blaster, CF = 0.3

R4,0: IF (System reboot = True) AND (DoS type = ICMP ﬂood) AND (TCP port = {80; 135}) THEN Welchia, CF = 0.8

R4,1: IF (System reboot = True) AND: (DoS type = ICMP ﬂood) AND (TCP port = {80; 135}) THEN Welchia, CF = 0.67

R4,2: IF (System reboot = True) AND (DoS type = ICMP ﬂood) AND: (TCP port = {80; 135}) THEN Welchia, CF = 0.53

R4,3: IF (System reboot = True) AND: (DoS type = ICMP ﬂood) AND : (TCP port = {80; 135}) THEN Welchia, CF = 0.4

R5,0: IF (DoS type = Email ﬂood) AND (Email attached ﬁle = readme.exe) THEN Nimda.B, CF = 0.8

R5,1IF (DoS type = Email ﬂood) AND: (Email attached ﬁle= readme.exe) THEN Nimda.B, CF = 0.5

R6,0: IF: (Threads = 600) AND (System reboot = True) AND (DoS type = TCP ﬂood) THEN CodeRed II, CF = 0.9

R6,1: IF (Threads = 600) AND (System reboot = True) AND (DoS type = TCP ﬂood) THEN CodeRed II, CF = 0.3

R7,0: IF (System reboot = False) AND (DoS type = Windows update ﬂood) AND (UDP port = {69}) THEN Blaster.B, CF = 0.8

R7,1: IF (System reboot = False) AND (DoS type = Windows update ﬂood) AND: (UDP port = {69}) THEN Blaster.B, CF = 0.5

R8,0: IF (System reboot = True) AND (DoS type = ICMP ﬂood) AND (TCP port = {80; 135; 445; 3127}) THEN Welchia II, CF = 0.8

R8,1: IF (System reboot = True) AND (DoS type = ICMP ﬂood) AND: (TCP port = {80; 135; 445; 3127}) THEN Welchia II, CF = 0.5

Table 15

The learning sequence of students

Student ID Learning sequence

1 hB, C, A, D, E, F, G, H, I, Ji 2 hA, B, H, D, E, F, C, G, I, Ji 3 hA, D, F, G, H, B, C, I, Ji 4 hA, B, D, E, C, F, G, Hi 5 hA, C, J, F, B, H, D, E, I, Gi 6 hB, H, F, D, E, A, G, C, Ii 7 hA, J, E, H, B, C, I, D, Gi 8 hB, C, G, E, A, H, D, I, J, Fi 9 hC, E, G, F, J, B, H, A, Di 10 hB, C, A, J, D, E, G, H, F, Ii