
Chapter 2. Genetic Algorithm Based Fuzzy ID3 Method

2.6. Rule Pruning

We have used GA to improve the classification accuracy and to decrease the number of rules. However, some redundant rules may still remain in the rule-base. A simple but effective scheme for rule minimization is therefore described as follows:

1 ) When an example is classified, we maintain the product of the membership values and the certainty of each class. For example, $P_i^{(j)}$ represents the possibility of class $j$ estimated by rule $i$.

2 ) For each $j$ ($1 \le j \le n$), the possibility corresponding to the correct class label of the example gets a positive sign and the others get a negative sign. For example, if the example belongs to class 1, then $P_i^{(1)}$ is positive and the others ($P_i^{(2)}$, …) are negative.

3 ) Sum the signed possibilities of all the classes estimated by rule $i$ ($P_i^{(1)}$, $P_i^{(2)}$, …); the result is the credit of rule $i$ for classifying this example.

4 ) Consider the next test example and repeat from step 1 ) until all the test examples have been estimated by rule $i$; this gives the total credit of rule $i$.

5 ) Repeat from step 1 ) for each $i$ ($1 \le i \le r$) so that the total credits of all the rules are computed.

6 ) Remove the redundant rules whose total credit is less than a threshold.
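The six steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: each rule is modelled as a hypothetical function that returns, for one example, the list of possibilities of class j estimated by that rule (classes indexed from 0), and the names `rule_credit`, `total_credits` and `prune` are introduced here for illustration and do not come from the original implementation.

```python
from typing import Callable, List, Sequence

# Assumption: a rule maps an example to its class possibilities,
# i.e. element j is the possibility of class j estimated by this rule.
Rule = Callable[[Sequence[float]], Sequence[float]]


def rule_credit(possibilities: Sequence[float], true_class: int) -> float:
    """Steps 2-3: the correct class counts positively, all others negatively."""
    return sum(p if j == true_class else -p
               for j, p in enumerate(possibilities))


def total_credits(rules: Sequence[Rule],
                  examples: Sequence[Sequence[float]],
                  labels: Sequence[int]) -> List[float]:
    """Steps 4-5: accumulate the credit of every rule over all examples."""
    return [sum(rule_credit(rule(x), y) for x, y in zip(examples, labels))
            for rule in rules]


def prune(rules: Sequence, credits: Sequence[float], threshold: float) -> list:
    """Step 6: drop the rules whose total credit is below the threshold."""
    return [r for r, c in zip(rules, credits) if c >= threshold]
```

A rule that mostly supports the wrong class accumulates a negative total credit and is removed first, which is the behaviour the scheme relies on.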

For example, after obtaining the total credit of each rule for the small training set shown in Table I, we sort and plot the total credits of all the rules, as shown in Fig. 2.7.

We find that after about the 5-th rule the slope takes a sudden turn and the curve drops rapidly. This means that the 6-th and 7-th rules may be bad or redundant rules. Hence we can select a threshold between 0.034 and 0.752, and the redundant rules are then removed. The pruned fuzzy decision tree for the small training set of Table I is shown in Fig. 2.8, and the entire process of our method is schematized in Fig. 2.9.
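The threshold in the text is read off the plot by eye. One way to automate that reading, offered here only as an assumed heuristic and not as the method used in this work, is to sort the credits and place the threshold in the middle of the largest drop between neighbouring values. In the sketch below only the two values 0.752 and 0.034 are taken from the text; all the other credits are made up for illustration, and `gap_threshold` is a hypothetical helper name.

```python
def gap_threshold(credits):
    """Return the midpoint of the largest drop in the sorted credits."""
    ranked = sorted(credits, reverse=True)
    # Index k of the pair (ranked[k], ranked[k + 1]) with the steepest drop.
    k = max(range(len(ranked) - 1), key=lambda i: ranked[i] - ranked[i + 1])
    return (ranked[k] + ranked[k + 1]) / 2


# Illustrative credits for seven rules; only 0.752 and 0.034 appear in the text.
example_credits = [1.30, 1.15, 1.02, 0.88, 0.752, 0.034, -0.21]
threshold = gap_threshold(example_credits)
kept = [c for c in example_credits if c >= threshold]
```

With these illustrative numbers the threshold falls between 0.034 and 0.752, so the last two rules are pruned and five are kept.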


[Flowchart. Box labels: Start; Load training data; Feature ranking; Randomly generate fuzzy sets; Construct the decision trees; Extract rules from the decision trees; GA terminate? (Yes/No); GA operators; Validation testing data; Compute each rule credit; Prune the rule-base; Accuracy acceptable? (Yes/No); Restore the original rule-base; Tune fuzzy sets by GA but the same feature ranking; Classify test data by remained rules; End.]

Fig. 2.9. Steps in GA based fuzzy ID3 method.

Chapter 3. Web Log-File Classification

3.1. Introduction to Web-Mining

In the past ten years, the Internet has changed a great deal and has become closely tied to our daily life. As it has developed to meet people's needs, its model has become more and more complete. Through the Internet, people can obtain information from websites and exchange information with one another. Because of its convenience and the unlimited nature of cyberspace, the Internet has created a new business model, “e-commerce”, that has changed the traditional business model. People who want to buy or sell something through the Internet can get the information from a virtual website and need no physical store, which is a fundamental change to the traditional business world. When people connect to a website, the web server records the IP of the user, the background of the user, the visited web pages, the visit time, etc. in its log-file. The log-file format follows the W3C specification referred to in [22] and [23]. The web server keeps the records of all consumers’ transactions, so the log-file grows very large as recording goes on day after day. How to extract the important regularities or domain knowledge hidden in the log-file is important to an enterprise that wants to improve its content services for potential users.

3.2. Data Preparation

The data mining system [3] is used to extract useful information from the web log-file. We have to convert the original log-file to a format suitable for the database in this system. We selected Microsoft Access 2000, a relational database, as the log-file database format. When the log-file is converted from a text file to an Access database file, the attributes (fields) of the log-file must be defined. The fields are defined as follows:

from_ip: the IP address that the consumer came from.

date: the date on which the consumer logged in to the website.

time: the time at which the consumer logged in to the website.

status: the HTTP status code returned to the client.

dest: the hyperlink that the consumer clicked.

The log-file table obtained after this conversion is shown in Fig. 3.1. The consumer table holds the basic characteristics of the members of the website. The fields in the consumer table are defined as follows:

fromip: the IP of the consumer.

name: the consumer’s name.

sex: the sex of the consumer.

age: the age of the consumer.

education: the education level of the consumer.

occupation: the occupation of the consumer.

income: the total income in a year.

Fig. 3.2 shows the consumer table. The log-file table and the consumer table are two important elements of the web log-file mining system. The consumer table provides the basic personal information about a consumer. The log-file table registers the entries to and exits from the website’s pages, and its records are used for analysis later.

Fig. 3.1. The log-file table.

Fig. 3.2. The consumer table.

The web server records the hyperlink pages that a consumer clicks. From these hyperlink pages we obtain six categories of products: books, compact disk (CD), computer, mobile, PDA, and electronic commerce (EC). The metadata table shown in Fig. 3.3 provides the numeric codes of the hyperlink pages.

Because the log-file mixes useful and non-useful data, we cannot use it directly in a data mining algorithm. We extract the useful data by converting it to a particular format according to the database settings defined beforehand, and we clean out the non-useful data. The major purpose of data cleaning is to reduce the redundancy of the log-file and to convert the text file into an Access database. The Structured Query Language (SQL) can then be applied to maintain the log-file database with operations such as query, insert, update, and delete.

We compare the log-file table with the consumer table to find each consumer’s personal data, such as sex, age, etc. In this way 2604 examples of web log-file data are generated, as shown in Fig. 3.4. The web log-file data stores records composed from the log-file, consumer, and metadata tables, and its values may be discrete or continuous. The training sample table is very important for analyzing the web log-file. It provides the consumers’ sex, age, education level, occupation, and total income in one year, as well as their spending time and favorite pages. The training sample table also helps us to explain the behavior of the consumers. Through the fuzzy sets of the values of the linguistic variables, the GA based fuzzy ID3 method generates fuzzy inference rules.
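The matching of the log-file table against the consumer table can be pictured as a relational join on the IP field. The sketch below uses Python's built-in `sqlite3` module in place of Microsoft Access purely so that it is self-contained; the field names follow the definitions given above, while the table names `log_file` and `consumer`, the sample rows, and the choice of keeping only requests with status 200 as the cleaning rule are assumptions made for illustration.

```python
import sqlite3


def build_samples(conn):
    """Join each successful request with the personal data of its consumer."""
    query = """
        SELECT c.sex, c.age, c.education, c.occupation, c.income, l.dest
        FROM log_file AS l
        JOIN consumer AS c ON l.from_ip = c.fromip
        WHERE l.status = 200          -- assumed cleaning rule
        ORDER BY l.date, l.time
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE log_file (from_ip TEXT, date TEXT, time TEXT,
                           status INTEGER, dest TEXT);
    CREATE TABLE consumer (fromip TEXT, name TEXT, sex TEXT, age INTEGER,
                           education TEXT, occupation TEXT, income TEXT);
""")
# Made-up rows: one valid visit, one failed request, one unknown visitor.
conn.executemany("INSERT INTO log_file VALUES (?, ?, ?, ?, ?)", [
    ("192.0.2.10", "2003-05-01", "10:15:00", 200, "/books/index.html"),
    ("192.0.2.10", "2003-05-01", "10:16:00", 404, "/missing.html"),
    ("192.0.2.99", "2003-05-01", "11:00:00", 200, "/cd/index.html"),
])
conn.execute("INSERT INTO consumer VALUES (?, ?, ?, ?, ?, ?, ?)",
             ("192.0.2.10", "A. Example", "Woman", 23,
              "University", "Student", "Below 20K"))
samples = build_samples(conn)
```

The inner join drops requests from visitors who are not in the consumer table, so only records with known personal attributes reach the training sample table.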

Fig. 3.3. The metadata table.

Fig. 3.4. The web log-file data.

3.3. Data Analysis

The web log-file data with 2604 examples has six attributes and six classes named books (class 1), CD (class 2), computer (class 3), mobile (class 4), PDA (class 5), and EC (class 6). There are two continuous attributes, and four discrete attributes, as shown in Table III.

The class distribution of the web log-file data is shown in Fig. 3.5. The class with the highest proportion is “Books” (44%), and the one with the lowest is “EC” (nearly 0%). We also count how many times each example is repeated and record the class distribution of the repeated examples, as shown in Fig. 3.6. For example, the 14-th example is identical to the 6-th example, and there are nine examples in the training data identical to it. Among these nine examples, three belong to class 1, three to class 3, and three to class 5, so the probability of class 1 among the nine examples is 0.33. The probability of each class is shown in Fig. 3.7.

TABLE III

DETAILS OF THE ATTRIBUTES

Type        Attribute        Attribute range
Continuous  Age (years old)  {15 - 41}
Continuous  Spend time       {0.002227 - 1}
Discrete    Sex              {Man, Woman}
Discrete    Education        {Below junior, Junior, Senior, University, MS/PHD}
Discrete    Occupation       {Student, Public, Finance, Service, Information}
Discrete    Income           {Below 20K, 20K-40K, 40K-60K, 60K-80K, 80K-100K}

Fig. 3.5. The class distribution of the web log-file data.

Fig. 3.6. Statistical analysis of the data repeated.

Fig. 3.7. The probability of each class.

The web log-file data is not consistent, because there exist people with exactly the same attributes but different favorites. This situation always occurs in the real world. In the testing process, we can predict only one class for each input example. Therefore, even if every group of identical examples is assigned its majority class, some examples in this data set will never be classified correctly. To distinguish the repeated examples clearly, we would need to add further attributes, but this is limited by the source from which the data was acquired. For example, suppose there are ten examples with exactly the same attributes, seven of which are in class 1 and three in class 2. If all ten examples are classified as class 1, the accuracy is 70%, and three examples are inevitably classified wrongly. Thus our purpose is to find the regularity of the most important class, i.e., the first choice. We will provide information about the behavior of the user through the fuzzy ID3 method, and the rule-base obtained has to be simple and efficient.
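The ten-example case generalizes to an upper bound on the accuracy that any classifier can reach on inconsistent data: group the examples by their attribute values and count the majority class of each group as correct. The following is a minimal sketch of that computation; the function name `best_achievable_accuracy` and the attribute values in the usage lines are hypothetical.

```python
from collections import Counter, defaultdict


def best_achievable_accuracy(examples):
    """examples: iterable of (attributes, class_label) pairs.

    A classifier that outputs one class per attribute vector can at best
    get the majority class of every group of identical vectors right.
    """
    groups = defaultdict(Counter)
    total = 0
    for attributes, label in examples:
        groups[tuple(attributes)][label] += 1
        total += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / total


# Ten examples with identical attributes: seven in class 1, three in class 2.
same = ("Man", 25, "Student")
ten_examples = [(same, 1)] * 7 + [(same, 2)] * 3
bound = best_achievable_accuracy(ten_examples)
```

For the ten examples above the bound is 7/10, matching the 70% stated in the text.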

Chapter 4. Simulation and Experiment

In Chapter 2 we introduced the fuzzy ID3 algorithm, whose membership functions and leaf conditions are tuned by GA. In this chapter, we apply the algorithm to classify several data sets, which include continuous, discrete, and mixed-mode data sets [7], [8]. Finally, we analyze a daily log-file and generate a rule-base about consumer behavior that the web master can use to maintain and promote the website content.

4.1. The Data Sets

The ten data sets employed for the experiments are obtained from the University of California, Irvine, Repository of Machine Learning databases (UCI) [24]. Their characteristics are briefly described below and summarized in Table IV.

1 ) Crude_oil: Gerrid and Lantz analyzed Crude_oil samples from three zones of sandstone. The Crude_oil data set with 56 examples has five attributes and three classes named wilhelm, submuilinia, and upper. The attributes are vanadium (in percent ash), iron (in percent ash), beryllium (in percent ash), saturated hydrocarbons (in percent area), and aromatic hydrocarbons (in percent area).

2 ) Glass Identification Database: The data set represents the problem of identifying glass samples taken from the scene of an accident. The 214 examples were originally collected by B. German of the Home Office Forensic Science Service at Aldermaston, Reading, Berkshire in the UK. The nine attributes are all real valued and fully known, representing refractive index and the percent weight of oxides such as silicon, sodium, and magnesium. The six classes are named as building windows float processed, building windows not float processed, vehicle windows float processed, containers, tableware, and headlamps.

3 ) Iris Plant Database: The Iris data set, Fisher’s classic test data (Fisher, 1936), has three classes with four-dimensional data consisting of 150 examples. The four attributes are: sepal length, sepal width, petal length, and petal width. This data set gives good results with almost all classic learning methods and has become a sort of benchmark data.

4 ) Myo_electric: The Myo_electric data set is extracted from a problem in discriminating between electrical signals observed at the human skin surface. This is a four-dimensional data set consisting of 72 examples divided into two classes.

5 ) Norm4: The data set has 800 examples consisting of 200 examples each from the four components of a mixture of four class 4-variate normals.

6 ) BUPA liver disorders: This UCI data set was donated by R. S. Forsyth. The problem is to predict whether or not a male patient has a liver disorder based on blood tests and alcohol consumption. There are two classes, six continuous attributes, and 345 examples.

7 ) Promoter Gene Sequences Database: Promoters have a region where a protein (RNA polymerase) must make contact and the helical DNA sequence must have a valid conformation so that the two pieces of the contact region spatially align. The data set with 106 examples has 57 attributes and two classes. All attributes are discrete.

8 ) StatLog Project Heart Disease dataset: This UCI data set is from the Cleveland Clinic Foundation, courtesy of R. Detrano. The problem concerns the prediction of the presence or absence of heart disease given the results of various medical tests carried out on a patient. There are two classes, seven continuous attributes, six discrete attributes, and 270 examples.

9 ) Golf: The data set with 28 examples has four attributes and two classes named play, and don’t play. There are 2 continuous and 2 discrete attributes. The attributes are outlook, temperature, humidity, and windy.

10) StatLog Project Australian Credit Approval: This credit data originates from Quinlan. This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data. The Australian data set with 690 examples has 14 attributes and two classes. There are 6 continuous and 8 discrete attributes.

TABLE IV

SUMMARY OF THE DATABASES EMPLOYED

Data set      # of examples  # of attributes  # of continuous attributes  # of classes
Crude_oil     56             5                5                           3
Glass         214            9                9                           6
Iris          150            4                4                           3
Myo_electric  72             4                4                           2
Norm4         800            4                4                           4
Bupa          345            6                6                           2
Promoters     106            57               0                           2
Heart         270            13               6                           2
Golf          28             4                2                           2
Australian    690            14               6                           2

4.2. Comparison

The performance of our rule-base on the above data sets is shown in Table V. For this evaluation we use all the examples as the training data and the same examples as the testing data. In rule pruning, we remove redundant rules as long as the learning accuracy is kept or only slightly reduced, which we consider acceptable. Following feature subset selection [25], we consider only five attributes for classifying the Glass data set, namely Na, Mg, Al, K, and Ba. If we do not reduce the attributes of the Glass data set, we get more rules after tree construction, but this does not help to increase the learning accuracy.

TABLE V

PERFORMANCE OF THE RULE-BASE ON DIFFERENT DATA SETS

              Before rule pruning        After rule pruning
Data set      # of rules  Training acc.  # of rules  Training acc.
Crude_oil     9.0         100.0          7.0         98.2
Glass         61.0        77.6           12.0        76.2
Iris          7.0         99.3           4.0         98.7
Myo_electric  4.0         98.6           2.0         98.6
Norm4         14.0        97.0           10.0        96.8
Bupa          15.0        76.5           11.0        76.5
Promoters     7.0         85.9           4.0         85.9
Heart         15.0        87.4           12.0        85.9
Golf          9.0         100.0          7.0         100.0
Australian    5.0         87.0           4.0         87.0

From Table V, we find that for the Myo_electric, Bupa, Promoters, Golf, and Australian data sets the accuracy remains the same before and after rule pruning. For the others, there is a small degradation in accuracy. This possibly happens because the rule pruning process has removed some rules that correctly classified a few examples, and the residual rule-base is not able to classify these examples correctly. We can also see that the number of rules decreases for all data sets, which shows the effectiveness of our rule pruning process.

So far, we have not evaluated the generalization ability of the rules extracted by our scheme. Next, we do so and also compare our results with the outputs of C5.0 [6]. We choose C5.0 because it works well for many decision-making problems and is the successor of C4.5. For each data set considered, 50% of the data is chosen uniformly at random as the training set and the remaining 50% is held out for testing. This procedure is repeated six times. Note that the demonstration version of C5.0, which can be downloaded for free from RuleQuest Research Data Mining Tools [6], is limited to 400 examples. We use this demonstration version of C5.0 to construct the following tables.
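The sampling procedure described above can be sketched as follows. This is an assumed illustration using only the Python standard library: the generator name `half_splits`, the fixed seed, and the use of index lists are choices made here and are not taken from the original experiments.

```python
import random


def half_splits(n_examples, repeats=6, seed=0):
    """Yield (train_indices, test_indices) for repeated random 50/50 splits."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    indices = list(range(n_examples))
    half = n_examples // 2
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)          # uniform random permutation
        yield shuffled[:half], shuffled[half:]


# Example: six splits of a data set with 150 examples (the size of Iris).
splits = list(half_splits(150))
```

Swapping the roles of the two halves of each split gives the two folds of a two-fold cross validation, which is how the tables below describe the experiment.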

Table VI compares the accuracy of our rule-base with that of C5.0. It records the testing accuracy from two-fold cross validation repeated six times on each data set. On average, our rule-base outperforms C5.0 on eight out of the ten data sets. Thus, except for Glass and Bupa, our system has a better generalization ability than C5.0. The results of our rule-base and C5.0 were also compared with respect to their average numbers of rules. Table VII compares the numbers of rules generated by our rule-base and by C5.0 in the same experiment on these data sets. We find that our rule-base produces fewer rules than C5.0 on five out of the ten data sets. Moreover, the average number of rules over all data sets is 7.18 for our rule-base, which is less than the 8.17 of C5.0. Our approach therefore tends to produce more compact rule sets than C5.0.
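The two overall averages and the count of five data sets can be checked directly from the "Avg. rules" column of Table VII. The short sketch below only reproduces that arithmetic; the variable names are chosen here for illustration.

```python
from statistics import mean

# "Avg. rules" column of Table VII, in the order: Crude_oil, Glass, Iris,
# Myo_electric, Norm4, Bupa, Promoters, Heart, Golf, Australian.
ours = [5.4, 14.3, 4.5, 2.8, 12.3, 5.3, 6.5, 12.5, 5.2, 3.0]
c50 = [4.1, 9.8, 3.4, 3.6, 13.5, 13.4, 7.6, 11.8, 3.3, 11.2]

avg_ours = round(mean(ours), 2)   # overall average for our rule-base
avg_c50 = round(mean(c50), 2)     # overall average for C5.0
fewer_rules = sum(o < c for o, c in zip(ours, c50))  # data sets where ours is smaller
```

The five data sets on which our rule-base is smaller are Myo_electric, Norm4, Bupa, Promoters, and Australian.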

TABLE VI

COMPARISON OF THE ACCURACY RATES

Testing acc. (two-fold CV repeated six times)

Data set      Algorithm      1     2     3     4     5     6     Avg. acc.
Crude_oil     Our rule-base  85.7  87.5  75.0  83.9  73.2  82.1  81.2*
              C5.0           76.8  78.6  80.4  80.4  76.8  75.0  78.0
Glass         Our rule-base  64.0  66.4  65.4  61.2  64.0  63.1  64.0
              C5.0           65.9  67.8  65.0  67.3  66.4  69.6  67.0*
Iris          Our rule-base  96.0  93.3  94.7  96.0  95.3  94.0  94.9*
              C5.0           92.0  94.7  92.0  92.7  91.3  92.7  92.6
Myo_electric  Our rule-base  81.9  93.1  83.3  91.7  91.7  91.7  88.9*
              C5.0           83.3  90.3  79.2  86.1  93.1  88.9  86.8
Norm4         Our rule-base  93.8  94.4  94.4  95.3  94.3  92.5  94.1*
              C5.0           89.8  91.3  91.3  90.6  91.8  89.9  90.8
Bupa          Our rule-base  60.9  59.7  64.6  64.1  64.4  64.1  63.0
              C5.0           65.8  62.3  65.8  68.7  63.5  64.0  65.0*
Promoters     Our rule-base  76.4  76.4  68.9  76.4  79.3  75.5  75.5*
              C5.0           75.5  74.5  69.8  71.7  78.3  78.3  74.7
Heart         Our rule-base  76.7  80.0  77.4  78.2  78.2  77.0  77.9*
              C5.0           74.1  77.0  76.3  77.8  79.6  73.3  76.4
Golf          Our rule-base  92.9  67.9  60.7  85.7  82.1  78.6  78.0*
              C5.0           82.1  71.4  57.1  71.4  78.6  71.4  72.0
Australian    Our rule-base  84.6  84.1  85.5  84.8  84.4  84.4  84.6*
              C5.0           83.2  84.5  85.4  85.8  84.8  83.1  84.5

TABLE VII

COMPARISON OF THE NUMBER OF THE RULES

# of rules (two-fold CV repeated six times)

Data set      Algorithm      1     2     3     4     5     6     Avg. rules
Crude_oil     Our rule-base  5.5   5.0   5.5   6.0   5.0   5.5   5.4
              C5.0           4.0   4.0   5.0   4.0   4.5   3.0   4.1*
Glass         Our rule-base  20.0  12.5  13.0  15.0  16.0  9.0   14.3
              C5.0           10.0  9.5   13.5  7.0   9.5   9.0   9.8*
Iris          Our rule-base  4.5   3.0   4.5   5.0   5.0   5.0   4.5
              C5.0           4.0   3.5   3.0   4.0   3.0   3.0   3.4*
Myo_electric  Our rule-base  2.5   2.5   2.5   3.5   2.0   3.5   2.8*
              C5.0           3.5   3.0   3.5   4.0   3.5   4.0   3.6
Norm4         Our rule-base  12.0  9.5   17.0  12.0  13.0  10.0  12.3*
              C5.0           14.5  14.5  13.5  12.5  11.5  14.5  13.5
Bupa          Our rule-base  5.5   5.0   3.5   7.0   7.0   4.0   5.3*
              C5.0           14.0  9.5   17.0  13.0  16.0  11.0  13.4
Promoters     Our rule-base  5.0   1.5   3.5   12.5  8.0   8.5   6.5*
              C5.0           9.0   7.0   8.5   8.0   5.5   7.5   7.6
Heart         Our rule-base  16.0  9.5   15.0  14.0  7.0   13.5  12.5
              C5.0           11.0  12.0  12.5  11.5  12.5  11.5  11.8*
Golf          Our rule-base  5.0   3.5   4.5   6.0   5.5   6.5   5.2
              C5.0           5.0   4.5   2.5   2.5   3.0   2.5   3.3*
Australian    Our rule-base  3.5   3.0   2.5   2.0   3.5   3.5   3.0*
              C5.0           8.5   10.5  13.5  11.5  14.0  9.0   11.2

TABLE VIII

THE BEST PERFORMANCE COMPARISON

              Our rule-base                             C5.0 rule-base
Data set      # of rules  Training acc.  Testing acc.  # of rules  Testing acc.
Crude_oil     5.0         100.0          87.5*         4.0         80.4
Glass         12.5        77.6           66.4          9.0         69.6*
Iris          4.5         100.0          96.0*         3.5         94.7
Myo_electric  2.5         97.2           93.1          3.5         93.1
Norm4         12.0        96.0           95.3*         11.5        91.8
Bupa          3.5         72.5           64.6          13.0        68.7*
Promoters     8.0         89.6           79.3*         5.5         78.3
Heart         9.5         85.2           80.0*         12.5        79.6
Golf          5.0         100.0          92.9*         5.0         82.1
Australian    2.5         88.6           85.5          11.5        85.8*

Table VIII lists, for our rule-base and for C5.0, the maximum of the six testing accuracies in Table VI, together with the number of rules in the corresponding experiment. With respect to the testing accuracy shown in Table VIII, our rule-base is still superior to C5.0 on six data sets and ties on one. Although the discrete attributes have no membership functions to be tuned by GA, the performance on the discrete and mixed-mode data sets is still better than that of C5.0.

4.3. Classification of the Web Log-File

We now use our fuzzy ID3 method to find the regularity of the web log-file. The training data of this experiment was described in Chapter 3. After training, our method generates a fuzzy decision tree and the membership functions of each continuous attribute. From the fuzzy decision tree we extract a set of fuzzy rules. When testing, we use the fuzzy rule-base to classify all the examples in the log-file.

To validate the effectiveness of the rule set, we also use all of the training data as the testing data. Our program interface is shown in Fig. 4.1. Under majority-class classification, as discussed in Section 3.3, the best accuracy we can obtain is 58.99%. The learning accuracy is poor because of the high inconsistency in the data set: there exist persons with exactly the same attributes whose favorite classes are different. With our proposed method, the majority class accuracy after training is 48.69%, that is, 82.54% relative to the best achievable majority-class accuracy. The classification result is shown in Fig. 4.2. According to the rule credits shown in Fig. 4.3, if the pruning threshold is chosen to be –10, only one redundant rule is pruned from the rule-base, and the accuracy is reduced to 47.35%. To maintain the accuracy, we do not prune any rule in this experiment. Finally, the rule table is shown in Fig. 4.4, and the web master can maintain and improve the website according to the rules we have obtained.
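The relative accuracy quoted above is simply the ratio of the two percentages. A short check, with variable names chosen here for illustration:

```python
best_majority_acc = 58.99   # best accuracy under majority-class classification (%)
achieved_acc = 48.69        # accuracy obtained after training (%)

# Accuracy expressed relative to the best achievable value, in percent.
relative_acc = round(100 * achieved_acc / best_majority_acc, 2)
```

Rounded to two decimals, the ratio agrees with the 82.54% stated in the text.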

Fig. 4.1. Program interface.

Fig. 4.2. The result of classifying.

Fig. 4.3. The rule credits.

[Plot axis labels: credit, rule_id.]

Fig. 4.4. The rule table.

There are 12 rules in the rule table, and the result of feature ranking is {Age, Spend time, Occupation, Income, Education, Sex}. The theorem of feature ranking
