
National Central University

Department of Computer Science and Information Engineering, Master's Thesis

Association-Based Classification Using the Chi-Square Independence Test (應用卡方獨立性檢定於關連式分類問題)

Advisor: Dr. 張嘉惠    Graduate Student: 張毓美

June 2002


National Central University Library — Master's/Doctoral Thesis Authorization Form

(latest revision, May 2002)

The full text and electronic file of the thesis covered by this authorization form were written by me as a master's/doctoral thesis at National Central University. (Please check one of the following:)

( ✓ ) Agree (release immediately)

(   ) Agree (release after one year), because:

(   ) Agree (release after two years), because:

(   ) Disagree, because:

I grant National Central University Library and the National Central Library a non-exclusive, royalty-free license — for the purpose of promoting "resource sharing and mutual cooperation" among readers and of giving back to society and academic research — to collect, reproduce, and distribute this thesis, without limitation of place, time, or number of copies, in paper, optical disc, network, or any other form, and to sublicense others to reproduce and use it by any means, in order to provide readers with personal, non-profit online retrieval, browsing, downloading, or printing.

Student signature: 張毓美

Thesis title: 應用卡方獨立性檢定於關連式分類問題 (Association-Based Classification Using the Chi-Square Independence Test)    Advisor: 張嘉惠

Department: Institute of Computer Science and Information Engineering    Degree: ( ) Ph.D. (✓) Master's program    Student ID: 89522034

Date: July 12, 2002

Notes:

1. Fill out and sign this form, and bind it as the page following the cover of each paper copy of the thesis (in the electronic full-text copy, the signature on the authorization form may be typed).

2. Print one additional copy of this form, fill it out and sign it, and submit it to the library when completing leave-school procedures (the library will forward it to the National Central Library).

3. Readers who retrieve, browse, download, or print the above thesis online for personal, non-profit purposes must comply with the relevant provisions of the Copyright Act.


Abstract (in Chinese)

Classification has long been a central problem in machine learning. In recent years, with the rise of association rule mining techniques, more and more studies have applied them to classification. In this thesis, we study several association-based classification methods and propose a new one, called ACC (Association-based Classification using the Chi-square independence test). ACC uses association rule mining to find all frequent and interesting itemsets, which capture the relations between attributes. In addition, ACC applies the chi-square independence test to examine the relations between attributes and classes, keeping only the class-related frequent itemsets for prediction. We conduct experiments on 13 datasets from the UCI machine learning repository, comparing our method (ACC) with NB and LB, two efficient and accurate classifiers. The experimental results show that our method outperforms NB and LB on most of the datasets and is itself an efficient and accurate classification method.


Abstract

For many years, classification has been one of the key problems in machine learning research. Since association rule mining is an important and highly active area of data mining research, more and more classification methods are based on association rule mining techniques. In this thesis, we study several association based classification methods and provide a comparison of these classifiers. We present a new method, called ACC (i.e., Association based Classification using the Chi-square independence test), to solve the classification problem.

ACC finds frequent and interesting itemsets, which describe the relations between attributes.

Moreover, it applies the chi-square independence test to retain class-related itemsets for predicting new data objects. In addition, ACC provides an approach that considers the probability of missing value occurrence to handle missing values. We evaluate our method on 13 datasets from the UCI machine learning repository. We compare ACC with NB and LB, two state-of-the-art classifiers, and the experimental results show that our method is a highly efficient and accurate classifier.


Acknowledgments

First, I would like to thank my advisor, Professor 張嘉惠, for her careful guidance and instruction, as well as Professors 陳國棟 and 洪炯宗 for their teaching. During my two years in the database laboratory, I learned the spirit and methods of doing research and was able to complete this thesis. I also thank the oral examination committee members, Professors 陳彥良, 陳銘憲, 許鈞南, and 顏秀珍, for taking the time to review the thesis and for offering suggestions that made it more complete.

In addition, special thanks go to Director 周芳秀 of the library for her care and encouragement. I also thank my classmates 士賢, 東軒, 龍凱, 雍益, 仁皓, 良一, 廉勳, 國炎, 世賢, 傳正, 晶月, and 惠玲 — over these two years we encouraged and supported one another, and I felt the warmth and cohesion of the laboratory. Thanks as well to the junior students 智凱, 國瑜, 志龍, 聰鑫, and 釋謙 for their help during the past year, and to everyone who accompanied me through graduate school.

Finally, I thank my parents, family, and friends for their support and encouragement, which gave me a stable environment in which to complete my studies.

I dedicate this honor to all the family and friends who care about me.

Database Systems Laboratory, Institute of Computer Science and Information Engineering, National Central University — 張毓美, July 2002


Contents

1 Introduction 1

1.1 Association Rule Mining . . . 1

1.2 Concepts of Association Based Classification . . . 2

1.3 Method and Goal . . . 2

1.4 Organization of the Thesis . . . 3

2 Related Work 4

2.1 NB - Naive Bayes Classifier . . . 4

2.2 LB - Large Bayes Classifier . . . 5

2.3 CBA - Classification Based on Associations . . . 6

2.4 CMAR - Classification Based on Multiple Association Rules . . . 7

2.5 Comparison . . . 7

2.6 Summary . . . 8

3 Classification method 10

3.1 Learning Phase . . . 12

3.1.1 Discovering Frequent Itemsets . . . 12

3.1.2 Discovering Interesting Itemsets . . . 12

3.1.3 Discovering Class-Related Itemsets . . . 13

3.2 Classification Phase . . . 14

3.3 Learning Algorithm . . . 15

3.4 Classification Algorithm . . . 17

3.5 Zero Counts Smoothing . . . 19

4 Experimental Results and Discussion 20

4.1 Parameter Setting . . . 21

4.2 Experimental Results . . . 22

4.3 Discussion . . . 22

4.3.1 The Effect of Missing Value . . . 22

4.3.2 The Effect of Parameter Setting . . . 26

5 Conclusion and Future Work 31


List of Figures

2.1 The comparison of four classifiers . . . 9

3.1 Possible product approximations of P (a1, a2, a3) . . . 11

3.2 The model built by two approximations. . . 11

3.3 Contingency table for itemset X . . . 14

3.4 Incremental construction of a product approximation for P (a1, . . . , a5, ci) . . 15

3.5 The Learner algorithm . . . 16

3.6 The Classifier algorithm . . . 18

4.1 Description of datasets . . . 21

4.2 Average accuracy . . . 23

4.3 Accuracy for various values of the interestingness threshold when minimum support is set to 1% . . . 24

4.4 Accuracy for various values of the interestingness threshold when minimum support is set to 0 . . . 25

4.5 The effect of missing value . . . 27

4.6 The effect of missing value ( Cont.) . . . 28

4.7 Number of itemsets used by ACC in different interestingness . . . 29

4.8 The effect of interestingness threshold . . . 29

4.9 The effect of minimum support threshold . . . 30


List of Tables

2.1 Meaning of the symbols used . . . 8


Chapter 1 Introduction

Classification is one of the key problems in data mining and machine learning research.

Given a training set where each data object has a class label, the task is to build a model, called a classifier, to predict the class labels of new data objects. In previous studies, many techniques have been devised to build accurate classifiers, for example, decision trees, Bayesian classifiers, and neural networks.

Association rule mining is an important and highly active area of data mining research.

The techniques of frequent itemset discovery in association rule mining are widely used in many applications. Since association rules and traditional classification share some common concepts, several recent studies apply these concepts to integrate association rule mining techniques into the traditional classification problem. In this thesis, we call such an integrated framework association based classification.

1.1 Association Rule Mining

Association rule mining extracts a set of rules that satisfy user-specified minimum support and minimum confidence constraints. It consists of two steps: the first step is frequent itemset discovery, and the second step is association rule discovery. Let I = {i1, i2, . . . , im} be a set of items. Let a database D be a set of transactions, where each transaction T is a subset of I. If an itemset X is contained in a transaction T (X ⊆ T), the transaction T is said to support X. If the number of transactions that support an itemset X is equal to or greater than the minimum support threshold, X is called a frequent itemset. An association rule X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅, has support s and confidence


c if s% of the transactions contain X ∪ Y and c% of the transactions that contain X also contain Y.
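As an illustration (not from the thesis), the two measures can be sketched in Python over a toy transaction database:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(X, Y, transactions):
    """Confidence of the rule X -> Y, i.e., support(X u Y) / support(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

# Hypothetical database of four transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(support({"a", "b"}, transactions))       # 0.5
print(confidence({"a"}, {"b"}, transactions))  # 0.666...
```

With a minimum support of 50%, the itemset {a, b} above is frequent; the rule a → b holds with confidence 2/3.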

1.2 Concepts of Association Based Classification

In association based classification, we are interested in a particular kind of association rule called class association rules (CARs). Let A = {a1, a2, . . . , am} be the set of attributes and C = {c1, c2, . . . , cn} be the set of class labels, with A ∩ C = ∅; the items are drawn from the union of A and C. Class association rule mining extracts rules X → c that satisfy the support and confidence thresholds, where X is a set of attributes (X ⊆ A) and c is a class label (c ∈ C). Note that the itemset X in a class association rule X → c is a frequent itemset. To discover class association rules, we first discover non-class frequent itemsets.

From the perspective of probability, the support of an itemset X is the probability that X occurs, and the support of an association rule is simply the probability that the rule holds. The confidence of an association rule X → Y is the conditional probability of Y given that X has occurred. Classifying an object X is equivalent to computing the conditional probability P(c|X) for each class c, i.e., the confidence of the class association rule X → c.

Therefore, class association rule mining can be applied to classification based on the theory of probability. The simplest probability-based classifier is the Naive Bayes (NB) classifier [2]. Research on association based classification, including Large Bayes (LB) [7], CBA [6], and CMAR [5], will be discussed in Chapter 2.

1.3 Method and Goal

In this thesis, we propose a classifier called ACC (i.e., Association based Classification using the Chi-square independence test). ACC is similar to LB in mining associations between attributes. LB uses association mining to find frequent and interesting itemsets and applies several heuristics to select itemsets for probability approximation. Although LB is an accurate and efficient classifier, its heuristic conditions for classifying a new data object are complicated. Therefore, we propose a simplified method that solves the classification problem while retaining high accuracy and high performance.

Our method, ACC, uses the chi-square independence test to analyze the relations between


attributes and classes, and then chooses class-related itemsets for probability approximation.

We conduct several experiments on 13 datasets from the UCI machine learning dataset repository [8]. The experimental results show that our method is more accurate than NB and LB.

1.4 Organization of the Thesis

The rest of this thesis is organized as follows. Chapter 2 reviews related work on association based classification and discusses the differences among these classifiers. Chapter 3 presents our method in detail. Chapter 4 presents the experimental results of ACC and compares its accuracy with LB and NB. The thesis is concluded in Chapter 5.


Chapter 2

Related Work

Data classification is a two-step process. In the first step, a model is built to describe a set of training data. In the second step, the model is used for classification. To compare various algorithms, one can measure predictive accuracy, speed, robustness, scalability, etc. In this chapter, we describe two recent approaches extended from association mining.

The first approach is based on class association rules. The earliest is CBA, proposed by Bing Liu, Wynne Hsu, and Yiming Ma in 1998 [6]; CMAR is an extension proposed by Wenmin Li, Jiawei Han, and Jian Pei in 2001 [5]. CBA builds a classifier that is simply a set of class association rules. CMAR extends CBA by considering multiple rules for a new data object. The second approach is based on frequent itemsets, from which interesting ones are sifted for probability prediction. LB is the representative work of this approach. Since LB is an extension of NB that applies association mining to find long itemsets, we first give an introduction to Naive Bayes and then compare the recent studies.

2.1 NB - Naive Bayes Classifier

Let C = {c1, c2, . . . , cm} be the set of class labels, and let each data object have a set of attribute values X = {a1, a2, . . . , an}. NB classifies a new data object X to the class ci that has the highest posterior probability P(ci|X).

According to Bayes' theorem, P(H|X) = P(X|H)P(H)/P(X), so the posterior probability P(ci|X) is equivalent to P(ci)P(X|ci)/P(X). Since the denominator P(X) is the same for every class ci, the calculation of P(ci|X) can be simplified to calculating P(ci)P(X|ci). NB assumes that the effect of an attribute value on a given class is independent of the values of


the other attributes. Therefore, P (ci|X) can be approximated by the following formula:

P(ci|X) = P(ci)P(a1, a2, . . . , an|ci) = P(ci)P(a1|ci)P(a2|ci) · · · P(an|ci) = P(ci) ∏_{j=1}^{n} P(aj|ci)    (2.1)

The independence assumption makes the calculation much easier.

Learning phase: Given a training set consisting of a set of attributes A = {a1, a2, . . . , an} and a set of class labels C = {c1, c2, . . . , cm}, calculate P(ci) and P(aj|ci). The classifier records all of these probabilities for the next phase.

Testing phase: Given a new object X = {a1, a2, . . . , ak}, use Equation 2.1 to calculate P(ci|X) for each ci, and classify the object to the class with the highest P(ci|X).
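The two phases can be sketched in Python. This is a minimal illustration of Equation 2.1 on hypothetical weather-style data, not the thesis implementation; note that a zero count for any P(aj|ci) zeroes the whole product, which is exactly the problem addressed by zero-count smoothing later in the thesis.

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (attribute-value tuple, class label) pairs."""
    class_counts = Counter(c for _, c in examples)
    # cond_counts[c][j][v] = number of class-c examples with value v at attribute j
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, c in examples:
        for j, v in enumerate(attrs):
            cond_counts[c][j][v] += 1
    return class_counts, cond_counts, len(examples)

def classify_nb(x, model):
    """Pick the class maximizing P(ci) * prod_j P(aj|ci) (Equation 2.1)."""
    class_counts, cond_counts, n = model
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                               # P(ci)
        for j, v in enumerate(x):
            score *= cond_counts[c][j][v] / cc       # P(aj | ci), unsmoothed
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical training data: (outlook, temperature) -> play?
data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "cool"), "yes")]
model = train_nb(data)
print(classify_nb(("rain", "mild"), model))  # 'yes'
```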

2.2 LB - Large Bayes Classifier

LB [7] is an extension of NB. It applies association mining to find frequent and interesting itemsets, relaxing the assumption of attribute independence. For example, suppose we know that a1, a2, a3 are dependent but are independent of a4 and a5. To approximate P(a1, a2, a3, a4, a5|ci), we can compute the product P(a1, a2, a3|ci)P(a4|ci)P(a5|ci).

The main idea of LB is to find the relations between attributes using frequent itemset discovery. In order to avoid a huge number of frequent itemsets, LB adopts a variation of the cross entropy I_{P−P′} between two probability distributions P and P′ as an interestingness measure to prune itemsets.

How do we approximate P(ci|X) from a set of interesting frequent itemsets? This depends on which independence assumptions are considered first. For the set of interesting frequent itemsets {a2, a5}, {a1, a2, a3}, and {a1, a4, a5}, the following are two valid product approximations of P(a1, a2, a3, a4, a5|ci), depending on which itemset is considered first.

1. {a1, a2, a3}, {a1, a4, a5} ⇒ P (a1, a2, a3, ci)P (a4, a5|a1, ci)

2. {a1, a2, a3}, {a2, a5}, {a1, a4, a5} ⇒ P (a1, a2, a3, ci)P (a5|a2, ci)P (a4|a1, a5, ci)

Therefore, the order in which frequent itemsets are selected for the product is important. LB uses four conditions to determine the order of the itemsets. When the selected itemsets are all of size one, LB collapses to NB. The computation of the interestingness measure and the selection of itemsets make LB much more expensive than NB.


Learning phase: Given a training set consisting of a set of attributes A = {a1, a2, . . . , an} and a set of class labels C = {c1, c2, . . . , cm}, discover the set of interesting and frequent itemsets, called FI, from the attribute set (i.e., each itemset in FI consists of attribute items only). Each itemset in FI records its interestingness value and its count in each class. The classifier keeps FI for the next phase.

Testing phase: Order the frequent itemsets in FI according to the four conditions specified in [7]. Approximate P(a1, a2, . . . , ak, ci) for each ci using FI according to that order. Calculate all the products, and then classify the object to the class with the highest value.

2.3 CBA - Classification Based on Associations

The idea of CBA [6] is to generate a set of class association rules (CARs) as a classifier. A class association rule (called a ruleitem) is of the form <condset, c>, which represents a rule condset → c, where condset is a set of attribute items and c is a class label. If a new data object X matches the condset (i.e., X ⊇ condset), then X is classified to class c.

CBA consists of two parts, a rule generator (called CBA-RG) and a classifier builder (called CBA-CB). The rule generator is based on the Apriori algorithm and finds all ruleitems. Two ruleitems may have the same condset but different c. The classifier builder ranks the rules and determines the default class for the classifier as follows.

Rule Rank: Given two rules r and r′, we write r ≻ r′ (r has a higher rank than r′) if one of the following conditions holds:

1. the confidence of r is greater than that of r′, or

2. their confidences are the same, but the support of r is greater than that of r′, or

3. both the confidences and supports of r and r′ are the same, but the size of r is smaller than that of r′.
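These three conditions amount to a lexicographic comparison, which can be expressed as a comparator; the rule representation below (dicts with `conf`, `sup`, and `condset` fields) is a hypothetical encoding, not CBA's data structure:

```python
from functools import cmp_to_key

def rule_rank(r1, r2):
    """Negative if r1 ranks higher than r2 under CBA's ordering:
    higher confidence first, then higher support, then smaller antecedent."""
    if r1["conf"] != r2["conf"]:
        return -1 if r1["conf"] > r2["conf"] else 1
    if r1["sup"] != r2["sup"]:
        return -1 if r1["sup"] > r2["sup"] else 1
    return len(r1["condset"]) - len(r2["condset"])

rules = [
    {"condset": {"a1", "a2"}, "cls": "c1", "conf": 0.9, "sup": 0.10},
    {"condset": {"a3"},       "cls": "c2", "conf": 0.9, "sup": 0.15},
    {"condset": {"a1"},       "cls": "c1", "conf": 0.8, "sup": 0.30},
]
ranked = sorted(rules, key=cmp_to_key(rule_rank))
print([r["cls"] for r in ranked])  # ['c2', 'c1', 'c1']
```

The two 0.9-confidence rules tie on condition 1, so the one with higher support wins under condition 2.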

Default Class: Let r be a ruleitem, D be the training data, and Dr be the set of data objects that support rule r. The default class is the majority class in the remainder of D after removing Dr.

Learning phase: Given a training set consisting of a set of attributes A = {a1, a2, . . . , an} and a set of class labels C = {c1, c2, . . . , cm}, mine the class association rules with the highest confidence for each class. All rules are ranked by confidence, support, and the size of the


itemset. The classifier is of the form <r1, r2, . . . , rn, default class>.

Testing phase: Given a new object A = {a1, a2, . . . , ak}, find the first matching rule and classify the object to that rule's class.

2.4 CMAR - Classification Based on Multiple Association Rules

As the name implies, CMAR [5] is similar to CBA but considers the multiple association rules matched by a new data object. If the matching rules are not consistent in their class labels, CMAR divides the rules into groups according to class label; all rules in a group share the same class label. Each matching rule has a χ² value. For a rule R : P → c, let sup(c) be the number of data objects in the training dataset associated with class label c and |T| the number of data objects in the training dataset. CMAR defines maxχ² to compute the upper bound of χ² for rule R as follows:

maxχ² = ( min{sup(P), sup(c)} − sup(P)sup(c)/|T| )² · |T| · e    (2.2)

where

e = 1/(sup(P)sup(c)) + 1/(sup(P)(|T| − sup(c))) + 1/((|T| − sup(P))sup(c)) + 1/((|T| − sup(P))(|T| − sup(c)))    (2.3)

For each group, CMAR sums up the χ² values of the rules in the group. The group with the highest χ² value is used to predict the class of the new data object.
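Equations 2.2 and 2.3 can be checked with a short sketch; the counts in the usage line are made up for illustration:

```python
def max_chi2(sup_P, sup_c, T):
    """Upper bound of chi-square for a rule P -> c (Equations 2.2 and 2.3).

    sup_P, sup_c: absolute counts of training objects matching P and labeled c;
    T: total number of training objects."""
    e = (1.0 / (sup_P * sup_c)
         + 1.0 / (sup_P * (T - sup_c))
         + 1.0 / ((T - sup_P) * sup_c)
         + 1.0 / ((T - sup_P) * (T - sup_c)))
    return (min(sup_P, sup_c) - sup_P * sup_c / T) ** 2 * T * e

# Hypothetical counts: 30 objects match P, 40 carry class c, 100 objects total.
print(round(max_chi2(30, 40, 100), 2))  # 64.29
```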

Learning phase: similar to CBA, except that the rules are not ranked.

Testing phase: Given a new object A = {a1, a2, . . . , ak}, find the set of matching rules and divide them into groups according to class labels. Use the weighted χ² to analyze the groups and choose the group with the highest χ² value.

2.5 Comparison

In this section, we compare the four algorithms NB, LB, CBA, and CMAR in Figure 2.1. We compare the space and time complexity of the learning phase. For the testing phase, we compare the performance, classifier complexity, and accuracy. The symbols in Figure 2.1 are described as follows:


Symbol | Description
c      | the number of classes
I      | the set of total attribute items
Lk     | the set of frequent k-itemsets
Ck     | the set of candidate k-itemsets
FI     | the set of total frequent itemsets
CAR    | the set of class association rules

Table 2.1: Meaning of the symbols used

Figure 2.1 shows the worst case for each classifier. In the learning phase, NB uses the least space since it only records the count of each item. LB uses the most, since the probability approximation may use the information of Lk−1 and Lk−2. CBA and CMAR both generate Ck from Lk−1 for each length k. In terms of speed, NB scans the database only once while the others scan it k times.

In the testing phase, NB records all attribute items for each class. LB holds the counts of all frequent and interesting itemsets for each class, plus extra space for the interestingness value of each itemset. CBA and CMAR store the class association rules generated in the learning phase. The time complexity of each method depends on the space it takes, since the worst case is searching the whole space.

In terms of classifier complexity, NB is the simplest classifier: it multiplies all the conditional probabilities for each class. The next is CBA, which finds the first matching rule and then predicts. The third is CMAR, since it must consider multiple matching rules. LB is more complicated than the others, since it uses four heuristic conditions.

The accuracy comparison is drawn from the experimental results presented in the literature. LB outperforms NB and CBA as shown in [7], and CMAR is more accurate than CBA as presented in [5]. CBA is better than NB as described in [9]. LB and CMAR have not been compared in the literature.

2.6 Summary

In this section, we summarize the comparison of time complexity, space complexity, accuracy, and classifier complexity for the four algorithms. From Figure 2.1, we find that considering


Phase                           | NB        | LB               | CBA            | CMAR
Learning phase: Space           | I         | Lk−2 + Lk−1 + Ck | Lk−1 + Ck      | Lk−1 + Ck
Learning phase: Time            | 1 DB scan | k DB scans       | k DB scans     | k DB scans
Testing phase: Space            | cI        | (c + 1)FI        | CAR            | CAR
Testing phase: Time             | O(I)      | O(FI)            | O(CAR)         | O(CAR)
Testing phase: Classifier complexity | easiest | most complex  | easy           | middle
Accuracy                        | lowest    | better than CBA  | better than NB | better than CBA

Figure 2.1: The comparison of four classifiers

the relationship between items achieves higher accuracy. NB is the simplest and most efficient classifier but has the lowest accuracy, since it assumes all attributes are independent.

Second, CBA is also a simple classifier, since it considers only one matching rule and prefers high-confidence rules; usually, high-confidence rules have low support. Therefore, CMAR considers multiple matching rules to achieve higher accuracy. Last, LB achieves high accuracy on many datasets, but its testing phase is the most complex, since it uses four heuristic conditions that are not easy to understand.


Chapter 3

Classification method

Our proposed framework, ACC, is a modified LB classifier. There are two differences between LB and ACC. The first is the approach to dealing with missing values. LB ignores the occurrence of missing values: if the value of an attribute is unknown, LB removes that value from the data object. Consequently, the size of each data object varies with the number of missing values. Our method considers the probability of missing value occurrence: if an attribute has a missing value, we treat the missing value as an item. Therefore, the sizes of all data objects are equal.

Second, LB assumes that not all attributes are independent. It discovers frequent and interesting itemsets to represent the relations between attributes, and applies several heuristics to determine the order of itemsets for the probability approximation of new data objects.

The probability approximation follows [4], which approximates higher order probability distributions by their lower order components. For example, {a1, a2}, {a1, a3}, and {a2, a3} are three subsets of {a1, a2, a3}. Suppose {a1, a2} and {a2, a3} are considered in turn; we can then approximate P(a1, a2, a3) by P(a1, a2)P(a3|a2). If, on the other hand, {a1, a3} and {a2, a3} are considered first, we can approximate P(a1, a2, a3) by P(a1, a3)P(a2|a3). With different selection orders, we obtain different approximations of P(a1, a2, a3) (see Figure 3.1).

Different selection orders produce different combinations of itemsets for the approximation, and different combinations of itemsets determine different independence assumptions between attributes. Figure 3.2 shows the models built by two of the approximations.

Consider a bigger example: suppose the available set for approximating P(a1, . . . , a5) is {{a1, a2}, {a2, a3}, {a3, a4}, {a3, a5}, {a2, a4, a5}}. Two possible approximations are shown


Selection order ⇒ product approximation

1. {a1, a2}, {a2, a3} ⇒ P(a1, a2)P(a3|a2)
2. {a1, a2}, {a1, a3} ⇒ P(a1, a2)P(a3|a1)
3. {a1, a3}, {a1, a2} ⇒ P(a1, a3)P(a2|a1)
4. {a1, a3}, {a2, a3} ⇒ P(a1, a3)P(a2|a3)
5. {a2, a3}, {a1, a2} ⇒ P(a2, a3)P(a1|a2)
6. {a2, a3}, {a1, a3} ⇒ P(a2, a3)P(a1|a3)

Figure 3.1: Possible product approximations of P(a1, a2, a3)

[Figure content not recoverable from the extracted text: two node-link diagrams, labeled Approximation 1 and Approximation 2, over attributes a1-a5.]

Figure 3.2: The model built by two approximations.

as follows.

1. {a1, a2}, {a2, a3}, {a3, a4}, {a3, a5} ⇒ P(a1, a2)P(a3|a2)P(a4|a3)P(a5|a3)

2. {a1, a2}, {a2, a3}, {a3, a4}, {a2, a4, a5} ⇒ P(a1, a2)P(a3|a2)P(a4|a3)P(a5|a2, a4)

In approximation 1, {a2, a4, a5} is not selected because after choosing {a3, a5} all items are covered. Conversely, when {a2, a4, a5} is selected first, {a3, a5} never appears, as in approximation 2. The models built by the two product approximations are shown in Figure 3.2.

An edge between two nodes represents that they are not independent. The selection order directly affects the independence assumptions between attributes. For example, a3 and a5 are assumed to be independent in approximation 1, but they are considered associated in approximation 2.

Since different selection orders produce different prediction results, we propose a χ² testing method that considers both the relations between attribute items and the dependency between attribute itemsets and classes.


3.1 Learning Phase

The goal of the learning phase is to determine the relationships and dependencies between attributes. It extends Apriori [1] to discover frequent and interesting itemsets, which represent the relationships between items. At the same time, our algorithm checks the independence of each itemset from the classes by the chi-square independence test.

3.1.1 Discovering Frequent Itemsets

Each data object in the training set, called training example, contains attributes and a class label. All attributes are assumed to be discrete. Continuous attributes can be discretized into intervals using standard discretization algorithms [3]. The resulting intervals are mapped into distinct values. Each possible attribute-value pair is treated as an item.

Let D be the set of training examples and C = {c1, . . . , cm} the set of classes. Each training example is treated as a transaction that contains n non-missing attribute values and a class label; that is, each transaction is represented by an attribute itemset labeled with a class: {a1, . . . , an, ci}. An itemset X has support s for class ci in D (denoted by X.supi = s) if s% of the transactions in D contain both ci and X. The support of X in D, denoted by X.sup, is defined as the sum of the supports X.supi over all classes in C. An itemset is called frequent if its support X.sup satisfies the user-specified minimum support (called min sup).

Notice that X.supi is the observed probability P(X, ci) and X.sup is the observed probability P(X).

3.1.2 Discovering Interesting Itemsets

Long itemsets are obviously preferred for classification, as they provide more information about the relations between attribute items. In order to find longer itemsets, the minimum support must be small, but discovering frequent itemsets with a small minimum support usually generates a huge number of patterns. Since the probability of a long itemset can be estimated from its sub-itemsets, a long itemset can be omitted if its approximated probability is close to its actual probability. The interestingness measure is defined as follows.


Let X be an itemset of size |X|. Let Xj and Xk be two (|X|−1)-itemsets obtained from X by removing the j-th and the k-th item, respectively. We use P(Xj, ci) and P(Xk, ci) to estimate P(X, ci). The estimate Pj,k(X, ci) is simply the product approximation in [4]:

Pj,k(X, ci) = P(Xj, ci)P(Xk − Xj | Xj ∩ Xk, ci) = P(Xj, ci)P(Xk, ci) / P(Xj ∩ Xk, ci)    (3.1)

The interestingness Ij,k of an itemset X with respect to Xj and Xk is defined as the error between the estimate Pj,k(X, ci) and the actual probability P(X, ci). If the difference is small, the long itemset can be estimated by its subsets. In order to measure the difference between the two probabilities, we use the cross entropy from information theory:

Ij,k = Σ_{ci} | P(X, ci) log( P(X, ci) / Pj,k(X, ci) ) |    (3.2)

When P(X, ci) = Pj,k(X, ci), Ij,k becomes zero; if Ij,k tends to zero, the itemset X is regarded as uninteresting. On the other hand, if the actual probability P(X, ci) is greater or less than the estimate Pj,k(X, ci), the itemset X should be preserved, since it cannot be estimated accurately.

Since there are C(|X|, 2) pairs of different Xj and Xk, the interestingness I(X) of an itemset X is defined as the average of the Ij,k over all pairs:

I(X) = ( 2 / (|X|(|X| − 1)) ) Σ_{j=1}^{|X|−1} Σ_{k=j+1}^{|X|} Ij,k    (3.3)
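Equations 3.1-3.3 can be sketched as follows. The probability function `p` is a stand-in for the class counters kept by the learner (an assumption of this sketch), and the independent distribution in the usage line is made up to show that I(X) vanishes when the estimate is exact:

```python
import math
from itertools import combinations

def interestingness(X, classes, p):
    """Average interestingness I(X) over all pairs (Xj, Xk) -- Equations 3.1-3.3.

    p(itemset, ci) must return the observed probability P(itemset, ci)."""
    X = list(X)
    total, pairs = 0.0, 0
    for j, k in combinations(range(len(X)), 2):
        Xj = frozenset(X[:j] + X[j + 1:])      # X minus the j-th item
        Xk = frozenset(X[:k] + X[k + 1:])      # X minus the k-th item
        I_jk = 0.0
        for ci in classes:
            actual = p(frozenset(X), ci)
            denom = p(Xj & Xk, ci)
            est = p(Xj, ci) * p(Xk, ci) / denom if denom else 0.0  # Eq. 3.1
            if actual > 0 and est > 0:
                I_jk += abs(actual * math.log(actual / est))       # Eq. 3.2
        total += I_jk
        pairs += 1
    return total / pairs                       # Eq. 3.3: average over all pairs

# If the items are mutually independent, the estimate equals the actual
# probability and the itemset is uninteresting: I(X) = 0.
p_indep = lambda S, ci: 0.5 ** len(S) * 0.5
print(interestingness(["a1", "a2", "a3"], ["c1", "c2"], p_indep))  # 0.0
```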

3.1.3 Discovering Class-Related Itemsets

Frequent and interesting itemsets are not necessarily related to the classes. We use the χ² independence test to check whether an itemset is independent of the classes; if it is, we prune that itemset. Let |D| be the total number of transactions in the training data and C = {c1, . . . , ck} a set of classes. The contingency table for an itemset X is a 2 × k table, as shown in Figure 3.3.

The chi-square independence test for an itemset X with class set C is defined as follows:

χ² = Σ_{i=1}^{k} (P(X, ci) − E_{X,ci})² / E_{X,ci} + Σ_{i=1}^{k} (P(¬X, ci) − E_{¬X,ci})² / E_{¬X,ci}    (3.4)


               | c1        | . . . | ci        | . . . | ck        | Row total
X occurred     | P(X, c1)  | . . . | P(X, ci)  | . . . | P(X, ck)  | P(X)
X not occurred | P(¬X, c1) | . . . | P(¬X, ci) | . . . | P(¬X, ck) | P(¬X)
Column total   | P(c1)     | . . . | P(ci)     | . . . | P(ck)     | |D|

Figure 3.3: Contingency table for itemset X

where E_{X,ci} and E_{¬X,ci} are the expected values:

E_{X,ci} = P(X)P(ci) / |D|,    E_{¬X,ci} = P(¬X)P(ci) / |D|    (3.5)

An itemset that passes the critical value of the chi-square independence test is called a class-related itemset. If the itemsets selected in the product approximation are all 1-itemsets, the probability collapses to Naive Bayes. Therefore, all 1-itemsets are assumed frequent, interesting, and class-related in the learning phase.
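Equations 3.4 and 3.5 amount to the standard chi-square statistic on a 2 × k table of counts; a minimal sketch (not the thesis code), with made-up counts in the usage lines:

```python
def chi2_statistic(counts_X, counts_notX):
    """Chi-square statistic for a 2 x k contingency table (Equations 3.4-3.5).

    counts_X[i]    = number of objects containing itemset X with class ci,
    counts_notX[i] = number of objects without X with class ci."""
    D = sum(counts_X) + sum(counts_notX)
    row_X, row_notX = sum(counts_X), sum(counts_notX)
    chi2 = 0.0
    for obs_x, obs_nx in zip(counts_X, counts_notX):
        col = obs_x + obs_nx                 # column total: count of class ci
        e_x = row_X * col / D                # E_{X,ci}  (Equation 3.5)
        e_nx = row_notX * col / D            # E_{notX,ci}
        chi2 += (obs_x - e_x) ** 2 / e_x + (obs_nx - e_nx) ** 2 / e_nx
    return chi2

# Perfectly class-related: X occurs only with c1 -> large statistic.
print(chi2_statistic([50, 0], [0, 50]))    # 100.0
# Independent of the class -> statistic is 0, so the itemset is pruned.
print(chi2_statistic([25, 25], [25, 25]))  # 0.0
```

The learner would compare this statistic against the critical value δ for k − 1 degrees of freedom.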

3.2 Classification Phase

Let FI denote the resulting set of itemsets found in the learning phase. To classify a new data object A = {a1, . . . , an}, our classifier approximates P(A, ci) for each class ci and assigns the object to the class with the highest value of P(A, ci). Since long itemsets provide more information about the higher order interactions between items, we first select the maximal itemsets that are subsets of A and denote this set by AS. The itemsets in AS are then sorted by their χ² values from high to low. The product approximation of A is created incrementally by adding one itemset from AS at a time until no more itemsets can be added. An item included in the product approximation is said to be covered. An itemset is not added to the solution if all of its items are already covered.

Figure 3.4 shows the incremental construction of a product approximation for P (A, ci).

Suppose a new data object A = {a1, . . . , a5}. Initially, the itemsets in the available set are sorted by their χ² values and the non-covered set is set to A (i.e., no item is covered in the product approximation solution). An itemset in the available set is inserted into the solution if it contains at least one item in the non-covered set. The first itemset considered is {a2a5}. Because items a2 and a5 are both in the non-covered set, the items a2 and a5 are removed


Step | Non-covered set      | Itemset selected | Product approximation           | Available itemset list
0    | {a1, a2, a3, a4, a5} | ∅                | N/A                             | {{a2a5}, {a2a4}, {a1a2a3}, {a3a4}, {a1a4a5}}
1    | {a1, a3, a4}         | {a2a5}           | P(a2a5ci)                       | {{a2a4}, {a1a2a3}, {a3a4}, {a1a4a5}}
2    | {a1, a3}             | {a2a4}           | P(a2a5ci)P(a4|a2ci)             | {{a1a2a3}, {a3a4}, {a1a4a5}}
3    | ∅                    | {a1a2a3}         | P(a2a5ci)P(a4|a2ci)P(a1a3|a2ci) | {{a3a4}, {a1a4a5}}

Figure 3.4: Incremental construction of a product approximation for P(a1, . . . , a5, ci)

from the non-covered set. The itemset {a2a5} is then removed from the available itemset list and inserted into the solution. Now the first itemset in the available list is {a2a4}. Since item a4 is in the non-covered set, the itemset {a2a4} is selected and item a4 is removed from the non-covered set. Note that since item a2 appeared in a previous product, it is placed in the conditional part of the product. In the third iteration, itemset {a1a2a3} is selected. After removing items a1 and a3 from the non-covered set, all items in A are covered and the construction of the product approximation is done. As a result, P(A, ci) is computed using the product approximation P(a2a5ci)P(a4|a2ci)P(a1a3|a2ci).
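The greedy construction of Figure 3.4 can be sketched as follows; the χ² values below are made up but reproduce the same selection order as the figure:

```python
def product_approximation(A, itemsets, chi2):
    """Greedily build the factors of a product approximation for object A.

    itemsets: class-related itemsets that are subsets of A (found in learning);
    chi2: maps each itemset to its chi-square value.
    Returns a list of (new items, conditioning items) factors."""
    available = sorted(itemsets, key=chi2.get, reverse=True)
    non_covered = set(A)
    solution = []
    for s in available:
        new = s & non_covered
        if not new:              # all items of s already covered: skip s
            continue
        solution.append((new, s - new))   # covered items go to the condition
        non_covered -= new
        if not non_covered:
            break
    return solution

A = {"a1", "a2", "a3", "a4", "a5"}
itemsets = [frozenset({"a2", "a5"}), frozenset({"a2", "a4"}),
            frozenset({"a1", "a2", "a3"}), frozenset({"a3", "a4"}),
            frozenset({"a1", "a4", "a5"})]
chi2 = dict(zip(itemsets, [9.0, 8.0, 7.0, 6.0, 5.0]))  # hypothetical values
for new, cond in product_approximation(A, itemsets, chi2):
    print(sorted(new), "|", sorted(cond))
# ['a2', 'a5'] | []
# ['a4'] | ['a2']
# ['a1', 'a3'] | ['a2']
```

The three printed factors correspond to P(a2a5ci)P(a4|a2ci)P(a1a3|a2ci), as in the figure.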

3.3 Learning Algorithm

The learning algorithm is outlined in Figure 3.5. It generates all the frequent, interesting, and class-related itemsets based on Apriori [1].

The input of the algorithm is a training dataset D. The output is the set FI of frequent, interesting, and class-related itemsets with their class counts. In order to obtain the probability of an itemset X with each class, each itemset has a class counter (denoted by X.counti) for each class ci. The probability P(X, ci) is computed by dividing the class counter X.counti by the size of the training dataset |D|, that is, X.counti/|D| = P(X, ci). Each itemset X also has a variable X.xValue to hold its chi-square value with respect to the classes.

In lines 1 and 2, the learner scans dataset D once to add all 1-itemsets to F1 and determine


Algorithm Learner

Input: the training dataset D

Output: the set FI of itemsets X with their class counts X.counti and the chi-square value X.xValue

 1.  F1 = {all 1-itemsets}
 2.  count X.counti for all X ∈ F1
 3.  for (k = 2; Fk−1 ≠ ∅; k++) do
 4.      Ck = CandidateGen(Fk−1);
 5.      for each data example d ∈ D do
 6.          i = class of d;
 7.          Cd = SubSetGen(Ck, d);
 8.          for each candidate X ∈ Cd do
 9.              X.counti++
10.          end
11.      end
12.      Fk = PruneF(Ck)
13.      for each itemset X ∈ Fk do
14.          calculate X.xValue;
15.          if (X.xValue < δ) then
16.              X.xValue = 0
17.          end
18.      end
19.  end
20.  FI = ∪k Fk

Figure 3.5: The Learner algorithm


the count of each itemset for each class. The learner then uses Fk−1 as a seed set to generate a new set of possibly frequent itemsets of size k, called the candidate set: if Fk−1 is not empty, the CandidateGen function generates the candidate set of size k, which is assigned to Ck (line 4).

In lines 5-11, the learner scans dataset D to calculate the counts of all itemsets in Ck. For each data example d in D, the SubSetGen function generates the set Cd of subsets of d that belong to Ck (line 7). In lines 8-10, the class count X.counti of each itemset X ∈ Cd is increased. After the class counting is completed, the PruneF function prunes infrequent and uninteresting itemsets from Ck using the interestingness measure, and the remainder are added to Fk (line 12). The interestingness measure is presented in Section 3.1.

In lines 13-17, the learner tests the independence of each itemset in Fk and the class label by using the chi-square independence test. If an itemset is independent of the class label (i.e. its X.xValue is less than the critical value δ of the chi-square independence test), the itemset is treated as a class-independent itemset. Since the classification phase may use class-independent itemsets in the denominator part of a product approximation, these itemsets cannot be pruned in the learning phase; instead, we mark them by assigning 0 to X.xValue. Last, all Fk are united into a set FI, which is returned for the classification phase to build a classifier for new data objects.
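The chi-square value in line 14 can be computed from a 2 × |C| contingency table (itemset present/absent versus class label). The following is a hedged sketch of that computation; the function and parameter names are my own, and the table layout is the standard one for an independence test, which the thesis does not spell out:

```python
def chi_square_value(x_counts, class_totals):
    """Chi-square statistic for independence of an itemset X and the
    class label, from a 2 x |C| contingency table.

    x_counts[i]     -- training examples containing X that have class ci
    class_totals[i] -- all training examples that have class ci
    """
    n = sum(class_totals)                    # |D|
    x_total = sum(x_counts)                  # examples containing X
    chi2 = 0.0
    for cnt, tot in zip(x_counts, class_totals):
        # two rows: X present (row total x_total), X absent (n - x_total)
        for observed, row_total in ((cnt, x_total), (tot - cnt, n - x_total)):
            expected = row_total * tot / n   # expected count under independence
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2
```

When an itemset occurs with the same relative frequency in every class, the statistic is 0 and the itemset would be marked class-independent by the test in line 15.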

3.4 Classification Algorithm

In Figure 3.6, we present the steps of classifying a new data object A by using the frequent, interesting, and class-related itemsets in FI. The goal of the classifier is to generate and compute the product approximation of P(A, ci) for each class ci.

Initially, all items in A are added into a set called the non-covered set (line 1). num and den denote the sets of itemsets in the numerator and denominator parts of the product approximation; both are empty initially (lines 2 and 3). In line 4, the SelectMFI function selects the maximal sub-itemsets of A and sorts them by their chi-square values. An itemset X is a maximal sub-itemset of A if there is no sub-itemset X′ of A (X′ ⊆ A) such that X ⊂ X′.


Algorithm Classifier

Input: the set FI discovered in the learning phase, and a new data object A

Output: the class of A

 1.  NonCov = A
 2.  num = ∅
 3.  den = ∅
 4.  AS = SelectMFI(FI)
 5.  for (k = 1; NonCov ≠ ∅; k++) do
 6.      t = the first itemset in AS;
 7.      if (t ∩ NonCov ≠ ∅) then
 8.          num = num ∪ {t};
 9.          den = den ∪ {t ∩ (A − NonCov)};
10.          NonCov = NonCov − {t}
11.      end
12.      AS = AS − {t}
13.  end
14.  output the class ci with maximal P(A, ci), computed as:
15.  P(A, ci) = P(ci) · ∏x∈num P(x, ci) / ∏y∈den P(y, ci)

Figure 3.6: The Classifier algorithm

In lines 5-13, the classifier selects itemsets from AS in turn until the non-covered set is empty. In each iteration, the classifier picks the first itemset t from AS (line 6) and checks whether t contains any item that is not yet covered (line 7). If t has at least one item not yet covered, t is inserted into the solution. Each product is the conditional probability of the non-covered items of t given the covered ones and the class ci, as follows:

P(Non-CoveredItems | CoveredItems, ci) = P(t, ci) / P(t ∩ (A − NonCov), ci)   (3.6)

In lines 8 and 9, the itemset t is added to the numerator part of the product approximation and the set of covered items of t is inserted into the denominator part. Whether or not t is added to the solution, it is removed from AS (line 12). Finally, the classifier computes P(A, ci) for each class ci and classifies A to the class with the highest probability.


3.5 Zero Counts Smoothing

Consider a conditional probability P(x|y, ci) = P(x, y, ci)/P(y, ci). A zero count exists when the numerator or denominator of the conditional probability equals zero (i.e. P(x, y, ci) = 0 or P(y, ci) = 0). If the numerator P(x, y, ci) = 0, the probability P(A, ci) becomes zero, which causes line 14 of the classification algorithm to yield zero regardless of the probabilities of the other attributes. On the other hand, if the denominator P(y, ci) = 0, the probability P(A, ci) is undefined. In order to eliminate the effect of zero counts, LB uses the standard zero-count smoothing method as follows.

P(X|Y) = (nc + mP) / (n + m)   (3.7)

where nc is the count of X ∩ Y, n is the count of Y, P is the probability of X, and m is a small smoothing value.

In our method, we also use the same smoothing to deal with zero counts. Instead of P(x|y, ci) = P(x, y, ci)/P(y, ci), the smoothed conditional probability is formulated as

P(x|y, ci) = (|D|P(x, y, ci) + n0 · P(x)) / (|D|P(y, ci) + n0)   (3.8)

where n0 is a small smoothing parameter.
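Since |D|·P(x, y, ci) and |D|·P(y, ci) are just the raw class counts maintained by the learner, Equation (3.8) can be implemented directly on counts. A minimal sketch under that assumption (names are my own):

```python
def smoothed_conditional(count_xyc, count_yc, p_x, n0=5):
    """Smoothed P(x | y, ci) per Equation (3.8).

    count_xyc -- |D| * P(x, y, ci), raw count of x together with y and class ci
    count_yc  -- |D| * P(y, ci),    raw count of y with class ci
    p_x       -- unconditional probability P(x)
    n0        -- small smoothing parameter (5 in the experiments)
    """
    return (count_xyc + n0 * p_x) / (count_yc + n0)
```

With both counts zero the estimate falls back to P(x), so the product approximation is never zeroed out or undefined by a single unseen combination.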


Chapter 4

Experimental Results and Discussion

In this chapter, we compare our method, ACC, with NB [2] and LB [7] using 13 datasets from the UCI Machine Learning Database Repository [8]. Since the attribute types must be discrete, we use datasets with categorical attributes whenever possible. All of the attribute types in the 13 datasets are categorical or finite integral. If an integral attribute is discretized into categories in the documentation of the dataset, we use these predefined categories. Otherwise, the finite integral attribute is treated as already discretized (i.e. each value is treated as one item). Figure 4.1 presents the datasets that we used. Column 3 (#Items) shows the number of items, which are mapped from non-class attribute-value pairs; the mapping method is described in Section 3.1.1. Column 6 (#Train) denotes the size of the training set or the total number of examples in the dataset. Column 7 (#Test) indicates the size of the testing set or the experiment method, 10-fold cross-validation. The details are presented next.

We evaluate the accuracy and the compactness of a classifier by using the holdout method or the 10-fold cross-validation method. The holdout method splits a dataset into two parts, a training set and a validation set (usually a 2:1 ratio of examples). The 10-fold cross-validation method partitions a dataset into 10 parts as equal as possible; the experiment for a dataset is then iterated 10 times, each time using 9 parts as the training set and 1 part as the testing set, and the accuracy measure is the average over the 10 folds. In our experiments, if a dataset is pre-partitioned into a training set and a testing set, we use the holdout method; otherwise, we use 10-fold cross-validation. In Figure 4.1, the datasets monks1, monks2, and monks3 (datasets 6-8) are measured by using the holdout


[Table: for each of the 13 datasets — balance, breast, car, chess, flare, monks1, monks2, monks3, nursery, post, TicTacToe, voting, zoo — the columns #Attributes, #Items, #Classes, #Missing Values, #Train, and #Test; the numeric entries are not recoverable from the extracted text.]

Figure 4.1: Description of datasets

method. The remaining datasets are measured by using 10-fold cross-validation, marked "CV10" in Column 7 (#Test), with the total number of examples shown in Column 6 (#Train).
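The 10-fold protocol described above can be sketched as follows. This is an illustrative Python sketch; the round-robin fold assignment is one reasonable choice of near-equal partition, not necessarily the one used in the thesis:

```python
def ten_fold_splits(examples, k=10):
    """Partition `examples` into k near-equal folds and yield
    (train, test) pairs, each fold serving once as the test set."""
    folds = [examples[i::k] for i in range(k)]   # round-robin assignment
    for i in range(k):
        train = [x for j in range(k) if j != i for x in folds[j]]
        yield train, folds[i]
```

The reported accuracy for a dataset is then the mean accuracy over the 10 (train, test) pairs.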

4.1 Parameter Setting

For association-based classifiers, the minimum support threshold and the zero-count smoothing parameter are set to the default values suggested in the literature: the minimum support threshold is fixed to 1% (or 5 occurrences for small datasets) and the zero-count smoothing parameter n0 is set to 5 for all datasets. The interestingness threshold was set to 0.04 in LB. However, this value is too large for most of the 13 datasets; it makes LB collapse to NB in 10 of them. Therefore, we vary the interestingness threshold from 0.005 to 0.03 for LB and ACC. The critical value δ of the chi-square independence test is determined by the p-value and the degrees of freedom. In ACC, the p-value is set to 0.05 for all datasets.
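For a 2 × |C| contingency table the degrees of freedom are (2−1)(|C|−1) = |C|−1, so δ can be read from a standard chi-square table at p = 0.05. A small sketch using the standard table values (the lookup-table form and function names are my own; extend the table for datasets with more classes):

```python
# Critical values of the chi-square distribution at p = 0.05,
# indexed by degrees of freedom (standard statistical tables).
CHI2_CRITICAL_P05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def critical_value(num_classes):
    """delta for an itemset-vs-class table: df = |C| - 1."""
    return CHI2_CRITICAL_P05[num_classes - 1]

def is_class_related(x_value, num_classes):
    # Itemsets whose statistic exceeds delta are kept as class-related.
    return x_value > critical_value(num_classes)
```

For a two-class dataset, for example, an itemset is retained as class-related only when its chi-square value exceeds 3.841.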


4.2 Experimental Results

Figure 4.2 shows the accuracy on the 13 datasets for interestingness threshold values varying from 0.005 to 0.03. Since no classification method can outperform all others in all possible domains [10], we compare the accuracy of the three classifiers from several perspectives. In Figure 4.2, the accuracy over the different interestingness thresholds is averaged in the "Average" column. The last column, "Winner", shows which classifier achieves the best average accuracy on a dataset. As the figure shows, ACC achieves the best average accuracy on 6 datasets, while NB and LB achieve the best average accuracy on 4 and 3 datasets, respectively.

Figure 4.3 provides the individual accuracies of the three methods under various interestingness thresholds. For each dataset, the most accurate classifier is marked with a shadow. The interestingness threshold increases from left to right in Figure 4.3. As the threshold increases, the results of LB and ACC become more similar to those of NB, until the three methods yield the same measures. As the figure shows, the number of best-accuracy cases for ACC is greater than or equal to that of NB and LB. This means the result is stable and not sensitive to the variation of the interestingness threshold.

Figure 4.4 shows an extreme case in which the minimum support threshold is omitted (i.e. set to zero for all datasets). The best accuracy is marked with a shadow. ACC is stable under different interestingness thresholds and achieves the best accuracy on most datasets.

4.3 Discussion

4.3.1 The Effect of Missing Value

Among the 13 datasets, two, Post and Voting, contain missing values. Post contains only 3 missing values, while Voting contains 392 missing values, affecting 5.6% of all data objects. In this section, we examine the effect of missing values on accuracy. Since the Voting dataset has only 5.6% missing values, we increase the percentage of missing values from 10% to 50% for each dataset. The minimum support and interestingness thresholds are fixed to 1% and 0.01. Figures 4.5 and 4.6 show the experimental results. In 7 datasets (Fig. 4.5 (a)-(d), Fig. 4.6 (i), and Fig. 4.6 (l)-(m)), the accuracies of
