
National Central University

Department of Computer Science and Information Engineering, Master's Thesis

Association-Based Classification Using the Chi-Square Independence Test (應用卡方獨立性檢定於關連式分類問題)

Advisor: Dr. 張嘉惠    Graduate Student: 張毓美

June 2002


National Central University Library — Master's/Doctoral Thesis Authorization Form

(latest revision, May 2002)

The full text and electronic file of the thesis covered by this authorization form were written by me as a master's/doctoral thesis at National Central University. (Please check one of the following:)

( ✓ ) Agree (release immediately)

(   ) Agree (release after one year), because:

(   ) Agree (release after two years), because:

(   ) Disagree, because:

I grant National Central University Library and the National Central Library a non-exclusive, royalty-free license — for the purpose of promoting "resource sharing and mutual cooperation" among readers and of giving back to society and academic research — to collect, reproduce, and distribute this thesis, without limitation of place, time, or number of copies, in paper, optical disc, network, or any other form, and to sublicense others to reproduce and use it by any means, in order to provide readers with personal, non-profit online retrieval, browsing, downloading, or printing.

Student signature: 張毓美

Thesis title: 應用卡方獨立性檢定於關連式分類問題 (Association-Based Classification Using the Chi-Square Independence Test)    Advisor: 張嘉惠

Department: Institute of Computer Science and Information Engineering    Degree: ( ) Ph.D. (✓) Master's program    Student ID: 89522034

Date: July 12, 2002

Notes:

1. Fill out and sign this form, and bind it as the page following the cover of each paper copy of the thesis (in the electronic full-text copy, the signature on the authorization form may be typed).

2. Print one additional copy of this form, fill it out and sign it, and submit it to the library when completing leave-school procedures (the library will forward it to the National Central Library).

3. Readers who retrieve, browse, download, or print the above thesis online for personal, non-profit purposes must comply with the relevant provisions of the Copyright Act.


Abstract (in Chinese)

Classification has long been a central problem in machine learning. In recent years, with the rise of association rule mining techniques, more and more studies have applied them to classification. In this thesis, we study several association-based classification methods and propose a new one, called ACC (Association-based Classification using the Chi-square independence test). ACC uses association rule mining to find all frequent and interesting itemsets, which capture the relations between attributes. In addition, ACC applies the chi-square independence test to examine the relations between attributes and classes, keeping only the class-related frequent itemsets for prediction. We conduct experiments on 13 datasets from the UCI machine learning repository, comparing our method (ACC) with NB and LB, two efficient and accurate classifiers. The experimental results show that our method outperforms NB and LB on most of the datasets and is itself an efficient and accurate classification method.


Abstract

For many years, classification has been one of the key problems in machine learning research. Since association rule mining is an important and highly active area of data mining research, more and more classification methods are based on association rule mining techniques. In this thesis, we study several association based classification methods and provide a comparison of these classifiers. We present a new method, called ACC (i.e., Association based Classification using the Chi-square independence test), to solve the classification problem.

ACC finds frequent and interesting itemsets, which describe the relations between attributes.

Moreover, it applies the chi-square independence test to retain class-related itemsets for predicting new data objects. In addition, ACC provides an approach that considers the probability of missing value occurrence to handle missing values. We evaluate our method on 13 datasets from the UCI machine learning repository. We compare ACC with NB and LB, two state-of-the-art classifiers, and the experimental results show that our method is a highly efficient and accurate classifier.


Acknowledgments

First, I would like to thank my advisor, Professor 張嘉惠, for her careful guidance and instruction, as well as Professors 陳國棟 and 洪炯宗 for their teaching. During my two years in the database laboratory, I learned the spirit and methods of doing research and was able to complete this thesis. I also thank the oral examination committee members, Professors 陳彥良, 陳銘憲, 許鈞南, and 顏秀珍, for taking the time to review the thesis and for offering suggestions that made it more complete.

In addition, special thanks go to Director 周芳秀 of the library for her care and encouragement. I also thank my classmates 士賢, 東軒, 龍凱, 雍益, 仁皓, 良一, 廉勳, 國炎, 世賢, 傳正, 晶月, and 惠玲 — over these two years we encouraged and supported one another, and I felt the warmth and cohesion of the laboratory. Thanks as well to the junior students 智凱, 國瑜, 志龍, 聰鑫, and 釋謙 for their help during the past year, and to everyone who accompanied me through graduate school.

Finally, I thank my parents, family, and friends for their support and encouragement, which gave me a stable environment in which to complete my studies.

I dedicate this honor to all the family and friends who care about me.

Database Systems Laboratory, Institute of Computer Science and Information Engineering, National Central University — 張毓美, July 2002


Contents

1 Introduction 1

1.1 Association Rule Mining . . . 1

1.2 Concepts of Association Based Classification . . . 2

1.3 Method and Goal . . . 2

1.4 Organization of the Thesis . . . 3

2 Related Work 4

2.1 NB - Naive Bayes Classifier . . . 4

2.2 LB - Large Bayes Classifier . . . 5

2.3 CBA - Classification Based on Associations . . . 6

2.4 CMAR - Classification Based on Multiple Association Rules . . . 7

2.5 Comparison . . . 7

2.6 Summary . . . 8

3 Classification method 10

3.1 Learning Phase . . . 12

3.1.1 Discovering Frequent Itemsets . . . 12

3.1.2 Discovering Interesting Itemsets . . . 12

3.1.3 Discovering Class-Related Itemsets . . . 13

3.2 Classification Phase . . . 14

3.3 Learning Algorithm . . . 15

3.4 Classification Algorithm . . . 17

3.5 Zero Counts Smoothing . . . 19

4 Experimental Results and Discussion 20

4.1 Parameter Setting . . . 21

4.2 Experimental Results . . . 22

4.3 Discussion . . . 22

4.3.1 The Effect of Missing Value . . . 22

4.3.2 The Effect of Parameter Setting . . . 26

5 Conclusion and Future Work 31


List of Figures

2.1 The comparison of four classifiers . . . 9

3.1 Possible product approximations of P (a1, a2, a3) . . . 11

3.2 The model built by two approximations. . . 11

3.3 Contingency table for itemset X . . . 14

3.4 Incremental construction of a product approximation for P (a1, . . . , a5, ci) . . 15

3.5 The Learner algorithm . . . 16

3.6 The Classifier algorithm . . . 18

4.1 Description of datasets . . . 21

4.2 Average accuracy . . . 23

4.3 Accuracy for various values of the interestingness threshold when minimum support is set to 1% . . . 24

4.4 Accuracy for various values of the interestingness threshold when minimum support is set to 0 . . . 25

4.5 The effect of missing value . . . 27

4.6 The effect of missing value ( Cont.) . . . 28

4.7 Number of itemsets used by ACC in different interestingness . . . 29

4.8 The effect of interestingness threshold . . . 29

4.9 The effect of minimum support threshold . . . 30


List of Tables

2.1 Meaning of the symbols used . . . 8


Chapter 1 Introduction

Classification is one of the key problems in data mining and machine learning research.

Given a training set where each data object has a class label, the task is to build a model, called a classifier, to predict the class labels of new data objects. In previous studies, many techniques have been devised to build accurate classifiers, for example, decision trees, Bayesian classifiers, and neural networks.

Association rule mining is an important and highly active area of data mining research.

The techniques of frequent itemset discovery in association rule mining are widely used in many applications. Since association rules and traditional classification share some common concepts, several recent studies apply these concepts to integrate association rule mining techniques into the traditional classification problem. In this thesis, we call such an integrated framework association based classification.

1.1 Association Rule Mining

Association rule mining extracts a set of rules that satisfy user-specified minimum support and minimum confidence constraints. It consists of two steps: the first step is frequent itemset discovery, and the second step is association rule discovery. Let I = {i1, i2, . . . , im} be a set of items. Let a database D be a set of transactions, where each transaction T is a subset of I. If an itemset X is contained in a transaction T (X ⊆ T), the transaction T is said to support X. If the number of transactions that support an itemset X is equal to or greater than the minimum support threshold, X is called a frequent itemset. An association rule X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅, has support s and confidence


c if s% of the transactions contain X ∪ Y and c% of the transactions that contain X also contain Y.
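As an illustration (not from the thesis), the two measures can be sketched in Python over a toy transaction database:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(X, Y, transactions):
    """Confidence of the rule X -> Y, i.e., support(X u Y) / support(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

# Hypothetical database of four transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(support({"a", "b"}, transactions))       # 0.5
print(confidence({"a"}, {"b"}, transactions))  # 0.666...
```

With a minimum support of 50%, the itemset {a, b} above is frequent; the rule a → b holds with confidence 2/3.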

1.2 Concepts of Association Based Classification

In association based classification, we are interested in a particular kind of association rule called class association rules (CARs). Let A = {a1, a2, . . . , am} be the set of attributes and C = {c1, c2, . . . , cn} be the set of class labels, with A ∩ C = ∅; the items are drawn from the union of A and C. Class association rule mining extracts rules X → c that satisfy the support and confidence thresholds, where X is a set of attributes (X ⊆ A) and c is a class label (c ∈ C). Note that the itemset X in a class association rule X → c is a frequent itemset. To discover class association rules, we first discover non-class frequent itemsets.

From the perspective of probability, the support of an itemset X is the probability that X occurs, and the support of an association rule is simply the probability that the rule holds. The confidence of an association rule X → Y is the conditional probability of Y given that X has occurred. Classifying an object X is equivalent to computing the conditional probability P(c|X) for each class c, i.e., the confidence of the class association rule X → c.

Therefore, class association rule mining can be applied to classification based on the theory of probability. The simplest probability-based classifier is the Naive Bayes (NB) classifier [2]. Research on association based classification, including Large Bayes (LB) [7], CBA [6], and CMAR [5], will be discussed in Chapter 2.

1.3 Method and Goal

In this thesis, we propose a classifier called ACC (i.e., Association based Classification using the Chi-square independence test). ACC is similar to LB in mining associations between attributes. LB uses association mining to find frequent and interesting itemsets and applies several heuristics to select itemsets for probability approximation. Although LB is an accurate and efficient classifier, its heuristic conditions for classifying a new data object are complicated. Therefore, we propose a simplified method that solves the classification problem while retaining high accuracy and high performance.

Our method, ACC, uses the chi-square independence test to analyze the relations between


attributes and classes, and then chooses class-related itemsets for probability approximation.

We conduct several experiments on 13 datasets from the UCI machine learning dataset repository [8]. The experimental results show that our method is more accurate than NB and LB.

1.4 Organization of the Thesis

The rest of this thesis is organized as follows. Chapter 2 reviews related work on association based classification and discusses the differences among these classifiers. Chapter 3 presents our method in detail. Chapter 4 presents the experimental results of ACC and compares its accuracy with LB and NB. The thesis is concluded in Chapter 5.


Chapter 2

Related Work

Data classification is a two-step process. In the first step, a model is built to describe a set of training data. In the second step, the model is used for classification. To compare various algorithms, one can measure predictive accuracy, speed, robustness, scalability, etc. In this chapter, we describe two recent approaches extended from association mining.

The first approach is based on class association rules. The earliest is CBA, proposed by Bing Liu, Wynne Hsu, and Yiming Ma in 1998 [6]; CMAR is an extension proposed by Wenmin Li, Jiawei Han, and Jian Pei in 2001 [5]. CBA builds a classifier that is simply a set of class association rules. CMAR extends CBA by considering multiple rules for a new data object. The second approach is based on frequent itemsets, from which interesting ones are sifted for probability prediction. LB is the representative work of this approach. Since LB is an extension of NB that applies association mining to find long itemsets, we first give an introduction to Naive Bayes and then compare the recent studies.

2.1 NB - Naive Bayes Classifier

Let C = {c1, c2, . . . , cm} be the set of class labels, and let each data object have a set of attribute values X = {a1, a2, . . . , an}. NB classifies a new data object X to the class ci that has the highest posterior probability P(ci|X).

According to Bayes' theorem, P(H|X) = P(X|H)P(H)/P(X), so the posterior probability P(ci|X) is equivalent to P(ci)P(X|ci)/P(X). Since the denominator P(X) is the same for every class ci, the calculation of P(ci|X) can be simplified to calculating P(ci)P(X|ci). NB assumes that the effect of an attribute value on a given class is independent of the values of


the other attributes. Therefore, P (ci|X) can be approximated by the following formula:

P(ci|X) = P(ci)P(a1, a2, . . . , an|ci) = P(ci)P(a1|ci)P(a2|ci) · · · P(an|ci) = P(ci) ∏_{j=1}^{n} P(aj|ci)    (2.1)

The independence assumption makes the calculation much easier.

Learning phase: Given a training set consisting of a set of attributes A = {a1, a2, . . . , an} and a set of class labels C = {c1, c2, . . . , cm}, calculate P(ci) and P(aj|ci). The classifier records all of these probabilities for the next phase.

Testing phase: Given a new object X = {a1, a2, . . . , ak}, use Equation 2.1 to calculate P(ci|X) for each ci, and classify the object to the class with the highest P(ci|X).
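The two phases can be sketched in Python. This is a minimal illustration of Equation 2.1 on hypothetical weather-style data, not the thesis implementation; note that a zero count for any P(aj|ci) zeroes the whole product, which is exactly the problem addressed by zero-count smoothing later in the thesis.

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (attribute-value tuple, class label) pairs."""
    class_counts = Counter(c for _, c in examples)
    # cond_counts[c][j][v] = number of class-c examples with value v at attribute j
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, c in examples:
        for j, v in enumerate(attrs):
            cond_counts[c][j][v] += 1
    return class_counts, cond_counts, len(examples)

def classify_nb(x, model):
    """Pick the class maximizing P(ci) * prod_j P(aj|ci) (Equation 2.1)."""
    class_counts, cond_counts, n = model
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                               # P(ci)
        for j, v in enumerate(x):
            score *= cond_counts[c][j][v] / cc       # P(aj | ci), unsmoothed
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical training data: (outlook, temperature) -> play?
data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "cool"), "yes")]
model = train_nb(data)
print(classify_nb(("rain", "mild"), model))  # 'yes'
```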

2.2 LB - Large Bayes Classifier

LB [7] is an extension of NB. It applies association mining to find frequent and interesting itemsets, relaxing the assumption of attribute independence. For example, suppose we know that a1, a2, a3 are dependent but are independent of a4 and a5. To approximate P(a1, a2, a3, a4, a5|ci), we can compute the product P(a1, a2, a3|ci)P(a4|ci)P(a5|ci).

The main idea of LB is to find the relations between attributes using frequent itemset discovery. In order to avoid a huge number of frequent itemsets, LB adopts a variation of the cross entropy I_{P−P′} between two probability distributions P and P′ as an interestingness measure to prune itemsets.

How do we approximate P(ci|X) from a set of interesting frequent itemsets? This depends on which independence assumptions are considered first. For the set of interesting frequent itemsets {a2, a5}, {a1, a2, a3}, and {a1, a4, a5}, the following are two valid product approximations of P(a1, a2, a3, a4, a5|ci), depending on which itemset is considered first.

1. {a1, a2, a3}, {a1, a4, a5} ⇒ P (a1, a2, a3, ci)P (a4, a5|a1, ci)

2. {a1, a2, a3}, {a2, a5}, {a1, a4, a5} ⇒ P (a1, a2, a3, ci)P (a5|a2, ci)P (a4|a1, a5, ci)

Therefore, the order in which frequent itemsets are selected for the product is important. LB uses four conditions to determine the order of the itemsets. When the selected itemsets are all of size one, LB collapses to NB. The computation of the interestingness measure and the selection of itemsets make LB much more expensive than NB.


Learning phase: Given a training set consisting of a set of attributes A = {a1, a2, . . . , an} and a set of class labels C = {c1, c2, . . . , cm}, discover the set of interesting and frequent itemsets, called FI, from the attribute set (i.e., each itemset in FI consists of attribute items only). Each itemset in FI records its interestingness value and its count in each class. The classifier keeps FI for the next phase.

Testing phase: Order the frequent itemsets in FI according to the four conditions specified in [7]. Approximate P(a1, a2, . . . , ak, ci) for each ci using FI according to that order. Calculate all the products, and then classify the object to the class with the highest value.

2.3 CBA - Classification Based on Associations

The idea of CBA [6] is to generate a set of class association rules (CARs) as a classifier. A class association rule (called a ruleitem) is of the form <condset, c>, which represents a rule condset → c, where condset is a set of attribute items and c is a class label. If a new data object X matches the condset (i.e., X ⊇ condset), then X is classified to class c.

CBA consists of two parts, a rule generator (called CBA-RG) and a classifier builder (called CBA-CB). The rule generator is based on the Apriori algorithm and finds all ruleitems. Two ruleitems may have the same condset but different c. The classifier builder ranks the rules and determines the default class for the classifier as follows.

Rule Rank: Given two rules r and r′, we write r ≻ r′ (r has a higher rank than r′) if one of the following conditions holds:

1. the confidence of r is greater than that of r′, or

2. their confidences are the same, but the support of r is greater than that of r′, or

3. both the confidences and supports of r and r′ are the same, but the size of r is smaller than that of r′.
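These three conditions amount to a lexicographic comparison, which can be expressed as a comparator; the rule representation below (dicts with `conf`, `sup`, and `condset` fields) is a hypothetical encoding, not CBA's data structure:

```python
from functools import cmp_to_key

def rule_rank(r1, r2):
    """Negative if r1 ranks higher than r2 under CBA's ordering:
    higher confidence first, then higher support, then smaller antecedent."""
    if r1["conf"] != r2["conf"]:
        return -1 if r1["conf"] > r2["conf"] else 1
    if r1["sup"] != r2["sup"]:
        return -1 if r1["sup"] > r2["sup"] else 1
    return len(r1["condset"]) - len(r2["condset"])

rules = [
    {"condset": {"a1", "a2"}, "cls": "c1", "conf": 0.9, "sup": 0.10},
    {"condset": {"a3"},       "cls": "c2", "conf": 0.9, "sup": 0.15},
    {"condset": {"a1"},       "cls": "c1", "conf": 0.8, "sup": 0.30},
]
ranked = sorted(rules, key=cmp_to_key(rule_rank))
print([r["cls"] for r in ranked])  # ['c2', 'c1', 'c1']
```

The two 0.9-confidence rules tie on condition 1, so the one with higher support wins under condition 2.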

Default Class: Let r be a ruleitem, D be the training data, and Dr be the set of data objects that support rule r. The default class is the majority class in the remainder of D after removing Dr.

Learning phase: Given a training set consisting of a set of attributes A = {a1, a2, . . . , an} and a set of class labels C = {c1, c2, . . . , cm}, mine the class association rules with the highest confidence for each class. All rules are ranked by confidence, support, and the size of the


itemset. The classifier is of the form <r1, r2, . . . , rn, default class>.

Testing phase: Given a new object A = {a1, a2, . . . , ak}, find the first matching rule and classify the object to that rule's class.

2.4 CMAR - Classification Based on Multiple Association Rules

As the name implies, CMAR [5] is similar to CBA but considers the multiple association rules matched by a new data object. If the matching rules are not consistent in their class labels, CMAR divides the rules into groups according to class label; all rules in a group share the same class label. Each matching rule has a χ² value. For a rule R : P → c, let sup(c) be the number of data objects in the training dataset associated with class label c and |T| the number of data objects in the training dataset. CMAR defines maxχ² to compute the upper bound of χ² for rule R as follows:

maxχ² = ( min{sup(P), sup(c)} − sup(P)sup(c)/|T| )² · |T| · e    (2.2)

where

e = 1/(sup(P)sup(c)) + 1/(sup(P)(|T| − sup(c))) + 1/((|T| − sup(P))sup(c)) + 1/((|T| − sup(P))(|T| − sup(c)))    (2.3)

For each group, CMAR sums up the χ² values of the rules in the group. The group with the highest χ² value is used to predict the class of the new data object.
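Equations 2.2 and 2.3 can be checked with a short sketch; the counts in the usage line are made up for illustration:

```python
def max_chi2(sup_P, sup_c, T):
    """Upper bound of chi-square for a rule P -> c (Equations 2.2 and 2.3).

    sup_P, sup_c: absolute counts of training objects matching P and labeled c;
    T: total number of training objects."""
    e = (1.0 / (sup_P * sup_c)
         + 1.0 / (sup_P * (T - sup_c))
         + 1.0 / ((T - sup_P) * sup_c)
         + 1.0 / ((T - sup_P) * (T - sup_c)))
    return (min(sup_P, sup_c) - sup_P * sup_c / T) ** 2 * T * e

# Hypothetical counts: 30 objects match P, 40 carry class c, 100 objects total.
print(round(max_chi2(30, 40, 100), 2))  # 64.29
```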

Learning phase: similar to CBA, except that the rules are not ranked.

Testing phase: Given a new object A = {a1, a2, . . . , ak}, find the set of matching rules and divide them into groups according to class labels. Use the weighted χ² to analyze the groups and choose the group with the highest χ² value.

2.5 Comparison

In this section, we compare the four algorithms NB, LB, CBA, and CMAR in Figure 2.1. We compare the space and time complexity of the learning phase. For the testing phase, we compare the performance, classifier complexity, and accuracy. The symbols in Figure 2.1 are described as follows:


Symbol | Description
c      | the number of classes
I      | the set of total attribute items
Lk     | the set of frequent k-itemsets
Ck     | the set of candidate k-itemsets
FI     | the set of total frequent itemsets
CAR    | the set of class association rules

Table 2.1: Meaning of the symbols used

Figure 2.1 shows the worst case for each classifier. In the learning phase, NB uses the least space since it only records the count of each item. LB uses the most, since the probability approximation may use the information of Lk−1 and Lk−2. CBA and CMAR both generate Ck from Lk−1 for each length k. In terms of speed, NB scans the database only once while the others scan it k times.

In the testing phase, NB records all attribute items for each class. LB holds the counts of all frequent and interesting itemsets for each class, plus extra space for the interestingness value of each itemset. CBA and CMAR store the class association rules generated in the learning phase. The time complexity of each method depends on the space it takes, since the worst case is searching the whole space.

In terms of classifier complexity, NB is the simplest classifier: it multiplies all the conditional probabilities for each class. The next is CBA, which finds the first matching rule and then predicts. The third is CMAR, since it must consider multiple matching rules. LB is more complicated than the others, since it uses four heuristic conditions.

The accuracy comparison is drawn from the experimental results presented in the literature. LB outperforms NB and CBA as shown in [7], and CMAR is more accurate than CBA as presented in [5]. CBA is better than NB as described in [9]. LB and CMAR have not been compared in the literature.

2.6 Summary

In this section, we summarize the comparison of time complexity, space complexity, accuracy, and classifier complexity for the four algorithms. From Figure 2.1, we find that considering


Phase                           | NB        | LB               | CBA            | CMAR
Learning phase: Space           | I         | Lk−2 + Lk−1 + Ck | Lk−1 + Ck      | Lk−1 + Ck
Learning phase: Time            | 1 DB scan | k DB scans       | k DB scans     | k DB scans
Testing phase: Space            | cI        | (c + 1)FI        | CAR            | CAR
Testing phase: Time             | O(I)      | O(FI)            | O(CAR)         | O(CAR)
Testing phase: Classifier complexity | easiest | most complex  | easy           | middle
Accuracy                        | lowest    | better than CBA  | better than NB | better than CBA

Figure 2.1: The comparison of four classifiers

the relationship between items achieves higher accuracy. NB is the simplest and most efficient classifier but has the lowest accuracy, since it assumes all attributes are independent.

Second, CBA is also a simple classifier, since it considers only one matching rule and prefers high-confidence rules; usually, high-confidence rules have low support. Therefore, CMAR considers multiple matching rules to achieve higher accuracy. Last, LB achieves high accuracy on many datasets, but its testing phase is the most complex, since it uses four heuristic conditions that are not easy to understand.


Chapter 3

Classification method

Our proposed framework, ACC, is a modified LB classifier. There are two differences between LB and ACC. The first is the approach to dealing with missing values. LB ignores the occurrence of missing values: if the value of an attribute is unknown, LB removes that value from the data object. Consequently, the size of each data object varies with the number of missing values. Our method considers the probability of missing value occurrence: if an attribute has a missing value, we treat the missing value as an item. Therefore, the sizes of all data objects are equal.

Second, LB assumes that not all attributes are independent. It discovers frequent and interesting itemsets to represent the relations between attributes, and applies several heuristics to determine the order of itemsets for the probability approximation of new data objects.

The probability approximation follows [4], which approximates higher order probability distributions by their lower order components. For example, {a1, a2}, {a1, a3}, and {a2, a3} are three subsets of {a1, a2, a3}. Suppose {a1, a2} and {a2, a3} are considered in turn; we can then approximate P(a1, a2, a3) by P(a1, a2)P(a3|a2). If, on the other hand, {a1, a3} and {a2, a3} are considered first, we can approximate P(a1, a2, a3) by P(a1, a3)P(a2|a3). With different selection orders, we obtain different approximations of P(a1, a2, a3) (see Figure 3.1).

Different selection orders produce different combinations of itemsets for the approximation, and different combinations of itemsets determine different independence assumptions between attributes. Figure 3.2 shows the models built by two of the approximations.

Consider a bigger example: suppose the available set for approximating P(a1, . . . , a5) is {{a1, a2}, {a2, a3}, {a3, a4}, {a3, a5}, {a2, a4, a5}}. Two possible approximations are shown


Selection order ⇒ product approximation

1. {a1, a2}, {a2, a3} ⇒ P(a1, a2)P(a3|a2)
2. {a1, a2}, {a1, a3} ⇒ P(a1, a2)P(a3|a1)
3. {a1, a3}, {a1, a2} ⇒ P(a1, a3)P(a2|a1)
4. {a1, a3}, {a2, a3} ⇒ P(a1, a3)P(a2|a3)
5. {a2, a3}, {a1, a2} ⇒ P(a2, a3)P(a1|a2)
6. {a2, a3}, {a1, a3} ⇒ P(a2, a3)P(a1|a3)

Figure 3.1: Possible product approximations of P(a1, a2, a3)

[Figure content not recoverable from the extracted text: two node-link diagrams, labeled Approximation 1 and Approximation 2, over attributes a1-a5.]

Figure 3.2: The model built by two approximations.

as follows.

1. {a1, a2}, {a2, a3}, {a3, a4}, {a3, a5} ⇒ P(a1, a2)P(a3|a2)P(a4|a3)P(a5|a3)

2. {a1, a2}, {a2, a3}, {a3, a4}, {a2, a4, a5} ⇒ P(a1, a2)P(a3|a2)P(a4|a3)P(a5|a2, a4)

In approximation 1, {a2, a4, a5} is not selected because after choosing {a3, a5} all items are covered. Conversely, when {a2, a4, a5} is selected first, {a3, a5} never appears, as in approximation 2. The models built by the two product approximations are shown in Figure 3.2.

An edge between two nodes represents that they are not independent. The selection order directly affects the independence assumptions between attributes. For example, a3 and a5 are assumed to be independent in approximation 1, but they are considered associated in approximation 2.

Since different selection orders produce different prediction results, we propose a χ² testing method that considers both the relations between attribute items and the dependency between attribute itemsets and classes.


3.1 Learning Phase

The goal of the learning phase is to determine the relationships and dependencies between attributes. It extends Apriori [1] to discover frequent and interesting itemsets, which represent the relationships between items. At the same time, our algorithm checks the independence of each itemset from the classes by the chi-square independence test.

3.1.1 Discovering Frequent Itemsets

Each data object in the training set, called training example, contains attributes and a class label. All attributes are assumed to be discrete. Continuous attributes can be discretized into intervals using standard discretization algorithms [3]. The resulting intervals are mapped into distinct values. Each possible attribute-value pair is treated as an item.

Let D be the set of training examples and C = {c1, . . . , cm} the set of classes. Each training example is treated as a transaction that contains n non-missing attribute values and a class label; that is, each transaction is represented by an attribute itemset labeled with a class: {a1, . . . , an, ci}. An itemset X has support s for class ci in D (denoted by X.supi = s) if s% of the transactions in D contain both ci and X. The support of X in D, denoted by X.sup, is defined as the sum of the supports X.supi over all classes in C. An itemset is called frequent if its support X.sup satisfies the user-specified minimum support (called min sup).

Notice that X.supi is the observed probability P(X, ci) and X.sup is the observed probability P(X).

3.1.2 Discovering Interesting Itemsets

Long itemsets are obviously preferred for classification, as they provide more information about the relations between attribute items. In order to find longer itemsets, the minimum support must be small, but discovering frequent itemsets with a small minimum support usually generates a huge number of patterns. Since the probability of a long itemset can be estimated from its sub-itemsets, a long itemset can be omitted if its approximated probability is close to its actual probability. The interestingness measure is defined as follows.


Let X be an itemset of size |X|. Let Xj and Xk be two (|X|−1)-itemsets obtained from X by removing the j-th and the k-th item, respectively. We use P(Xj, ci) and P(Xk, ci) to estimate P(X, ci). The estimate Pj,k(X, ci) is simply the product approximation in [4]:

Pj,k(X, ci) = P(Xj, ci)P(Xk − Xj | Xj ∩ Xk, ci) = P(Xj, ci)P(Xk, ci) / P(Xj ∩ Xk, ci)    (3.1)

The interestingness Ij,k of an itemset X with respect to Xj and Xk is defined as the error between the estimate Pj,k(X, ci) and the actual probability P(X, ci). If the difference is small, the long itemset can be estimated by its subsets. In order to measure the difference between the two probabilities, we use the cross entropy from information theory:

Ij,k = Σ_{ci} | P(X, ci) log( P(X, ci) / Pj,k(X, ci) ) |    (3.2)

When P(X, ci) = Pj,k(X, ci), Ij,k becomes zero; if Ij,k tends to zero, the itemset X is regarded as uninteresting. On the other hand, if the actual probability P(X, ci) is greater or less than the estimate Pj,k(X, ci), the itemset X should be preserved, since it cannot be estimated accurately.

Since there are C(|X|, 2) pairs of different Xj and Xk, the interestingness I(X) of an itemset X is defined as the average of the Ij,k over all pairs:

I(X) = ( 2 / (|X|(|X| − 1)) ) Σ_{j=1}^{|X|−1} Σ_{k=j+1}^{|X|} Ij,k    (3.3)
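Equations 3.1-3.3 can be sketched as follows. The probability function `p` is a stand-in for the class counters kept by the learner (an assumption of this sketch), and the independent distribution in the usage line is made up to show that I(X) vanishes when the estimate is exact:

```python
import math
from itertools import combinations

def interestingness(X, classes, p):
    """Average interestingness I(X) over all pairs (Xj, Xk) -- Equations 3.1-3.3.

    p(itemset, ci) must return the observed probability P(itemset, ci)."""
    X = list(X)
    total, pairs = 0.0, 0
    for j, k in combinations(range(len(X)), 2):
        Xj = frozenset(X[:j] + X[j + 1:])      # X minus the j-th item
        Xk = frozenset(X[:k] + X[k + 1:])      # X minus the k-th item
        I_jk = 0.0
        for ci in classes:
            actual = p(frozenset(X), ci)
            denom = p(Xj & Xk, ci)
            est = p(Xj, ci) * p(Xk, ci) / denom if denom else 0.0  # Eq. 3.1
            if actual > 0 and est > 0:
                I_jk += abs(actual * math.log(actual / est))       # Eq. 3.2
        total += I_jk
        pairs += 1
    return total / pairs                       # Eq. 3.3: average over all pairs

# If the items are mutually independent, the estimate equals the actual
# probability and the itemset is uninteresting: I(X) = 0.
p_indep = lambda S, ci: 0.5 ** len(S) * 0.5
print(interestingness(["a1", "a2", "a3"], ["c1", "c2"], p_indep))  # 0.0
```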

3.1.3 Discovering Class-Related Itemsets

Frequent and interesting itemsets are not necessarily related to the classes. We use the χ² independence test to check whether an itemset is independent of the classes; if it is, we prune that itemset. Let |D| be the total number of transactions in the training data and C = {c1, . . . , ck} a set of classes. The contingency table for an itemset X is a 2 × k table, as shown in Figure 3.3.

The chi-square independence test for an itemset X with class set C is defined as follows:

χ² = Σ_{i=1}^{k} (P(X, ci) − E_{X,ci})² / E_{X,ci} + Σ_{i=1}^{k} (P(¬X, ci) − E_{¬X,ci})² / E_{¬X,ci}    (3.4)


               | c1        | . . . | ci        | . . . | ck        | Row total
X occurred     | P(X, c1)  | . . . | P(X, ci)  | . . . | P(X, ck)  | P(X)
X not occurred | P(¬X, c1) | . . . | P(¬X, ci) | . . . | P(¬X, ck) | P(¬X)
Column total   | P(c1)     | . . . | P(ci)     | . . . | P(ck)     | |D|

Figure 3.3: Contingency table for itemset X

where E_{X,ci} and E_{¬X,ci} are the expected values:

E_{X,ci} = P(X)P(ci) / |D|,    E_{¬X,ci} = P(¬X)P(ci) / |D|    (3.5)

An itemset that passes the critical value of the chi-square independence test is called a class-related itemset. If the itemsets selected in the product approximation are all 1-itemsets, the probability collapses to Naive Bayes. Therefore, all 1-itemsets are assumed frequent, interesting, and class-related in the learning phase.
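Equations 3.4 and 3.5 amount to the standard chi-square statistic on a 2 × k table of counts; a minimal sketch (not the thesis code), with made-up counts in the usage lines:

```python
def chi2_statistic(counts_X, counts_notX):
    """Chi-square statistic for a 2 x k contingency table (Equations 3.4-3.5).

    counts_X[i]    = number of objects containing itemset X with class ci,
    counts_notX[i] = number of objects without X with class ci."""
    D = sum(counts_X) + sum(counts_notX)
    row_X, row_notX = sum(counts_X), sum(counts_notX)
    chi2 = 0.0
    for obs_x, obs_nx in zip(counts_X, counts_notX):
        col = obs_x + obs_nx                 # column total: count of class ci
        e_x = row_X * col / D                # E_{X,ci}  (Equation 3.5)
        e_nx = row_notX * col / D            # E_{notX,ci}
        chi2 += (obs_x - e_x) ** 2 / e_x + (obs_nx - e_nx) ** 2 / e_nx
    return chi2

# Perfectly class-related: X occurs only with c1 -> large statistic.
print(chi2_statistic([50, 0], [0, 50]))    # 100.0
# Independent of the class -> statistic is 0, so the itemset is pruned.
print(chi2_statistic([25, 25], [25, 25]))  # 0.0
```

The learner would compare this statistic against the critical value δ for k − 1 degrees of freedom.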

3.2 Classification Phase

Let FI denote the resulting set of itemsets found in the learning phase. To classify a new data object A = {a1, . . . , an}, our classifier approximates P(A, ci) for each class ci and assigns the object to the class with the highest value of P(A, ci). Since long itemsets provide more information about the higher order interactions between items, we first select the maximal itemsets that are subsets of A and denote this set by AS. The itemsets in AS are then sorted by their χ² values from high to low. The product approximation of A is created incrementally by adding one itemset from AS at a time until no more itemsets can be added. An item included in the product approximation is said to be covered. An itemset is not added to the solution if all of its items are already covered.

Figure 3.4 shows the incremental construction of a product approximation for P (A, ci).

Suppose a new data object A = {a1, . . . , a5}. Initially, the itemsets in the available set are sorted by their χ² values and the non-covered set is set to A (i.e., no item is covered in the product approximation solution). An itemset in the available set is inserted into the solution if it contains at least one item in the non-covered set. The first itemset considered is {a2a5}. Because items a2 and a5 are both in the non-covered set, the items a2 and a5 are removed


Step | Non-covered set      | Itemset selected | Product approximation           | Available itemset list
0    | {a1, a2, a3, a4, a5} | ∅                | N/A                             | {{a2a5}, {a2a4}, {a1a2a3}, {a3a4}, {a1a4a5}}
1    | {a1, a3, a4}         | {a2a5}           | P(a2a5ci)                       | {{a2a4}, {a1a2a3}, {a3a4}, {a1a4a5}}
2    | {a1, a3}             | {a2a4}           | P(a2a5ci)P(a4|a2ci)             | {{a1a2a3}, {a3a4}, {a1a4a5}}
3    | ∅                    | {a1a2a3}         | P(a2a5ci)P(a4|a2ci)P(a1a3|a2ci) | {{a3a4}, {a1a4a5}}

Figure 3.4: Incremental construction of a product approximation for P(a1, . . . , a5, ci)

from the non-covered set. The itemset {a2a5} is then removed from the available itemset list and inserted into the solution. Now the first itemset in the available list is {a2a4}. Since item a4 is in the non-covered set, the itemset {a2a4} is selected and item a4 is removed from the non-covered set. Note that since item a2 appeared in a previous product, it is placed in the conditional part of the product. In the third iteration, itemset {a1a2a3} is selected. After removing items a1 and a3 from the non-covered set, all items in A are covered and the construction of the product approximation is done. As a result, P(A, ci) is computed using the product approximation P(a2a5ci)P(a4|a2ci)P(a1a3|a2ci).
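The greedy construction of Figure 3.4 can be sketched as follows; the χ² values below are made up but reproduce the same selection order as the figure:

```python
def product_approximation(A, itemsets, chi2):
    """Greedily build the factors of a product approximation for object A.

    itemsets: class-related itemsets that are subsets of A (found in learning);
    chi2: maps each itemset to its chi-square value.
    Returns a list of (new items, conditioning items) factors."""
    available = sorted(itemsets, key=chi2.get, reverse=True)
    non_covered = set(A)
    solution = []
    for s in available:
        new = s & non_covered
        if not new:              # all items of s already covered: skip s
            continue
        solution.append((new, s - new))   # covered items go to the condition
        non_covered -= new
        if not non_covered:
            break
    return solution

A = {"a1", "a2", "a3", "a4", "a5"}
itemsets = [frozenset({"a2", "a5"}), frozenset({"a2", "a4"}),
            frozenset({"a1", "a2", "a3"}), frozenset({"a3", "a4"}),
            frozenset({"a1", "a4", "a5"})]
chi2 = dict(zip(itemsets, [9.0, 8.0, 7.0, 6.0, 5.0]))  # hypothetical values
for new, cond in product_approximation(A, itemsets, chi2):
    print(sorted(new), "|", sorted(cond))
# ['a2', 'a5'] | []
# ['a4'] | ['a2']
# ['a1', 'a3'] | ['a2']
```

The three printed factors correspond to P(a2a5ci)P(a4|a2ci)P(a1a3|a2ci), as in the figure.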

3.3 Learning Algorithm

The learning algorithm is outlined in Figure 3.5. It generates all the frequent, interesting, and class-related itemsets based on Apriori [1].

The input of the algorithm is a training dataset D. The output is the set FI of frequent, interesting, and class-related itemsets with their class counts. In order to obtain the probability of an itemset X with each class, each itemset has a class counter (denoted by X.counti) for each class ci. The probability P(X, ci) is computed by dividing the class counter X.counti by the size of the training dataset |D|, that is, X.counti/|D| = P(X, ci). Each itemset X also has a variable X.xValue to hold its chi-square value with respect to the classes.

In lines 1 and 2, the learner scans dataset D once to add all 1-itemsets to F1 and determine


Algorithm Learner

Input: the training dataset D

Output: the set FI of itemsets X with their class counts X.counti and the chi-square value X.xValue

 1.  F1 = {all 1-itemsets}
 2.  count X.counti for all X ∈ F1
 3.  for (k = 2; Fk−1 ≠ ∅; k++) do
 4.      Ck = CandidateGen(Fk−1);
 5.      for each data example d ∈ D do
 6.          i = class of d;
 7.          Cd = SubSetGen(Ck, d);
 8.          for each candidate X ∈ Cd do
 9.              X.counti++
10.          end
11.      end
12.      Fk = PruneF(Ck)
13.      for each itemset X ∈ Fk do
14.          calculate X.xValue;
15.          if (X.xValue < δ) then
16.              X.xValue = 0
17.          end
18.      end
19.  end
20.  FI = ∪k Fk

Figure 3.5: The Learner algorithm


the count of each itemset for each class. The learner then uses Fk−1 as a seed set to generate a new set of possibly frequent itemsets of size k, called the candidate set: if Fk−1 is not empty, the CandidateGen function generates the candidate set of size k, which is assigned to Ck (line 4).

In lines 5-11, the learner scans dataset D to calculate the counts of all itemsets in Ck. For each data example d in D, the SubSetGen function generates the set Cd of subsets of d that belong to Ck (line 7). In lines 8-10, the class count X.counti of each itemset X ∈ Cd is increased. After the class counting is completed, the PruneF function prunes infrequent and uninteresting itemsets from Ck using the interestingness measure, and the remainder are added to Fk (line 12). The interestingness measure is presented in Section 3.1.

In lines 13-17, the learner tests the independence of each itemset in Fk and the class label by using the chi-square independence test. If an itemset is independent of the class label (i.e. its X.xValue is less than the critical value δ of the chi-square independence test), the itemset is treated as a class-independent itemset. Since the classification phase may use class-independent itemsets in the denominator part of a product approximation, these itemsets cannot be pruned in the learning phase; instead, we mark them by assigning 0 to X.xValue. Last, all Fk are united into a set FI, which is returned for the classification phase to build a classifier for new data objects.
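The chi-square value in line 14 can be computed from a 2 × |C| contingency table (itemset present/absent versus class label). The following is a hedged sketch of that computation; the function and parameter names are my own, and the table layout is the standard one for an independence test, which the thesis does not spell out:

```python
def chi_square_value(x_counts, class_totals):
    """Chi-square statistic for independence of an itemset X and the
    class label, from a 2 x |C| contingency table.

    x_counts[i]     -- training examples containing X that have class ci
    class_totals[i] -- all training examples that have class ci
    """
    n = sum(class_totals)                    # |D|
    x_total = sum(x_counts)                  # examples containing X
    chi2 = 0.0
    for cnt, tot in zip(x_counts, class_totals):
        # two rows: X present (row total x_total), X absent (n - x_total)
        for observed, row_total in ((cnt, x_total), (tot - cnt, n - x_total)):
            expected = row_total * tot / n   # expected count under independence
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2
```

When an itemset occurs with the same relative frequency in every class, the statistic is 0 and the itemset would be marked class-independent by the test in line 15.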

3.4 Classification Algorithm

In Figure 3.6, we present the steps of classifying a new data object A by using the frequent, interesting, and class-related itemsets in FI. The goal of the classifier is to generate and compute the product approximation of P(A, ci) for each class ci.

Initially, all items in A are added into a set called the non-covered set (line 1). num and den denote the sets of itemsets in the numerator and denominator parts of the product approximation; both are empty initially (lines 2 and 3). In line 4, the SelectMFI function selects the maximal sub-itemsets of A and sorts them by their chi-square values. An itemset X is a maximal sub-itemset of A if there is no sub-itemset X′ of A (X′ ⊆ A) such that X ⊂ X′.


Algorithm Classifier

Input: the set FI discovered in the learning phase, and a new data object A

Output: the class of A

 1.  NonCov = A
 2.  num = ∅
 3.  den = ∅
 4.  AS = SelectMFI(FI)
 5.  for (k = 1; NonCov ≠ ∅; k++) do
 6.      t = the first itemset in AS;
 7.      if (t ∩ NonCov ≠ ∅) then
 8.          num = num ∪ {t};
 9.          den = den ∪ {t ∩ (A − NonCov)};
10.          NonCov = NonCov − {t}
11.      end
12.      AS = AS − {t}
13.  end
14.  output the class ci with maximal P(A, ci), computed as:
15.  P(A, ci) = P(ci) · ∏x∈num P(x, ci) / ∏y∈den P(y, ci)

Figure 3.6: The Classifier algorithm

In lines 5-13, the classifier selects itemsets from AS in turn until the non-covered set is empty. In each iteration, the classifier picks the first itemset t from AS (line 6) and checks whether t contains any item that is not yet covered (line 7). If t has at least one item not yet covered, t is inserted into the solution. Each product is the conditional probability of the non-covered items of t given the covered ones and the class ci, as follows:

P(Non-CoveredItems | CoveredItems, ci) = P(t, ci) / P(t ∩ (A − NonCov), ci)   (3.6)

In lines 8 and 9, the itemset t is added to the numerator part of the product approximation and the set of covered items of t is inserted into the denominator part. Whether or not t is added to the solution, it is removed from AS (line 12). Finally, the classifier computes P(A, ci) for each class ci and classifies A to the class with the highest probability.


3.5 Zero Counts Smoothing

Consider a conditional probability P(x|y, ci) = P(x, y, ci)/P(y, ci). A zero count exists when the numerator or denominator of the conditional probability equals zero (i.e. P(x, y, ci) = 0 or P(y, ci) = 0). If the numerator P(x, y, ci) = 0, the probability P(A, ci) becomes zero, which causes line 14 of the classification algorithm to yield zero regardless of the probabilities of the other attributes. On the other hand, if the denominator P(y, ci) = 0, the probability P(A, ci) is undefined. In order to eliminate the effect of zero counts, LB uses the standard zero-count smoothing method as follows.

P(X|Y) = (nc + mP) / (n + m)   (3.7)

where nc is the count of X ∩ Y, n is the count of Y, P is the probability of X, and m is a small smoothing value.

In our method, we also use the same smoothing to deal with zero counts. Instead of P(x|y, ci) = P(x, y, ci)/P(y, ci), the smoothed conditional probability is formulated as

P(x|y, ci) = (|D|P(x, y, ci) + n0 · P(x)) / (|D|P(y, ci) + n0)   (3.8)

where n0 is a small smoothing parameter.
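Since |D|·P(x, y, ci) and |D|·P(y, ci) are just the raw class counts maintained by the learner, Equation (3.8) can be implemented directly on counts. A minimal sketch under that assumption (names are my own):

```python
def smoothed_conditional(count_xyc, count_yc, p_x, n0=5):
    """Smoothed P(x | y, ci) per Equation (3.8).

    count_xyc -- |D| * P(x, y, ci), raw count of x together with y and class ci
    count_yc  -- |D| * P(y, ci),    raw count of y with class ci
    p_x       -- unconditional probability P(x)
    n0        -- small smoothing parameter (5 in the experiments)
    """
    return (count_xyc + n0 * p_x) / (count_yc + n0)
```

With both counts zero the estimate falls back to P(x), so the product approximation is never zeroed out or undefined by a single unseen combination.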


Chapter 4

Experimental Results and Discussion

In this chapter, we compare our method, ACC, with NB [2] and LB [7] using 13 datasets from the UCI Machine Learning Database Repository [8]. Since the attribute types must be discrete, we use datasets with categorical attributes whenever possible. All of the attribute types in the 13 datasets are categorical or finite integral. If an integral attribute is discretized into categories in the documentation of the dataset, we use these predefined categories. Otherwise, the finite integral attribute is treated as already discretized (i.e. each value is treated as one item). Figure 4.1 presents the datasets that we used. Column 3 (#Items) shows the number of items, which are mapped from non-class attribute-value pairs; the mapping method is described in Section 3.1.1. Column 6 (#Train) denotes the size of the training set or the total number of examples in the dataset. Column 7 (#Test) indicates the size of the testing set or the experiment method, 10-fold cross-validation. The details are presented next.

We evaluate the accuracy and the compactness of a classifier by using the holdout method or the 10-fold cross-validation method. The holdout method splits a dataset into two parts, a training set and a validation set (usually a 2:1 ratio of examples). The 10-fold cross-validation method partitions a dataset into 10 parts as equal as possible; the experiment for a dataset is then iterated 10 times, each time using 9 parts as the training set and 1 part as the testing set, and the accuracy measure is the average over the 10 folds. In our experiments, if a dataset is pre-partitioned into a training set and a testing set, we use the holdout method; otherwise, we use 10-fold cross-validation. In Figure 4.1, the datasets monks1, monks2, and monks3 (datasets 6-8) are measured by using the holdout


[Table: for each of the 13 datasets — balance, breast, car, chess, flare, monks1, monks2, monks3, nursery, post, TicTacToe, voting, zoo — the columns #Attributes, #Items, #Classes, #Missing Values, #Train, and #Test; the numeric entries are not recoverable from the extracted text.]

Figure 4.1: Description of datasets

method. The remaining datasets are measured by using 10-fold cross-validation, marked "CV10" in Column 7 (#Test), with the total number of examples shown in Column 6 (#Train).
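The 10-fold protocol described above can be sketched as follows. This is an illustrative Python sketch; the round-robin fold assignment is one reasonable choice of near-equal partition, not necessarily the one used in the thesis:

```python
def ten_fold_splits(examples, k=10):
    """Partition `examples` into k near-equal folds and yield
    (train, test) pairs, each fold serving once as the test set."""
    folds = [examples[i::k] for i in range(k)]   # round-robin assignment
    for i in range(k):
        train = [x for j in range(k) if j != i for x in folds[j]]
        yield train, folds[i]
```

The reported accuracy for a dataset is then the mean accuracy over the 10 (train, test) pairs.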

4.1 Parameter Setting

For association-based classifiers, the minimum support threshold and the zero-count smoothing parameter are set to the default values suggested in the literature: the minimum support threshold is fixed to 1% (or 5 occurrences for small datasets) and the zero-count smoothing parameter n0 is set to 5 for all datasets. The interestingness threshold was set to 0.04 in LB. However, this value is too large for most of the 13 datasets; it makes LB collapse to NB in 10 of them. Therefore, we vary the interestingness threshold from 0.005 to 0.03 for LB and ACC. The critical value δ of the chi-square independence test is determined by the p-value and the degrees of freedom. In ACC, the p-value is set to 0.05 for all datasets.
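For a 2 × |C| contingency table the degrees of freedom are (2−1)(|C|−1) = |C|−1, so δ can be read from a standard chi-square table at p = 0.05. A small sketch using the standard table values (the lookup-table form and function names are my own; extend the table for datasets with more classes):

```python
# Critical values of the chi-square distribution at p = 0.05,
# indexed by degrees of freedom (standard statistical tables).
CHI2_CRITICAL_P05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def critical_value(num_classes):
    """delta for an itemset-vs-class table: df = |C| - 1."""
    return CHI2_CRITICAL_P05[num_classes - 1]

def is_class_related(x_value, num_classes):
    # Itemsets whose statistic exceeds delta are kept as class-related.
    return x_value > critical_value(num_classes)
```

For a two-class dataset, for example, an itemset is retained as class-related only when its chi-square value exceeds 3.841.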


4.2 Experimental Results

Figure 4.2 shows the accuracy on the 13 datasets for interestingness threshold values varying from 0.005 to 0.03. Since no classification method can outperform all others in all possible domains [10], we compare the accuracy of the three classifiers from several perspectives. In Figure 4.2, the accuracy over the different interestingness thresholds is averaged in the "Average" column. The last column, "Winner", shows which classifier achieves the best average accuracy on a dataset. As the figure shows, ACC achieves the best average accuracy on 6 datasets, while NB and LB achieve the best average accuracy on 4 and 3 datasets, respectively.

Figure 4.3 provides the individual accuracies of the three methods under various interestingness thresholds. For each dataset, the most accurate classifier is marked with a shadow. The interestingness threshold increases from left to right in Figure 4.3. As the threshold increases, the results of LB and ACC become more similar to those of NB, until the three methods yield the same measures. As the figure shows, the number of best-accuracy cases for ACC is greater than or equal to that of NB and LB. This means the result is stable and not sensitive to the variation of the interestingness threshold.

Figure 4.4 shows an extreme case in which the minimum support threshold is omitted (i.e. set to zero for all datasets). The best accuracy is marked with a shadow. ACC is stable under different interestingness thresholds and achieves the best accuracy on most datasets.

4.3 Discussion

4.3.1 The Effect of Missing Value

Among the 13 datasets, two, Post and Voting, contain missing values. Post contains only 3 missing values, while Voting contains 392 missing values, affecting 5.6% of all data objects. In this section, we examine the effect of missing values on accuracy. Since the Voting dataset has only 5.6% missing values, we increase the percentage of missing values from 10% to 50% for each dataset. The minimum support and interestingness thresholds are fixed to 1% and 0.01. Figures 4.5 and 4.6 show the experimental results. In 7 datasets (Fig. 4.5 (a)-(d), Fig. 4.6 (i), and Fig. 4.6 (l)-(m)), the accuracies of
