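National Chiao Tung University
Department of Industrial Engineering and Management
Doctoral Dissertation
Granular Computing for Imbalanced Data: Theory and Applications
Student: Chen, Long-Sheng
Advisors: Prof. Su, Chao-Ton and Prof. Li, Rong-Kwei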
Granular Computing for Imbalanced Data: Theory and Applications
Student: Chen, Long-Sheng
Advisors: Su, Chao-Ton
Li, Rong-Kwei
A Dissertation
Submitted to the Department of Industrial Engineering and Management, College of Management,
National Chiao Tung University, in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
in Industrial Engineering and Management
March 2006
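Granular Computing for Imbalanced Data: Theory and Applications
Student: Chen, Long-Sheng
Advisors: Su, Chao-Ton
Li, Rong-Kwei
Doctoral Program, Department of Industrial Engineering and Management, National Chiao Tung University
ABSTRACT (CHINESE)
In recent years, advances in machine learning have provided an effective tool for classification problems. However, when learning from imbalanced data, traditional methods are inadequate at predicting minor examples. This problem matters because it occurs widely in environmental, life-related, and commercially important domains such as fraud detection, text mining, spam detection, medical diagnosis, and fault monitoring and inspection. In this dissertation, we propose a novel approach called "Granular Computing" to solve these "class imbalance problems".
Granular computing, which is oriented toward representing and processing information granules, is a computing paradigm that imitates the human instinct for processing information, and it is gradually becoming an important topic in information science, logic, philosophy, and other fields. When describing a problem that involves incomplete, uncertain, or vague information, people find it hard to consider detailed numerical data and are forced to consider "information granules": sets of individual elements grouped by their similarity, functional adjacency, or indistinguishability. Granular computing models not only remove unnecessary detail and reveal the essence of the data, but can also be used effectively to solve class imbalance problems.
The purpose of this research is to develop two granular computing models, "KAIG" and the "IG based method", to handle discrete and continuous data, respectively. In both models, two indexes, the H-index and the U-ratio, are introduced to determine a suitable level of granularity; in other words, they let us determine a suitable number of information granules. The Fuzzy ART neural network is used to construct information granules. In addition, in the KAIG model we propose the concept of "sub-attributes" to describe information granules and to resolve the overlap among granules. In the IG based method, we represent information granules by data characteristics. The main objectives of this research are as follows:
(1) Develop the KAIG model to construct information granules and to extract knowledge from them. Seven data sets from the UCI data bank (including one imbalanced diagnosis data set) are used to evaluate the effectiveness of the KAIG model. Under different performance measures (Overall Accuracy, G-mean, and ROC curve), the experimental results demonstrate the superiority of the proposed method over the decision tree (C4.5) and the Support Vector Machine.
(2) Apply the KAIG model to class imbalance problems in areas related to industrial engineering. First, the KAIG model is applied to improve the classification performance of a dynamic scheduling system in a simulated Flexible Manufacturing System environment. Second, a real case of cellular phone inspection shows that the KAIG model is highly capable of detecting very rare defective products. In addition, the KAIG model can reduce redundant test items and shorten inspection time. These two applications confirm that, for imbalanced data, the KAIG model greatly improves the ability to detect minor examples (Negative Accuracy) without reducing Overall Accuracy.
(3) Propose the IG based method to handle continuous imbalanced data. In this method, different data characteristics and their combinations are used to represent the constructed information granules, and a classifier is then built from these representatives. A real case of diabetes diagnosis is used to evaluate the effectiveness of the proposed method. Compared with traditional methods, the proposed method performs very well when learning from imbalanced data.
Keywords: granular computing, information granulation, class imbalance problems, Fuzzy ART neural network, knowledge acquisition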
Granular Computing for Imbalanced Data: Theory and Applications
Student: Chen, Long-Sheng
Advisors: Dr. Su, Chao-Ton
Dr. Li, Rong-Kwei
Department of Industrial Engineering and Management
National Chiao Tung University
ABSTRACT
In recent years, the development of machine learning techniques has provided an effective avenue for classification problems. However, when learning from imbalanced data, the traditional methods have poor predictive ability to identify minority instances. This problem is of crucial importance since it is encountered by a large number of domains of great environmental, vital or commercial importance such as fraud detection, text mining, spam detection, medical diagnosis and fault monitoring/inspection. In this study, we propose novel methods called “Granular Computing” models to tackle class imbalance problems.
Granular computing (GrC), which is oriented towards representing and processing Information Granules (IGs), is a computing paradigm that embraces a number of modeling frameworks. GrC imitates the human instinct for processing information and is becoming an important topic in computer science, logic, philosophy, and other fields. When describing a problem that involves incomplete, uncertain, or vague information, we human beings tend to shy away from numbers and use aggregates to ponder the question instead. We are forced to consider IGs, which are collections of entities arranged together due to their similarity, functional adjacency, and indistinguishability. A GrC model not only removes unnecessary details and provides better insight into the essence of the data, but also solves class imbalance problems effectively.
This study aims to develop two kinds of GrC models, the “Knowledge Acquisition via Information Granulation” (KAIG) model and the “Information Granules based method” (IG based method), for dealing with discrete and continuous data, respectively. In both models, the homogeneity index (H-index) and the undistinguishable ratio (U-ratio) are introduced to determine a suitable level of granularity (i.e., a suitable number of IGs). The Fuzzy Adaptive Resonance Theory (Fuzzy ART) neural network is utilized to construct IGs. In addition, in the KAIG model we propose the concept of “sub-attributes” to describe granules and to tackle the overlap among granules. In the IG based method, data characteristics are employed to represent IGs. The main objectives of this study are:
1. Develop a KAIG model to construct IGs and to discover knowledge from IGs. Seven data sets from the UCI data bank (including one imbalanced diagnosis data set) are used to evaluate the effectiveness of the KAIG model. Using different performance indexes (Overall Accuracy, G-mean, and ROC curve), the experimental results demonstrate the superiority of our method over C4.5 and the Support Vector Machine (SVM).
2. Apply the KAIG model to solve class imbalance problems in industrial engineering related areas. First, the KAIG model is utilized to improve the classification performance of a dynamic scheduling system within a simulated Flexible Manufacturing System environment. Second, a real case of cellular phone inspection is provided to illustrate the ability of the KAIG model to identify rare defective products. In addition, the KAIG model can reduce redundant test items and shorten inspection time. For imbalanced data, these applications show that the KAIG model can dramatically increase Negative Accuracy (the capability of detecting minor instances) without losing Overall Accuracy.
3. Propose the IG based method to deal with continuous imbalanced data. In this method, different data characteristics and their combinations are employed to denote the constructed IGs. We then build a classifier from these representatives of IGs. Actual medical diagnosis data for diabetes are used to evaluate the effectiveness of this method. Compared with traditional techniques, the proposed method is shown to be superior for learning from imbalanced data.
Key words: Granular computing, Information granulation, Class imbalance problems,
CONTENTS
ABSTRACT (CHINESE)
ABSTRACT
CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 Research Motivations
1.2 Research Objectives
1.3 Framework and Organization
CHAPTER 2 RELATED WORKS
2.1 Granular Computing
2.2 Class Imbalance Problems
2.3 Fuzzy ART Neural Network
2.4 Inductive Learning Methods
2.4.1 Decision Tree
2.4.2 Back-propagation Neural Network
2.4.3 Rough Sets
2.4.4 Support Vector Machines
2.5 Feature Selection From Imbalanced Data
CHAPTER 3 PROPOSED GRANULAR COMPUTING MODELS
3.1 Construction of Information Granules
3.2 Selection of Granularity
3.3 Representation of Information Granules
3.3.1 The Concept of “Sub-attributes”
3.3.2 Using Data Characteristics to Represent IGs
3.4 Proposed Methodologies
3.4.1 The KAIG Model
3.4.2 The IG Based Model
CHAPTER 4 NUMERICAL EXAMPLES
4.2 Implementation of KAIG Model
4.2.1 Illustrative Example
4.2.2 Evaluation of KAIG Model
4.2.3 Implementation in Imbalanced Data
4.2.4 Discussion and Concluding Remarks
4.3 Implementation of IG Based Method
4.3.1 Illustrative Example
4.3.2 Experimental Results
4.3.3 Discussion and Concluding Remarks
CHAPTER 5 APPLY KAIG MODEL TO BUILD A GRANULAR COMPUTING BASED SCHEDULING SYSTEM
5.1 Problem Description
5.2 A Granular Computing Based Scheduler
5.2.1 Information Granulation Mechanism
5.2.2 Inductive Learning Mechanism
5.3 Comparative Techniques
5.3.1 Cost Adjusting Method
5.3.2 Cluster Based Sampling Method
5.4 Implementation
5.4.1 Description of Simulated System
5.4.2 Using the Cost Adjusting Method
5.4.3 Using the Cluster Based Sampling Method
5.5 Experimental Results
5.6 Discussion and Concluding Remarks
CHAPTER 6 APPLY KAIG MODEL TO SHORTEN THE CELLULAR PHONE TEST PROCESS
6.1 Problem Description
6.2 Proposed Feature Selection Procedure
6.3 Case Study
6.3.1 The Problem
6.3.2 Data Collection
6.3.3 Data Preparation
6.3.5 Feature Selection and Knowledge Acquisition
6.4 The Benefits
6.5 Discussion and Concluding Remarks
CHAPTER 7 APPLY IG BASED METHOD TO ENHANCE THE DIABETES DIAGNOSIS ABILITY
7.1 Problem Description
7.2 Data Collection
7.3 Implementation of IG Based Method
7.4 Discussion and Concluding Remarks
CHAPTER 8 DISCUSSIONS AND CONCLUSIONS
8.1 Summary
8.2 Further Research
LIST OF TABLES
Table 3.1 The information granule: iris example
Table 3.2 The undistinguishable granule
Table 3.3 Two IGs represented by hyperbox form
Table 3.4 The IGs with sub-attributes
Table 4.1 Confusion matrix for binary class problem
Table 4.2 The IGs with the similarity of 0.55
Table 4.3 The IGs with sub-attributes
Table 4.4 The minimal reduct of IGs for testing
Table 4.5 A comparison of processing with information granules and numerical data
Table 4.6 The background of five data sets
Table 4.7 The setting of parameters in neural network (BP)
Table 4.8 The comparison of classification performance
Table 4.9 The results in different proportion of minor class examples
Table 4.10 An illustrative example of IG based method
Table 4.11 Data background (UCI)
Table 4.12 The experimental results of IG based method
Table 4.13 The experimental results of sampling methods
Table 5.1 The experimental results of FMS simulated data
Table 6.1 Test items of the RF function
Table 6.2 The information granules described as hyperbox form
Table 6.3 The IGs with addition of sub-attributes
Table 6.4 The implementation results by rough sets
Table 6.5 The implementation results by decision tree (C4.5)
Table 6.6 The implementation results by BPNN (full attributes)
Table 7.1 Attributes
Table 7.2 Using Q1+Q2+Q3 to describe IGs
LIST OF FIGURES
Figure 1.1 Research framework
Figure 2.1 An information-processing pyramid
Figure 2.2 Topological structure of the Fuzzy-ART
Figure 2.3 The back-propagation neural network structure
Figure 2.4 The operation of Support Vector Machine
Figure 3.1 The overlap between IGs
Figure 3.2 Knowledge Acquisition via Information Granulation (KAIG) model
Figure 4.1 The H-index and U-ratio of the iris data
Figure 4.2 The performance of classification (Iris data)
Figure 4.3 The H-indexes and U-ratios of five data sets
Figure 4.4 ROC curves of pima-indian-diabetes data
Figure 4.5 H-index and U-ratio of Haberman’s survival data
Figure 4.6 Overall accuracies of different strategies (UCI)
Figure 4.7 G-mean of different strategies (UCI)
Figure 4.8 Comparison of the proposed IG based, cluster-based sampling, under-sampling, DT, and SVM
Figure 5.1 A Granular Computing based scheduler
Figure 5.2 Illustration of cluster based sampling method
Figure 5.3 Overall Accuracy, Positive Accuracy, Negative Accuracy & G-mean of SVM with different costs
Figure 5.4 Overall Accuracy, Positive Accuracy, Negative Accuracy & G-mean of cluster based sampling with different proportions (majority: minority)
Figure 5.5 The comparison of classification performances
Figure 5.6 ROC curves of FMS data
Figure 6.1 A manufacturing process of cellular phone
Figure 6.2 Basic idea of the proposed methodology
Figure 7.1 H-index & U-ratio of diabetes diagnosis data
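ACKNOWLEDGEMENTS
First, I would like to thank my advisor, Professor Su, Chao-Ton. Although he always says, "I did not help much," I know in my heart that I could not have earned this degree without his guidance: from writing and submitting papers, attending international conferences, and Six Sigma certification to the application for the 千里馬計畫 program; from coaching on writing and presentation skills and advice on career planning to passing on and sharing his life experience. From him I learned a resolute, persevering attitude toward research, which has set me on a broad road in life. During the oral defense, the committee members, Professor Li, Rong-Kwei, Professor 洪瑞雲, Professor 溫于平 of Tsing Hua University, Professor 駱景堯 of Da-Yeh University, and Professor 陳穆臻 of National Taipei University of Technology, generously offered their comments, which made this dissertation more complete. I also thank Professors Li, Rong-Kwei and 洪瑞雲 for their help, which allowed me to complete my studies. In addition, during my visit to Purdue University, I received careful guidance from Professor Y. Yih (易玥文) of the School of Industrial Engineering, and help in daily life from 朝龍, 國浩, 宗翰, Sandra, Steven, and other students from Taiwan; I extend my deepest gratitude to them as well.
Three years and more than six months is a dreary stretch of time. Without the company, mutual support, and encouragement of my seniors 許志華, 許俊欽, and 劉正祥 and my labmates 楊健炘, 黃榮輝, 楊宗銘, 林敬森, 周家任, 蕭宇翔, and 彭加景, I could hardly have made it to this moment. Thank you for everything you have done for me; I wish you all peace and success.
Finally, I dedicate this dissertation to my parents and family. Without your quiet support and encouragement, I would not be who I am today. Special thanks go to my wife, Ms. 林昕怡, who never lost faith in me during my hardest moments and gave me the courage to persevere. From now on, I will do my utmost to contribute my modest efforts to society and pass this gratitude on.
Chen, Long-Sheng
Mar. 16, 2006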
CHAPTER 1
INTRODUCTION
1.1 Research Motivations
When learning from imbalanced (skewed) data, in which almost all instances are labeled as one class while far fewer instances are labeled as the other class, traditional machine learning algorithms such as Neural Networks (NN), Decision Trees (DT), and Support Vector Machines (SVM) tend to produce high accuracy over the majority class but poor predictive accuracy over the minority class. The minority class is usually the important one, such as ill patients in medical diagnosis data or abnormal products in finished-goods inspection data. This study addresses these class imbalance problems, which are caused by a skewed data distribution.
There are two motivations for proposing Granular Computing (GrC) to tackle class imbalance problems. The first is human instinct (Zadeh, 2001). As human beings, we have developed a granular view of the world. When describing a problem or making decisions, we tend to shy away from numbers and use aggregates to ponder the question instead. This is especially true when a problem involves incomplete, uncertain, or vague information. It may sometimes be difficult to differentiate distinct elements, so one is forced to consider “information granules” (IGs), which are collections of entities arranged together due to their similarity, functional adjacency, and indistinguishability (Bargiela and Pedrycz, 2003; Castellano and Fanelli, 2001; Yao and Yao, 2002; Zadeh, 1979). A typical example is the theory of rough sets (Walczak and Massart, 1999).
The process of constructing IGs is referred to as information granulation. This was first pointed out in the pioneering work of Zadeh (1979), who coined the term ‘information granulation’ and emphasized that a plethora of details does not necessarily amount to knowledge. Granulation serves as an abstraction mechanism for reducing the overall conceptual burden. The essential factor driving the granulation of information is the need to comprehend the problem and gain better insight into its essence, rather than get buried in unnecessary details. By changing the size of the IGs, we can hide or reveal more or fewer details (Bargiela and Pedrycz, 2003). GrC is oriented towards the representation and processing of IGs.
The second motivation concerns the behavior of data. In many practical data sets, such as medical diagnosis, inspection, fault monitoring, and fraud detection data, the normal group and the abnormal group are considered separate populations. Taguchi and Jugulum (2002) argued that every abnormal condition (a condition outside the “healthy” group) should be considered unique, since each such condition occurs differently. They illustrate this view with Tolstoy’s line from Anna Karenina: “All happy families look alike. Every unhappy family is unhappy after its own fashion” (Taguchi and Jugulum, 2002). In practical data, we can clearly observe that members of the normal group (e.g., healthy patients, good products) look alike, while members of the abnormal group (e.g., sick patients, defective products) are unique. If we construct IGs by the similarity of numerical data, the number of IGs in the normal group will be remarkably smaller than the number of normal numerical records. In other words, considering IGs instead of numerical data may increase the proportion of abnormal data and ease the imbalanced (skewed) situation.
1.2 Research Objectives
The purpose of this study is to develop two Granular Computing models for imbalanced (skewed) data. Both models extract knowledge from IGs; one is developed for discrete data and the other for continuous data. The main issue to tackle is how to measure and represent IGs so that knowledge can be acquired from them. In this study, we use the Fuzzy Adaptive Resonance Theory (Fuzzy ART) neural network to construct IGs. Two indexes, the homogeneity index (H-index) and the undistinguishable ratio (U-ratio), are presented to measure IGs. In the first proposed model, called “Knowledge Acquisition via Information Granulation” (KAIG), the concept of “sub-attributes” is presented to describe granules and to tackle the overlap among granules. In the second proposed approach, called the “Information Granules based method” (IG based method), we use different data characteristics, such as the mean, median, quartiles, minimum, maximum, and combinations of them, to represent IGs. We then extract knowledge from these IGs.
The KAIG model is designed for discrete imbalanced data. We evaluate the KAIG model on UCI data and compare it with traditional knowledge acquisition algorithms that operate on numerical data. In addition, the KAIG model is applied to class imbalance problems in a dynamic scheduling problem within a simulated Flexible Manufacturing System. This study also develops a feature selection procedure that integrates the proposed KAIG model to find key test items and shorten inspection time. A real case of mobile phone inspection in Taiwan is used to evaluate the effectiveness of the proposed procedure, and we show the advantages and benefits of this procedure.
The IG based method is proposed to deal with continuous imbalanced data. The experimental results are compared with the cluster-based sampling method and the original machine learning techniques. Finally, actual medical diagnosis data for diabetes are employed to illustrate the superiority of our method.
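The idea of representing an IG by data characteristics can be illustrated with a minimal sketch. It assumes that a granule is simply a group of similar numeric records and that its representative is the concatenation of per-attribute summaries; the function name `describe_granule` and the string keys for the characteristics are hypothetical, and the exact representation used by the IG based method is the one defined in Chapter 3, not this sketch.

```python
import numpy as np

def describe_granule(members, characteristics=("q1", "median", "q3")):
    """Summarize one information granule by per-attribute statistics.

    members: 2-D array-like, one row per record in the granule and one
        column per attribute (assumed layout).
    characteristics: names of the summaries to concatenate (hypothetical keys).
    """
    members = np.asarray(members, dtype=float)
    summaries = {
        "mean": members.mean(axis=0),
        "median": np.median(members, axis=0),
        "min": members.min(axis=0),
        "max": members.max(axis=0),
        "q1": np.percentile(members, 25, axis=0),  # first quartile
        "q3": np.percentile(members, 75, axis=0),  # third quartile
    }
    # One representative vector per granule replaces all of its member records.
    return np.concatenate([summaries[name] for name in characteristics])
```

A classifier would then be trained on one such vector per granule rather than on every record, which is how granulation can reduce the number of majority-class training points.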
1.3 Framework and Organization
In practical applications of machine learning (or data mining), class imbalance problems are an emerging issue. According to the available literature, sampling and moving the decision threshold are widely used methods for tackling this problem. This study develops two kinds of GrC models, GrC itself being a new topic in information processing, to solve class imbalance problems. The developed models are compared with previous techniques. The research framework is shown in Figure 1.1.
This study is organized as follows. Chapter 1 presents the research motivations and objectives. Chapter 2 reviews the literature on granular computing, class imbalance problems, inductive learning, and feature selection techniques. Chapter 3 proposes two GrC methodologies: we use the Fuzzy ART neural network to construct IGs, present the H-index and U-ratio to determine a suitable level of granularity, and develop the concepts of “sub-attributes” and “data characteristics” to describe IGs. In Chapter 4, several data sets from the UCI machine learning group are used to illustrate and evaluate the effectiveness of our methodologies. Chapter 5 describes the application of the KAIG model to a dynamic scheduling system within a simulated FMS. In Chapter 6, we develop a KAIG model based feature selection procedure to reduce test items and shorten inspection time in mobile phone manufacturing. Chapter 7 provides a case study of diabetes diagnosis using the IG based method. Finally, conclusions and future work are described in Chapter 8.
Figure 1.1 Research framework
[Flowchart. Related work: Granular Computing (Section 2.1; Zadeh, 1979 & 2001; Lin, 1998; Yao, 1999; Castellano and Fanelli, 2001; Yao and Yao, 2002; Bargiela and Pedrycz, 2003), with the sub-topics granularity selection, IG construction (Carpenter et al., 1991; Burke and Kamal, 1995; Serrano-Gotarredona et al., 1998), and IG representation (Bargiela and Pedrycz, 2003); Class Imbalance Problems (Section 2.2), covering sampling by under-sampling and over-sampling (Chawla et al., 2002; Batista et al., 2004; Guo & Viktor, 2004), cluster-based sampling (Altincay & Ergun, 2004), moving the decision threshold (Huang et al., 2004; Jo & Japkowicz, 2004), and cost-adjusting or weight-varying (Cristianini & Shawe-Taylor, 2000); Inductive Learning (Section 2.4). Knowledge acquisition from IGs: the KAIG model (Section 3.4.1) for discrete data and the IG based method (Section 3.4.2) for continuous data, built on IG construction with the Fuzzy ART neural network (Section 3.1), granularity selection with the H-index & U-ratio (Section 3.2), and IG representation (Section 3.3) by sub-attributes (Section 3.3.1) or data characteristics (Section 3.3.2). Applications: dynamic scheduling problem (Chapter 5), mobile phone inspection (Chapter 6), and diabetes diagnosis.]
CHAPTER 2
RELATED WORKS
2.1 Granular Computing
Humans have a remarkable capability to perform a wide variety of physical and mental tasks without any measurements or computations, such as driving, parking, cooking, and playing computer games. We use perceptions of direction, speed, time, and other attributes of physical and mental objects instead of numerical data. Reflecting the limited ability of the human brain, perceptions are inaccurate. In more concrete terms, perceptions are granular: the boundaries of perceived classes are unsharp, and the values of attributes are granulated (Zadeh, 2001). For example, the granules of temperature might be labeled very cold, cold, warm, hot, very hot, etc. The computational theory of perceptions (CTP) is inspired by this human ability, and GrC is one of the research areas related to CTP.
GrC is quickly becoming an emerging conceptual and computing paradigm of information processing (Bargiela and Pedrycz, 2003). It is a superset of the theory of fuzzy information granulation, rough set theory, and interval computations, and a subset of granular mathematics. GrC is knowledge-oriented, whereas numeric computing is data-oriented. The main issues of granular computing (Castellano and Fanelli, 2001) are how to construct IGs and how to describe them. A further question is how to determine the level of granularity. To acquire knowledge from IGs, we must answer these three questions, which are discussed in Sections 3.1 to 3.3.
[...] Self-Organizing Map (SOM) network (Bortolan and Pedrycz, 2002), Fuzzy C-means (FCM) (Castellano and Fanelli, 2001; Bargiela and Pedrycz, 2003b), rough sets, and shadowed sets (Bargiela and Pedrycz, 2003a) have been used to do this. Because IGs exist at different levels of granularity, we usually group granules of similar “size” (that is, granularity) in a single layer. If more detailed processing is required, smaller IGs are selected. Figure 2.1 illustrates this concept of granularity. At the lowest level, we are concerned with numeric processing. This domain is completely taken over by numeric models, such as differential equations, regression models, and neural networks. At the intermediate level, we see larger IGs (viz. those embracing more individual elements). The top level is solely devoted to symbol-based processing and as such invokes well-known concepts such as Petri nets and qualitative simulation (Bargiela and Pedrycz, 2003a).
On the issues of representing IGs and determining the level of granularity, Bargiela and Pedrycz (2002) proposed the “hyperbox” and “inclusion & compatibility” to measure IGs. However, these studies address how to construct, describe, and measure IGs separately. We need an integrated mechanism that imitates the human ability to process information, such as extracting knowledge from IGs and making decisions based on them.
Figure 2.1 An information-processing pyramid (Bargiela & Pedrycz, 2003); the vertical axis is granularity.
2.2 Class Imbalance Problems
Learning from imbalanced (skewed) data is an important topic that arises very often in practice. In such data, one class may be represented by a large number of examples while the other is represented by only a few. Many real-world applications have this characteristic, such as fraud detection, text classification (Chawla et al., 2002, 2004), telecommunications management, oil spill detection, risk management, medical diagnosis and monitoring, financial analysis of loan policy or bankruptcy (Batista et al., 2004; Chawla et al., 2004; Grzymala-Busse et al., 2004), and protein data (Provost and Fawcett, 2001). Traditional classifiers, which seek accurate performance over the full range of instances, are not suitable for imbalanced learning tasks (Batista et al., 2004; Chawla et al., 2004; Guo and Viktor, 2004; Japkowicz and Stephen, 2002), since they tend to classify all data into the majority class, which is usually the less important class. Therefore, these traditional algorithms often produce high accuracy over the majority class but poor predictive accuracy over the minority class.
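The gap between overall accuracy and minority-class accuracy can be made concrete with a small sketch. It assumes the standard definitions: per-class accuracy is the fraction of a class's examples that are classified correctly, and G-mean is the geometric mean of the two per-class accuracies. The function name and dictionary keys are hypothetical; in this dissertation's terminology, the minority-class accuracy corresponds to Negative Accuracy.

```python
from math import sqrt

def imbalance_metrics(y_true, y_pred, minority_label):
    """Overall accuracy, per-class accuracies, and G-mean for a two-class problem.

    Assumes both classes appear in y_true; otherwise a per-class
    accuracy would divide by zero.
    """
    pairs = list(zip(y_true, y_pred))
    minority = [(t, p) for t, p in pairs if t == minority_label]
    majority = [(t, p) for t, p in pairs if t != minority_label]
    overall = sum(t == p for t, p in pairs) / len(pairs)
    minority_acc = sum(t == p for t, p in minority) / len(minority)
    majority_acc = sum(t == p for t, p in majority) / len(majority)
    return {
        "overall_accuracy": overall,
        "majority_accuracy": majority_acc,
        "minority_accuracy": minority_acc,
        # Geometric mean of the two per-class accuracies.
        "g_mean": sqrt(majority_acc * minority_acc),
    }
```

For 95 normal and 5 defective items, a classifier that labels everything as normal reaches an overall accuracy of 0.95 while its minority accuracy and G-mean are both 0, which is why overall accuracy alone is a misleading measure on skewed data.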
To cope with imbalanced data sets, several methods have been proposed in the literature. They fall into two major groups. The first group consists of supervised techniques, which usually include five approaches: (1) under-sampling, in which the minority population is kept intact while the majority population is under-sampled; (2) over-sampling, in which the minority examples are over-sampled so that the desired class distribution is obtained in the training set (Batista et al., 2004; Chawla et al., 2002; Guo and Viktor, 2004); (3) cluster based sampling, in which representative examples are randomly sampled from clusters (Altincay and Ergun, 2004); (4) moving the decision threshold, in which the decision threshold is adapted to impose a bias toward the minority class (Chawla et al., 2002; Huang et al., 2004; Jo and Japkowicz, 2004); and (5) adjusting cost matrices, in which prediction accuracy is improved by adjusting the cost (weight) of each class (Cristianini and Shawe-Taylor, 2000).
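The first two approaches, under-sampling and over-sampling, can be sketched as follows. This is a minimal illustration of plain random sampling, not the specific procedures of the cited papers; the function names, the `ratio` parameter (target majority-to-minority ratio), and the fixed `seed` are assumptions introduced here.

```python
import random

def under_sample(majority, minority, ratio=1.0, seed=0):
    """Keep the minority class intact and randomly keep about
    ratio * len(minority) majority examples (without replacement)."""
    rng = random.Random(seed)
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    return rng.sample(list(majority), n_keep) + list(minority)

def over_sample(majority, minority, ratio=1.0, seed=0):
    """Keep the majority class intact and replicate minority examples
    (with replacement) until there are about len(majority) / ratio of them."""
    rng = random.Random(seed)
    minority = list(minority)
    n_target = max(len(minority), int(round(len(majority) / ratio)))
    extra = [rng.choice(minority) for _ in range(n_target - len(minority))]
    return list(majority) + minority + extra
```

With 95 majority and 5 minority examples and `ratio=1.0`, under-sampling returns 10 examples and over-sampling returns 190. The first discards 90 majority examples and the second adds 90 replicated minority examples.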
The second large class of techniques for detecting rare events uses an unsupervised framework, i.e., outlier detection or one-class classification (Manevitz and Yousef, 2001). Initially, minority examples are completely ignored and a model is trained using all examples from the majority class (the target class). Outliers are then detected as data points with a low probability of occurrence or a small number of neighboring examples. In addition, SVM is often used to tackle the class imbalance problem (Wu and Chang, 2005).
These techniques have some disadvantages (Altincay and Ergun, 2004). For example, with over-sampling the computational load increases and overtraining may occur because of the replicated samples. Under-sampling does not use all available training data, which means a loss of available information. Huang et al. (2004) argued that these supervised methods lack a rigorous and systematic treatment of imbalanced data. Moreover, one-class classification methods consider only majority examples and may miss useful decision information carried by minority examples.
2.3 Fuzzy ART Neural Network
Fuzzy ART is a clustering technique and the most recent adaptive resonance framework; it provides a unified architecture for both binary and continuous-valued inputs. Fuzzy ART clusters vectors based on two separate distance criteria, match and choice. For input vector $I$ and category $j$, the match function is defined by

$$S_j(I) = \frac{|I \wedge w_j|}{|I|} \tag{2.1}$$

where $w_j$ is an analog-valued weight vector associated with cluster $j$, $\wedge$ denotes the fuzzy AND operator, $(p \wedge q)_i = \min(p_i, q_i)$, and the norm $|\cdot|$ is defined by $|p| = \sum_i |p_i|$. The choice function is defined by

$$T_j(I) = \frac{|I \wedge w_j|}{\alpha + |w_j|} \tag{2.2}$$

where $\alpha$ is a small constant. Increasing $\alpha$ biases the search more towards clusters with large $w_j$. Each input vector is assigned to the category that maximizes $T_j(I)$ while satisfying $S_j(I) \ge \rho$, where the vigilance $\rho$ is a constant, $0 \le \rho \le 1$. The topological structure of the Fuzzy ART architecture is shown in Figure 2.2.
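The assignment rule built from the match and choice functions can be sketched in a few lines. The sketch covers only category selection for fixed weight vectors. It leaves out input preprocessing, weight learning, and the creation of new categories, and the function names and default `alpha` are assumptions.

```python
import numpy as np

def l1_norm(p):
    """Norm used by Fuzzy ART: sum of absolute components."""
    return float(np.sum(np.abs(p)))

def match(I, w):
    """Match function S_j(I) = |I ^ w_j| / |I|, with ^ the component-wise minimum."""
    return l1_norm(np.minimum(I, w)) / l1_norm(I)

def choice(I, w, alpha=0.001):
    """Choice function T_j(I) = |I ^ w_j| / (alpha + |w_j|)."""
    return l1_norm(np.minimum(I, w)) / (alpha + l1_norm(w))

def select_category(I, weights, rho, alpha=0.001):
    """Return the index of the category that maximizes T_j(I) among those
    with S_j(I) >= rho, or None if no category passes the vigilance test."""
    best_j, best_T = None, -1.0
    for j, w in enumerate(weights):
        if match(I, w) >= rho:
            T = choice(I, w, alpha)
            if T > best_T:
                best_j, best_T = j, T
    return best_j
```

A higher vigilance `rho` makes the match test stricter, so fewer inputs fit an existing category; in a full Fuzzy ART network an input that returns `None` here would typically start a new category, which yields more and finer granules.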
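[Figure 2.2 Topological structure of the Fuzzy-ART: F1 input units x1 ... xN receiving inputs I1 ... IN, F2 cluster units y1 ... yM, connection weights w_ij and w_ji, choice values T_j, and a comparator on |X| and ρ|I| that issues a reset.]
2.4 Inductive Learning Methods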
2.4.1 Decision Tree
The decision tree method is one of the most popular knowledge acquisition algorithms and has been applied successfully in many areas. Decision tree algorithms such as ID3 and C4.5 were originally intended for classification. The core of C4.5 is recursive partitioning of the training examples. Whenever a node is added to the tree, a subset of the input features is considered for the logical test at that node, and the feature that yields the maximum information gain is selected. In other words, at each node the algorithm chooses the “best” attribute for partitioning the data into individual classes. Once the test has been determined, it is used to partition the examples, and the process continues recursively until each subset contains examples of one class or satisfies some statistical criterion (Su and Shiue, 2003).
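The attribute selection step described above can be illustrated with a short sketch of information gain for discrete features. It follows the textbook entropy-based definition and is not the See5 or C4.5 implementation; representing each example as a dictionary and the function names are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting
    the examples on the values of one discrete feature."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def best_feature(rows, labels, features):
    """Feature with the maximum information gain at the current node."""
    return max(features, key=lambda f: information_gain(rows, labels, f))
```

Applying `best_feature` to the examples at a node, splitting on the chosen feature, and repeating on each subset gives the recursive partitioning described in the text.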
In this study, See5 (the commercial version of C4.5) was used to construct decision trees. See5 has two parameters that can be tuned during the pruning phase: the minimum number of examples represented at any branch of any feature-value test, and the confidence level for pruning. To avoid overfitting and to generate a simple tree, the minimum number of instances at each leaf was set to 2 and the confidence level for pruning was set to 25%.
2.4.2 Back-propagation Neural Network
Neural networks have been used widely in pattern recognition, function approximation, optimization, and clustering. Generally speaking, they can be classified into two categories, feed-forward and feedback networks. In this study, the feed-forward network shown in Figure 2.3 was employed because of its superior classification ability.
The back-propagation learning algorithm (Rumelhart & McClelland, 1986) is the best known training algorithm for neural networks and still one of the most useful. This iterative gradient algorithm is designed to minimize the mean square error between the actual output of a multilayer feed-forward perceptron and the desired output. According to rules of thumb and published reports, the number of hidden layers should be one or two. The algorithm includes a forward pass and a backward pass. The forward pass obtains the activation values, and the backward pass adjusts the weights and biases according to the difference between the desired and actual network outputs. The two passes are repeated until the network converges. Training a feed-forward network by back-propagation can be summarized in the following steps:
Step 1: Select an architecture
Step 2: Randomly initialize weights
Step 3: While error is too large
  For each training pattern (presented in random order)
  Step 3.1: Select training pattern and feedforward to find actual network output
    Step 3.1.1: Apply the inputs to the network
    Step 3.1.2: Calculate the output for every neuron from the input layer, through the hidden layer(s), to the output layer
      The output from neuron j for pattern p is $O_{pj}$, where
      $$O_{pj}(net_j) = \frac{1}{1 + e^{-net_j}} \qquad (2.3)$$
      and
      $$net_j = bias + \sum_{k} W_{jk} O_{pk} \qquad (2.4)$$
      where $W_{jk}$ is the weight on the connection from input k to neuron j.
  Step 3.2: Calculate errors and backpropagate error signals
    Step 3.2.1: Calculate the error at the outputs
      The output neuron error signal $\delta_{pj}$ is given by
      $$\delta_{pj} = (T_{pj} - O_{pj}) \times O_{pj} \times (1 - O_{pj}) \qquad (2.5)$$
      where $T_{pj}$ is the target value of output neuron j for pattern p and $O_{pj}$ is the actual output value of output neuron j for pattern p.
    Step 3.2.2: Use the output error to compute error signals for pre-output layers
      The hidden neuron error signal $\delta_{pj}$ is given by
      $$\delta_{pj} = O_{pj} (1 - O_{pj}) \sum_{k} \delta_{pk} W_{kj} \qquad (2.6)$$
      where $\delta_{pk}$ is the error signal of a post-synaptic neuron k and $W_{kj}$ is the weight of the connection from hidden neuron j to the post-synaptic neuron k.
  Step 3.3: Adjust weights
    Step 3.3.1: Use the error signals to compute weight adjustments
      Compute weight adjustments $\Delta W_{ji}$ at time t by
      $$\Delta W_{ji}(t) = \eta \times \delta_{pj} \times O_{pi} + \alpha \times \Delta W_{ji}(t-1) \qquad (2.7)$$
      where $\eta$ is the learning rate and $\alpha$ is the momentum coefficient ($\alpha \in [0,1]$).
    Step 3.3.2: Apply the weight adjustments
      Apply weight adjustments according to
      $$W_{ji}(t+1) = W_{ji}(t) + \Delta W_{ji}(t) \qquad (2.8)$$
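The steps above can be sketched in a few lines of Python. This is a minimal illustration for a network with one hidden layer, not the implementation used in this study. The function names, the fixed starting weights in `make_net`, the values of `eta` and `alpha`, and the bias update rule (learning rate times error signal, without momentum) are all assumptions made for the example.

```python
import math

def sigmoid(net):
    """Logistic activation, Eq. (2.3)."""
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, W_h, b_h, W_o, b_o):
    """Forward pass: Eq. (2.4) followed by Eq. (2.3), layer by layer."""
    hidden = [sigmoid(b + sum(w * xi for w, xi in zip(row, x)))
              for row, b in zip(W_h, b_h)]
    out = [sigmoid(b + sum(w * h for w, h in zip(row, hidden)))
           for row, b in zip(W_o, b_o)]
    return hidden, out

def train_pattern(x, target, net, eta=0.5, alpha=0.5):
    """One forward and one backward pass for a single pattern; returns the output
    computed before the weights were adjusted."""
    W_h, b_h, W_o, b_o, dW_h, dW_o = net
    hidden, out = forward(x, W_h, b_h, W_o, b_o)
    # Eq. (2.5): error signals of the output neurons.
    d_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, out)]
    # Eq. (2.6): error signals of the hidden neurons (uses the old output weights).
    d_hid = [h * (1.0 - h) * sum(d_out[k] * W_o[k][j] for k in range(len(W_o)))
             for j, h in enumerate(hidden)]
    # Eq. (2.7) and (2.8): momentum update of both weight layers.
    for deltas, inputs, W, b, dW in ((d_out, hidden, W_o, b_o, dW_o),
                                     (d_hid, x, W_h, b_h, dW_h)):
        for j, delta in enumerate(deltas):
            for i, o_i in enumerate(inputs):
                dW[j][i] = eta * delta * o_i + alpha * dW[j][i]
                W[j][i] += dW[j][i]
            b[j] += eta * delta  # assumed bias rule, not given in the text
    return out

def make_net():
    """Hypothetical 2-2-1 network with fixed starting weights (for repeatability)."""
    W_h = [[0.1, -0.2], [0.3, 0.4]]
    W_o = [[0.2, -0.1]]
    return (W_h, [0.0, 0.0], W_o, [0.0],
            [[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0]])
```

Calling `train_pattern` repeatedly over the training patterns, in random order, until the error is small enough corresponds to the loop in Step 3.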
2.4.3 Rough Sets
Rough sets theory was introduced by Pawlak (1985) to deal with imprecise or vague concepts (Swiniarski and Skowron, 2003; Walczak and Massart, 1999). Rough sets deal with information represented by a table called the information system, which contains objects and attributes. An information system is a 4-tuple:
$$S = \langle U, Q, V, f \rangle \qquad (2.9)$$
where U is the universe, a finite set of N objects $\{x_1, x_2, \ldots, x_N\}$, Q is a finite set of attributes, $V = \bigcup_{q \in Q} V_q$, where $V_q$ is the set of values of attribute q, and $f: U \times Q \rightarrow V$ is the total decision function called the information function such that $f(x,q) \in V_q$ for every $q \in Q$, $x \in U$. For a given subset of attributes $A \subseteq Q$, the relation
$$IND(A) = \{(x,y) \in U \times U : \text{for all } a \in A,\ f(x,a) = f(y,a)\} \qquad (2.10)$$
is an equivalence relation on universe U (called an indiscernibility relation).
Figure 2.3 The back-propagation neural network structure
Some information systems can be designed as a decision table
$$\text{Decision table} = \langle U, C \cup D, V, f \rangle \qquad (2.11)$$
where C is a set of condition attributes, D is a set of decision attributes, $V = \bigcup_{q \in C \cup D} V_q$, where $V_q$ is the set of values of attribute $q \in C \cup D$, and $f: U \times (C \cup D) \rightarrow V$ is a total decision function (decision rule in a decision table) such that $f(x,q) \in V_q$ for every $q \in C \cup D$ and $x \in U$.
For a given information system S, a given subset of attributes $A \subseteq Q$ determines the approximation space $AS = (U, IND(A))$ in S. For a given $A \subseteq Q$ and $X \subseteq U$ (a concept X), the A-lower approximation $\underline{A}X$ of set X in AS and the A-upper approximation $\overline{A}X$ of set X in AS are defined as follows:
$$\underline{A}X = \{x \in U : [x]_A \subseteq X\} = \bigcup \{Y \in A^* : Y \subseteq X\} \qquad (2.12)$$
$$\overline{A}X = \{x \in U : [x]_A \cap X \neq \emptyset\} = \bigcup \{Y \in A^* : Y \cap X \neq \emptyset\} \qquad (2.13)$$
where $A^*$ denotes the set of all equivalence classes of IND(A) and $[x]_A$ is the equivalence class containing x. The process of finding a set of attributes smaller than the original one with the same classificatory power as the original set is called attribute reduction. A reduct is the essential part of an information system (a subset of attributes) which can discern all objects discernible by the original information system. By means of the dependency properties of the attributes we can find a reduced set of attributes, provided that removing the superfluous attributes causes no loss in classification accuracy.
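Equations (2.10), (2.12) and (2.13) translate directly into set operations. The following Python sketch is only an illustration under simple assumptions: the information system is stored as a dictionary that maps each object name to a dictionary of attribute values, and the function names and the small example table in the usage note are hypothetical.

```python
from collections import defaultdict

def equivalence_classes(table, attrs):
    """Partition the objects by IND(A), Eq. (2.10): same values on every attribute in attrs."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())

def lower_approximation(table, attrs, concept):
    """Eq. (2.12): union of the equivalence classes fully contained in the concept."""
    result = set()
    for cls in equivalence_classes(table, attrs):
        if cls <= concept:
            result |= cls
    return result

def upper_approximation(table, attrs, concept):
    """Eq. (2.13): union of the equivalence classes that intersect the concept."""
    result = set()
    for cls in equivalence_classes(table, attrs):
        if cls & concept:
            result |= cls
    return result
```

For a table in which objects x1 and x2 share the same attribute values, the concept {x1, x3} has lower approximation {x3} and upper approximation {x1, x2, x3}, because x1 cannot be told apart from x2.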
2.4.4 Support Vector Machines
SVM is a powerful learning method and is often employed to tackle class imbalance problems (Wu and Chang, 2005). SVM learns a decision boundary between two classes by mapping the training data (through kernel functions) onto a higher dimensional space, and then finding the maximal margin hyperplane within that space. This hyperplane can then be viewed as a classifier. Figure 2.4 illustrates the concept of feature mapping and two-class separation.
Consider a classifier which uses a hyperplane to separate two classes of patterns based on given examples $S = \{(x_i, y_i)\}_{i=1}^{n}$, $y_i \in \{-1, +1\}$. The hyperplane is defined by $(w, b)$, where w is a weight vector and b a bias. Let $w_0$ and $b_0$ denote the optimal values of the weight vector and bias. Correspondingly, the optimal hyperplane can be written as
$$w_0^T x + b_0 = 0 \qquad (2.14)$$
To find the optimum values of w and b, we need to solve the following optimization problem:
$$\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i, \ \ \xi_i \geq 0 \qquad (2.15)$$
where $\xi_i$ are the slack variables, C is the user-specified penalty parameter of the error term (C > 0), and $\phi$ is the mapping into the feature space associated with the kernel function.
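To make the roles of the slack variables and the penalty C in (2.15) concrete, the sketch below evaluates the objective for a given hyperplane. It is an illustration only and assumes the linear case, where $\phi(x) = x$; the function name and the example values in the usage note are hypothetical. For fixed w and b, the smallest feasible slack is $\xi_i = \max(0, 1 - y_i(w^T x_i + b))$.

```python
def soft_margin_objective(w, b, X, y, C):
    """Value of the objective in Eq. (2.15) for a fixed (w, b), linear case (phi(x) = x).

    Returns the objective and the list of smallest feasible slack variables.
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        # A point with margin >= 1 needs no slack; otherwise the shortfall is paid.
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(w_j * w_j for w_j in w) + C * sum(slacks)
    return objective, slacks
```

With w = [1, 0], b = 0 and C = 1, the points (2, 0) with label +1 and (-2, 0) with label -1 need no slack, whereas the point (0.5, 0) with label +1 lies inside the margin and contributes a slack of 0.5. A larger C penalizes such violations more heavily.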
Fig. 2.4 The operations of Support Vector Machine (figure labels: Input space, Feature space, Kernel function, Hyperplane, Margin, Support vector, Pos example, Neg example)
In this research, we used LIBSVM (version 2.8), which is an integrated tool for support vector classification and regression, and is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. We used the standard parameters of the algorithm. All optimal parameters can be automatically generated in this program and the default kernel function is Radial Basis Function (RBF).
2.5 Feature Selection From Imbalanced Data
Reduction of pattern dimensionality via feature selection is among the most fundamental steps in data processing (Swiniarski and Hargis, 2001). A large feature set often contains redundant and irrelevant information, and can actually degrade the performance of the classifier (Oyeleye and Lehtihet, 1998). The main purpose of feature selection is to remove irrelevant or redundant attributes and improve the performance of classification.
Feature selection is often applied in pattern classification, data mining, and machine learning. Among the many feature selection methods, GA, rough sets and neural networks have attracted much attention and have become popular techniques. However, when these methods are applied to imbalanced data, they usually suffer from some drawbacks, such as ignoring the minority examples and viewing them as outliers. It was reported (Batista et al., 2004; Chawla et al., 2004) that these methods, which seek accurate performance over the full range of instances, are not suitable for imbalanced learning tasks since they tend to classify all data into the majority class, which is usually the less important class. This is because typical classifiers are designed to optimize overall accuracy without taking into account the relative distribution of each class.
Rough sets emerged as a major mathematical tool for discovering knowledge and feature selection (Walczak and Massart, 1999). One of the fundamental principles of a rough set-based learning system is discovering redundancies and dependencies between the given features of a problem to be classified. A reduct generated by the rough sets approach is defined as the minimal subset of attributes that enables the same classification of objects as the full set of attributes. When rough sets are applied in practice, the computational complexity increases dramatically with the growth of the data. In addition, the deterministic mechanism for the description of error is very simple in rough sets. Therefore, the rules generated by rough sets are often unstable and have a low classification accuracy (Li and Wang, 2004).
Feature selection with neural networks can be thought of as a special case of architectural pruning (Reed, 1993), where the input features are pruned rather than the hidden neurons. Su et al. (2002) attempted to determine the important input nodes of a neural network based on the sum of absolute multiplication values of the weights between the layers. They proposed an algorithm to remove unimportant input nodes from a trained back-propagation neural network (BPNN). The essence of this method is to compare the multiplication values of the weights between the input-hidden layer and the hidden-output layer. Only the input nodes whose multiplication weights have large absolute values are kept and the rest are removed. The sum of absolute multiplication values is calculated as follows:
$$Node_i = \sum_{j} \left| W_{ij} \times V_{jk} \right| \qquad (2.16)$$
where $W_{ij}$ is the weight between the ith input node and the jth hidden node, and $V_{jk}$ is the weight between the jth hidden node and the kth output node. Then, we must set a threshold to remove the irrelevant input nodes. The threshold should be determined by the user to obtain a suitable number of input nodes. Unfortunately, the training of neural networks when using imbalanced data is very slow (Bruzzone and Serpico, 1997).
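The node score described above is simple to compute once the two weight matrices of a trained network are available. The sketch below is a hedged illustration, not the algorithm of Su et al. (2002) in full: the function names, the layout of the weight matrices as nested lists, the restriction to a single output node k, and the rule "keep a node when its score reaches the threshold" are assumptions made for the example.

```python
def input_node_scores(W, V, k=0):
    """Sum of absolute products of weights for each input node.

    W[i][j] is the weight between input node i and hidden node j;
    V[j][k] is the weight between hidden node j and output node k.
    """
    return [sum(abs(W[i][j] * V[j][k]) for j in range(len(V)))
            for i in range(len(W))]

def select_input_nodes(scores, threshold):
    """Indices of the input nodes whose score reaches the user-chosen threshold."""
    return [i for i, score in enumerate(scores) if score >= threshold]
```

With hypothetical weights W = [[1.0, -2.0], [0.1, 0.0]] and V = [[0.5], [1.0]], the first input node scores 2.5 and the second 0.05, so a threshold of 1.0 keeps only the first node.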
Another common understanding is that some learning algorithms have built-in feature selection, for example, ID3 (Quinlan, 1986), FRINGE and C4.5 (Quinlan, 1993). Almuallim and Dietterich (1994) suggested that one should not rely on ID3 or FRINGE to filter out irrelevant features. There are some cases in which ID3 and FRINGE miss extremely simple hypotheses. In addition, the negative examples of imbalanced data might be removed in the pruning phase of the tree construction.
In other words, when faced with imbalanced data, the performance of feature selection tools drops significantly (Akbani et al., 2004). Pendharkar et al. (1999) mentioned that the ratio of the number of positive examples to the number of negative examples affects effective learning. If the data set contains many positive examples and very few negative examples, there is a bias in the discriminant function that the technique will identify, and this bias results in a lower reliability of the technique. An and Wang (2001) suggested balancing the data by sampling. However, this is sometimes not feasible because there are so few negative examples.
CHAPTER 3
PROPOSED GRANULAR COMPUTING MODELS
In this chapter, we propose two kinds of GrC model, the “Knowledge Acquisition via Information Granulation” (KAIG) model and the IG based method, to tackle class imbalance problems. The KAIG model is suitable for dealing with discrete data and the IG based method is designed for continuous data. These two approaches can improve classification performance by controlling the reduction of unnecessary details.
In both proposed models, the Fuzzy ART (Adaptive Resonance Theory) neural network is utilized to construct IGs. Two indexes, the homogeneity index (H-index) and the undistinguishable ratio (U-ratio), are developed to determine a suitable level of granularity. In the KAIG model, the concept of “sub-attributes” is presented to describe IGs and to tackle the overlapping among granules. In the IG based method, we propose three strategies which utilize different data characteristics and their combinations to represent IGs.
3.1 Construction of Information Granules
In this study, the Fuzzy ART is utilized to construct IGs. ART is a well established neural network theory developed by Carpenter et al. (1991). The ART network is also a famous method of clustering. Instead of clustering by a given number of clusters, it assigns patterns onto the same cluster by comparing their similarity. The detailed algorithm of Fuzzy ART can be found in (Serrano-Gotarredona et al., 1998).
The major difference between ART and other unsupervised neural networks is the so-called vigilance parameter (ρ), which is viewed as a granularity and can be adjusted by the user to control the degree of similarity of the patterns placed in the same cluster. In an ART network, the degree of similarity between a new pattern and a stored pattern is defined. This similarity is compared to ρ to determine whether the new pattern is properly classified or not. Other unsupervised learning neural networks, which do not implement vigilance, may force a significantly different input pattern into an inappropriate cluster. In contrast to some other clustering methods, an ART network will not automatically force all input vectors into a cluster if they are not sufficiently similar. This is the reason why the ART network is employed in this study to construct the IGs.
There are three similar ART architectures, namely ART 1, ART 2, and Fuzzy ART. ART 1 is designed for binary-valued input patterns, and ART 2 is for continuous-valued patterns. Fuzzy ART is the most recent adaptive resonance framework that provides a unified architecture for both binary and continuous valued inputs. There are several factors that motivated us to use Fuzzy ART, and they are as follows (Burke and Kamal, 1995):
(1)Unlike ART1, Fuzzy ART does not require a completely binary representation of the parts to be grouped. In addition, Fuzzy ART possesses the same desirable stability properties as ART1 and a simpler architecture than that of ART2.
(2)ART2 can experience difficulty in achieving good categorizations if the input patterns are not all normalized to a constant length. However, such normalization can possibly destroy valuable information. Besides, there is a serious dependency of classification results in the case of ART1 on the sequence of input presentation. As a result, the Fuzzy ART network is employed to construct IGs in this study.
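The role of the vigilance parameter can be illustrated with a compact sketch of the standard Fuzzy ART procedure (complement coding, choice function, vigilance test, and weight update). This is a simplified single-pass illustration and not the implementation used in this study. It assumes that every attribute has already been scaled to [0, 1]; the function name is hypothetical, and the default values of the choice parameter `alpha` and learning rate `beta` are simply those quoted for the iris example later in this thesis.

```python
def fuzzy_art(patterns, rho, alpha=0.01, beta=1.0):
    """Cluster patterns (values in [0, 1]) in one pass; returns (labels, weights)."""
    weights = []  # one weight vector per category (granule)
    labels = []
    for x in patterns:
        # Complement coding: I = (x, 1 - x), so |I| equals the number of attributes.
        I = list(x) + [1.0 - v for v in x]
        norm_I = sum(I)
        # Choice function T_j = |I ^ w_j| / (alpha + |w_j|), with ^ the component-wise min.
        candidates = []
        for j, w in enumerate(weights):
            overlap = [min(a, b) for a, b in zip(I, w)]
            candidates.append((sum(overlap) / (alpha + sum(w)), j, overlap))
        candidates.sort(key=lambda c: c[0], reverse=True)
        chosen = None
        for _, j, overlap in candidates:
            # Vigilance test: accept category j only if |I ^ w_j| / |I| >= rho.
            if sum(overlap) / norm_I >= rho:
                weights[j] = [beta * o + (1.0 - beta) * w_old
                              for o, w_old in zip(overlap, weights[j])]
                chosen = j
                break
        if chosen is None:
            # No stored category is similar enough: create a new one.
            weights.append(I[:])
            chosen = len(weights) - 1
        labels.append(chosen)
    return labels, weights
```

For the three patterns (0.1, 0.1), (0.12, 0.1) and (0.9, 0.9), a vigilance of 0.8 places the first two in one granule and the third in another, whereas a vigilance of 1 keeps every distinct pattern in its own granule. Lowering ρ therefore yields fewer, coarser granules.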
3.2 Selection of Granularity
Selecting an appropriate size of IGs is a difficult task. Enough background knowledge is required to determine how similar objects should be gathered together to form one IG. An objective index is needed to select the appropriate similarity of granules. We propose H-index and U-ratio to solve this problem.
The basic assumption of the H-index is that the classes of objects should be equal if their attribute values are sufficiently similar. This implies that we always make the same decision under similar conditions. Because we form granules by the similarity of objects, the objects in the same granule should have the same class. The H-index is used to measure the consistency of the classes of the objects in the IGs. The H-index is defined as
$$H\text{-index} = \frac{\sum_{m} \dfrac{i}{n}}{m} \qquad (3.1)$$
where n represents the number of all objects in one granule, i is the number of objects in that granule possessing the majority class, m is the number of all IGs, and the sum runs over the m granules.
For example, Table 3.1 shows one IG involving five objects (n=5). There are 4 condition attributes (namely A, B, C and D) in the iris data. The decision attribute (class) of the first 4 objects is “versicolor”, but the last one has a different decision attribute, “setosa”. In this example, “versicolor” is the majority class and i=4. The H-index of this IG is 4/5.
Table 3.1
Condition attributes        Decision attribute
A      B      C      D      (Classes)
5.8    2.7    4.1    1      versicolor
6.2    2.2    4.5    1.5    versicolor
5.6    2.5    3.9    1.1    versicolor
5.9    3.2    4.8    1.8    versicolor
5      3.3    1.4    0.2    setosa
Another index for selecting similarity is the U-ratio. In the preceding example, “versicolor” is the majority of the classes. Therefore, it is assigned to be the class of this IG. If there is another granule such as the one in Table 3.2, and we are unable to distinguish the class of the IG, then we call that granule an “undistinguishable granule.” The U-ratio is defined as
$$U\text{-ratio} = \frac{u}{m} \qquad (3.2)$$
where u represents the number of undistinguishable granules and m represents the quantity of all granules.
This index is to calculate the proportion of undistinguishable granules to all granules. If there are ten granules and two of them are undistinguishable granules, which means u is equal to 2 and m is equal to 10, then the U-ratio is equal to 0.2.
In addition to these two indexes, we need a “granularity selection criterion” to determine the similarity of the IGs. In the present study, the larger the H-index the better, because a larger value means that more objects in one IG possess the same class. There is no need to fix the index at a particular value; the acceptable value depends on the domain knowledge or on how large an error can be tolerated. The U-ratio is the opposite: the smaller the better. An undistinguishable granule is difficult to process and must be examined carefully, so we try to avoid this situation by keeping the U-ratio as small as possible. In other words, if we select a specific similarity where the H-index is larger and the U-ratio is smaller, then this similarity is the best solution.
Table 3.2
Condition attributes        Decision attribute
A      B      C      D
5.4    2.2    3.9    1.2    versicolor
6.8    3.4    5.6    2.4    virginica
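Both indexes can be computed directly from the class labels of the objects in each granule. The sketch below is illustrative only. It represents each granule as a list of class labels, and it assumes that a granule is "undistinguishable" exactly when two or more classes tie for the largest count, as in the two-object granule discussed above; this tie rule and the function names are assumptions of the example.

```python
from collections import Counter

def majority_info(classes):
    """Return (size of the largest class, True if that largest size is tied)."""
    counts = Counter(classes).most_common()
    top = counts[0][1]
    tied = sum(1 for _, c in counts if c == top) > 1
    return top, tied

def h_index(granules):
    """Eq. (3.1): average over all granules of i / n."""
    return sum(majority_info(g)[0] / len(g) for g in granules) / len(granules)

def u_ratio(granules):
    """Eq. (3.2): share of granules whose class cannot be decided (assumed: a tie)."""
    return sum(1 for g in granules if majority_info(g)[1]) / len(granules)
```

For the single granule of Table 3.1 (four “versicolor” objects and one “setosa” object), `h_index` gives 4/5, and for ten granules of which two are tied, `u_ratio` gives 0.2, matching the worked examples in the text.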
3.3 Representation of Information Granules
3.3.1 The Concept of “Sub-attributes”
In the KAIG model, we propose the concept of “sub-attributes” to represent IGs. First, we utilize hyperboxes to represent IGs (Pedrycz and Bargiela, 2002). For example, a hyperbox $[b]$ defined in $R^n$ is fully described by its lower corner ($b^-$) and upper corner ($b^+$), where $b^-$ and $b^+$ are vectors in $R^n$. An important and frequently used universal set is the set of all points in the n-dimensional space. This set is denoted as $R^n$. Using $b^-$ and $b^+$ we can express the hyperbox as $[b] = [b^-, b^+]$. Consider two IGs (hyperboxes) $A = [a]$ and $B = [b]$ defined in $R^2$. More explicitly, we follow a full notation $[a] = [a^-, a^+]$ and $[b] = [b^-, b^+]$. These two granules are described in Table 3.3.
Table 3.3 Two IGs represented by hyperbox form
        Attributes
IGs     X1              X2
A       {a1−, a1+}      {a2−, a2+}
B       {b1−, b1+}      {b2−, b2+}
As Figure 3.1 shows, there are overlaps between the two granules A and B. Such overlaps are difficult for knowledge acquisition tools to handle, because most knowledge acquisition algorithms are not designed to deal with IGs, especially when overlapping occurs between granules. Unfortunately, overlapping is very common in the real world. In this study, we introduce the concept of “sub-attributes” to tackle the problem of overlaps between granules.
For X1 (attribute 1), the ranges of the two granules are separated into the overlapping part ([b1−,a1+]) and the non-overlapping parts ([a1−,b1−] & [a1+,b1+]). These sub-intervals, [a1−,b1−], [b1−,a1+] & [a1+,b1+], are named X11, X12, X13, and are the so-called “sub-attributes.” A binary variable is employed as the value of each sub-attribute and represents whether an IG contains the sub-interval or not. The results of rewriting the IGs by using sub-attributes can be found in Table 3.4. We divide the original attribute X1 into sub-attributes X11, X12, X13, and attribute X2 into X21, X22, X23. Then, these two granules are rewritten by replacing the original attributes with sub-attributes. By introducing the concept of sub-attributes, we can easily extract knowledge from the granules even when they overlap.
Figure 3.1 The overlap between IGs (figure: granules A and B drawn as overlapping boxes; X1 axis marked a1−, b1−, a1+, b1+; X2 axis marked a2−, b2−, a2+, b2+)
Table 3.4 The IGs with sub-attributes
Original attributes   X1                                    X2
Sub-attributes        X11         X12         X13           X21         X22         X23
IGs                   [a1−,b1−]   [b1−,a1+]   [a1+,b1+]     [a2−,b2−]   [b2−,a2+]   [a2+,b2+]
A                     1           1           0             1           1           0
B                     0           1           1             0           1           1
The concept of “sub-attributes” can maintain the complete characteristics of the data. IGs rewritten with sub-attributes are suitable for all knowledge acquisition algorithms, and the computational architecture of these algorithms does not need to be adjusted. However, too many sub-attributes may be generated when overlapping occurs naturally because the values of the condition attributes are continuous and diverse. Therefore, as is often done in the data preparation phase of data mining, we suggest discretizing the data before implementing the KAIG model in order to control the number of sub-attributes.
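The rewriting of hyperboxes into binary sub-attributes can be sketched as follows. This is only an illustration of the idea under stated assumptions: each granule is given as a list of (lower, upper) bounds, one pair per attribute, the cut points of an attribute are the sorted bounds of all granules, and a granule "contains" a sub-interval when the sub-interval lies inside the granule's own range. The function name and the numeric bounds in the usage note are hypothetical.

```python
def sub_attribute_encoding(granules):
    """Split each attribute into sub-intervals and encode every granule in binary.

    granules: dict mapping a granule name to a list of (low, high) pairs.
    Returns (intervals, encoded): the sub-intervals per attribute and, for each
    granule, one 0/1 value per sub-interval (attributes in order).
    """
    n_attrs = len(next(iter(granules.values())))
    intervals = []
    for d in range(n_attrs):
        # All lower and upper bounds on attribute d become cut points.
        cuts = sorted({bound for box in granules.values() for bound in box[d]})
        intervals.append(list(zip(cuts, cuts[1:])))
    encoded = {}
    for name, box in granules.items():
        encoded[name] = [int(low <= s_low and s_high <= high)
                         for d, (low, high) in enumerate(box)
                         for (s_low, s_high) in intervals[d]]
    return intervals, encoded
```

Taking a1− = 1, b1− = 2, a1+ = 3, b1+ = 4 on both attributes reproduces the pattern of Table 3.4: granule A is encoded as 1 1 0 1 1 0 and granule B as 0 1 1 0 1 1. The number of sub-attributes grows with the number of distinct bounds, which is why continuous data can produce very many of them.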
3.3.2 Using Data Characteristics to represent IGs
As mentioned above, too many sub-attributes will increase computational complexity. In order to avoid this situation, we propose another idea which uses data characteristics to describe IGs. Unlike “sub-attributes”, which use intervals to represent an IG, the IG based method utilizes different data characteristics such as the mean, median, maximum, minimum, and quartiles to describe IGs. Three IG representation strategies are provided. In strategy 1, we utilize a single value, the mean or the median, to describe IGs. Strategy 2 uses double-value combinations of data characteristics, Q1+Q3 and Maximum+Minimum. In strategy 3, we employ triple-value combinations, Q1+Median+Q3 and Maximum+Mean+Minimum.
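A granule can be turned into such a fixed-length description with Python's standard `statistics` module. The sketch below is illustrative rather than the procedure used in this study: the function names and the dictionary of strategy labels are hypothetical, each granule is assumed to be given column by column (one list of values per attribute), and the quartiles use the "inclusive" interpolation method, which is only one of several common quartile definitions.

```python
import statistics

def summarize(values):
    """Data characteristics of one attribute within one granule."""
    # Quartile definitions differ between tools; "inclusive" is an assumption here.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
    }

# The single-, double-, and triple-value strategies (labels are hypothetical).
STRATEGIES = {
    "mean": ("mean",),
    "median": ("median",),
    "q1+q3": ("q1", "q3"),
    "max+min": ("max", "min"),
    "q1+median+q3": ("q1", "median", "q3"),
    "max+mean+min": ("max", "mean", "min"),
}

def represent_granule(columns, strategy):
    """Describe a granule by the chosen characteristics of each attribute."""
    keys = STRATEGIES[strategy]
    return [summarize(col)[k] for col in columns for k in keys]
```

Each granule then becomes one row with one, two, or three values per attribute, regardless of how many objects it contains, so the number of inputs to the classifier does not grow with the amount of overlap.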
3.4 Proposed Methodologies
This section summarizes the procedure of two proposed GrC models. First, we address how the IGs are formed from numerical data. Secondly, H-index and U-ratio are introduced to determine the level of granularity which can be used to construct IGs in Fuzzy ART. Then, we try to describe IGs and extract knowledge from them.
Figure 3.2 shows the proposed KAIG model. We summarized KAIG model by the following steps:
Step 1: Information Granulation
In step 1, we use Fuzzy ART to construct IGs. First, however, we need to select a suitable level of granularity (vigilance), because the IGs are formed according to the selected granularity. The initial value of granularity is set to 1 and is then decreased gradually until a value satisfying the criteria of the H-index and U-ratio is found. This granularity is then employed to construct the IGs.
Step 2: Information Granules Representation
IGs are represented in a suitable form that can be handled by knowledge acquisition tools. As mentioned in section 3.3.1, the formed IGs are described as hyperboxes. Then, sub-attributes are applied to these IGs to solve the overlapping problem, so that we can finally extract knowledge from them.
Step 3: Knowledge Acquisition
After describing IGs appropriately and tackling the overlapping situation, we can use knowledge acquisition tools to extract knowledge rules from the granules. In this study, we will compare three famous data mining algorithms, C4.5, Rough sets and back-propagation neural network, to evaluate their effectiveness in KAIG model.
Figure 3.2 Knowledge Acquisition via Information Granulation (KAIG) model (flowchart: Numerical data → Select the level of granularity → Information granulation → Check granularity by using H-index & U-ratio; Not satisfied: select the level of granularity again; Satisfied: Information granules representation (Sub-attributes) → Knowledge acquisition → Knowledge rules)
3.4.2 The IG based Method
In the KAIG model, we use “sub-attributes” to describe IGs and solve the overlapping situation effectively. However, when dealing with continuous data, KAIG may generate so many sub-attributes that the computational complexity of the knowledge acquisition algorithms increases. The same situation may occur when discretization algorithms divide the values of a continuous attribute into too many discrete intervals. Therefore, we propose the IG based method in this section.
In this method, the “information granulation” process is the same as in the KAIG model. The only difference is the description of the IGs: this method utilizes data characteristics to denote IGs without using sub-attributes. The IG based method follows the three steps described below. We adopt three strategies, listed in Step 2, to describe IGs. They are different combinations of data characteristics (mean, median, quartiles, maximum & minimum): the single-value, double-value, and triple-value strategies. Then we can build a classifier from these data characteristics.
Step 1: Information Granulation
Step 2: IG Representation: Data Characteristics
Strategy 1 - Single value: Mean, Median
Strategy 2 - Double values: Max+Min, Q1+Q3
Strategy 3 - Triple values: Max+Mean+Min, Q1+Median+Q3
Step 3: Knowledge Acquisition
CHAPTER 4
NUMERICAL EXAMPLES
In this chapter, several data sets from the UCI data bank are employed to illustrate our models and evaluate their effectiveness. In addition, some imbalanced data sets are used to demonstrate the superiority of our methods in solving the class imbalance problem, using the indexes Overall Accuracy, G-mean and the Receiver Operating Characteristic (ROC) curve.
4.1 Performance Measures
Before implementation, we should discuss the effectiveness of performance indexes in the class imbalance situation. The easiest way to evaluate the performance of classifiers is based on the confusion matrix shown in Table 4.1. TP, FP, TN and FN are defined as follows.
TP: the number of True Positive examples
FP: the number of False Positive examples
TN: the number of True Negative examples
FN: the number of False Negative examples
Traditionally, the performance of a classifier is evaluated by considering the overall accuracy against test cases. However, when learning from imbalanced data sets, this measure is often not sufficient. For example, it is straightforward to create a classifier having an accuracy of 95% in a domain where the majority class proportion corresponds to 95% of the examples, by simply forecasting every new example as belonging to the majority class. Another issue is that the metric considers different classification errors to be equally important. However, a highly imbalanced class problem does not have equal error costs; the costs favor the minority class, which is often the class of primary interest. Therefore, following the available studies (Batista et al., 2004; Estabrooks et al., 2004; Guo and Viktor, 2004; Provost and Fawcett, 2001; Radivojac et al., 2004), we use Overall Accuracy (including Positive Accuracy and Negative Accuracy), G-mean and the Receiver Operating Characteristic (ROC) curve to evaluate our KAIG model. The G-mean is defined as
$$G\text{-mean} = \sqrt{\text{Positive Accuracy} \times \text{Negative Accuracy}} \qquad (4.1)$$
where Positive Accuracy and Negative Accuracy are calculated as TP/(FN+TP) and TN/(TN+FP). This measure aims to maximize the accuracy on each of the two classes while keeping these accuracies balanced. For instance, a high Positive Accuracy combined with a low Negative Accuracy will result in a poor G-mean.
Table 4.1 Confusion matrix for binary class problem
                   Predicted Positive                    Predicted Negative
Actual Positive    TP (the number of True Positive)      FN (the number of False Negative)
Actual Negative    FP (the number of False Positive)     TN (the number of True Negative)
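The measures above follow directly from the four cells of the confusion matrix. The short sketch below shows the arithmetic; the function name is hypothetical, and it assumes that the test set contains at least one positive and one negative example so that no denominator is zero.

```python
import math

def classification_metrics(tp, fn, fp, tn):
    """Overall, Positive and Negative Accuracy and G-mean (Eq. 4.1) from Table 4.1."""
    positive_accuracy = tp / (tp + fn)  # accuracy on the actual positives
    negative_accuracy = tn / (tn + fp)  # accuracy on the actual negatives
    return {
        "overall_accuracy": (tp + tn) / (tp + fn + fp + tn),
        "positive_accuracy": positive_accuracy,
        "negative_accuracy": negative_accuracy,
        "g_mean": math.sqrt(positive_accuracy * negative_accuracy),
    }
```

For the majority-class classifier discussed above, with 95 positive and 5 negative test examples all predicted as positive (TP=95, FN=0, FP=5, TN=0), the Overall Accuracy is 0.95 but the Negative Accuracy and the G-mean are both 0, which exposes the failure on the minority class.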
Another index is the ROC curve, a technique for summarizing a classifier’s performance over a range of operating points by considering the tradeoffs between the TP rate and the FP rate. The TP rate and FP rate are calculated as TP/(FN+TP) and FP/(FP+TN). We use the term ROC space to denote the coordinate system used for visualizing a classifier’s performance. In ROC space, the TP rate is represented on the Y axis and the FP rate on the X axis. Each classifier is represented by the point in ROC space corresponding to its (FP rate, TP rate) pair. ROC analysis also allows the performance of multiple classification functions to be visualized and compared simultaneously. The area under the ROC curve (AUC) represents the expected performance as a single scalar. The AUC has a known statistical meaning: it is equivalent to the Wilcoxon test of ranks, and to several other statistical measures for evaluating classification and rank models (Hand, 1997).
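The rank interpretation of the AUC can be computed without drawing the curve: the AUC equals the proportion of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below is a minimal illustration of this equivalence; the function name and the example scores in the usage note are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auc_by_ranks(pos_scores, neg_scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

With positive scores 0.9 and 0.3 and negative scores 0.5 and 0.1, three of the four pairs are ranked correctly, so the AUC is 0.75. A classifier that scores every positive example above every negative one reaches an AUC of 1.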
4.2 Implementation of KAIG Model
4.2.1 Illustrative Example
We apply the KAIG model to the well-known iris data set. It comprises 150 examples. We rearrange it randomly and divide it into two subsets, a training set (100 examples) and a test set (50 examples). We will illustrate the process of KAIG step by step.
Step 1: Information Granulation
We input the 100 training examples to the Fuzzy ART to form IGs. We set the parameters of Fuzzy ART to α = 0.01 and β = 1. The number of IGs varies with the level of similarity (vigilance). In this study, the similarity value varies gradually from 1 to 0. A similarity of 1 represents the original numerical data. Next, we need to determine which similarity is suitable by using the H-index and the U-ratio. For the H-index, “the larger the better”, and for the U-ratio, “the smaller the better”. In Figure 4.1, we can find more than one similarity that satisfies this criterion. These similarities are 0.95-0.8 and 0.7-0.55, where H-index = 1 and U-ratio = 0. Their classification performances, as shown in Figure 4.2, are equal to each other: all classification accuracies are equal to 100%.
When the performances are equally good, the number of granules becomes another criterion for selecting the similarity. In this study, we use IGs instead of numerical data to acquire knowledge and make decisions. The smaller the selected similarity, the fewer granules have to be dealt with, and fewer granules may save some training time during the building of the model. Therefore, we select a similarity of 0.55, for which the number of granules is 3.
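The selection rule used in this example (keep the similarities that satisfy the H-index and U-ratio criteria, then take the smallest one) can be written down in a few lines. The sketch below is illustrative only: the function name, the default targets (H-index of 1 and U-ratio of 0, as in the iris example), and the candidate values used in the usage note are assumptions and not results from this study.

```python
def choose_similarity(results, h_target=1.0, u_target=0.0):
    """Pick the smallest similarity whose H-index and U-ratio meet the targets.

    results: list of dicts with keys "similarity", "h_index" and "u_ratio".
    Returns the chosen dict, or None if no candidate qualifies.
    """
    qualified = [r for r in results
                 if r["h_index"] >= h_target and r["u_ratio"] <= u_target]
    if not qualified:
        return None
    # A smaller similarity (vigilance) gives fewer, coarser granules.
    return min(qualified, key=lambda r: r["similarity"])
```

For hypothetical candidates with similarities 0.9, 0.75, 0.6, 0.55 and 0.5, of which only 0.9, 0.6 and 0.55 reach an H-index of 1 with a U-ratio of 0, the function returns the candidate with similarity 0.55. In practice the targets would be relaxed according to domain knowledge or the tolerable error.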
Figure 4.1 The H-index and U-ratio of the iris data (x-axis: Similarity, from 1 to 0.2; y-axis: H-index and U-ratio, from 0 to 1.2)
Figure 4.2 The performance of classification (Iris data) (x-axis: Similarity, from 1 to 0.4; y-axis: classification accuracy, from 0% to 120%)
Step 2: Representing the IGs
We describe these 3 granules in hyperbox form and they are shown in Table 4.2. $L_i$ represents the lower bound of attribute values, and $U_i$ represents the upper limit of attribute values in the i-th granule. Take granule 1 for example: it contains 33 objects. In condition attribute A, the minimum is 4.4 and the maximum is 5.7. We utilize the lower limit and upper limit to describe all examples in the same granule. Granule 1 possesses a single class, setosa. Granule 2 contains 33 examples which are of the same class, versicolor. Granule 3 is comprised of 34 examples which have the