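A Study on the Application of Evolutionary Algorithms to Data Mining (II)

National Science Council, Executive Yuan: Research Project Final Report

A Study on the Application of Evolutionary Algorithms to Data Mining (2/2)

Project type: Individual project

Project number: NSC93-2213-E-009-028-

Project period: August 1, 2004 to July 31, 2005

Executing institution: Department of Industrial Engineering and Management, National Chiao Tung University

Principal investigator: 沙永傑

Project participant: 劉正祥

Report type: Full report

Report attachments: Report on attending an international conference, and published papers

Disposition: This project may be publicly accessed

September 29, 2005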

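Table of Contents

Chinese Abstract
English Abstract
Introduction and Motivation
Research Objectives
Literature Review
Research Methods
Results and Discussion
References
Self-Evaluation of Project Results
Appendix 1: Ant-Classifier Source Code
Report on Attending an International Academic Conference

List of Tables

Table 1. Summary of studies applying genetic algorithms to classification
Table 2. Functions of GA-Classifier
Table 3. Functions of Ant-Classifier
Table 4. Components of the performance index
Table 5. Description of the data sets
Table 6. Classification error rates

List of Figures

Figure 1. Flowchart of the genetic algorithm (王培珍, 1996)
Figure 2. Behavior of ants in the real world (Dorigo, 1996)
Figure 3. Execution flow of the Ant System applied to an optimization problem
Figure 4. Execution flow of GA-Classifier
Figure 5. The Ant-Classifier algorithm
Figure 6. Experimental framework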

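Chinese Abstract

In knowledge discovery and data mining, classification is regarded as an important research topic. The purpose of classification is to define the characteristics of each class: a model for judging class membership is built from the training data and then used to assign unclassified records to classes.

In general, data fall into two types: numerical and categorical. Numerical data are very common in real life, yet most classification methods can handle only categorical data. Most studies therefore discretize numerical data into categorical data during data preprocessing and then extract classification rules with a data mining tool. However, any method of discretizing numerical data loses some of the information originally hidden in the data. This study therefore aims to develop a classification method that extracts classification rules without discretizing numerical data.

The study has two phases. The first phase uses the genetic algorithm (GA) as the main research tool, drawing on its efficiency and flexibility to design a classification method that handles numerical and categorical data simultaneously. The second phase designs, on the basis of ant theory (Ant System), a classification method that likewise handles both data types, and compares the classification performance of the two evolutionary algorithms.

Keywords: data mining, discretization, genetic algorithm, ant theory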

English Abstract

Classification is one of the important issues in knowledge discovery and data mining. The goal of classification is to define the characteristics of each class in order to predict whether a previously unseen object belongs to the class.

In general, attribute data types can be divided into two groups: numerical and categorical. Numerical attributes are very common in real-world applications, yet a large number of classification algorithms handle categorical attributes only. Discretization is therefore an essential data preprocessing task in knowledge discovery in databases (KDD). However, discretizing a numerical attribute can lose some of the information hidden in the data set. We therefore propose classification algorithms that extract classification rules from numerical and categorical attributes simultaneously.

The first step of this project uses the genetic algorithm to develop a classifier (GA-Classifier) that handles numerical and categorical attributes simultaneously and is more effective than the traditional classification algorithm, the decision tree (C4.5). The second step uses a novel evolutionary algorithm (Ant System, AS) to develop a classifier (Ant-Classifier) that is more effective and efficient than GA-Classifier and decision trees.

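Introduction and Motivation

In response to a changing environment and increasingly fierce business competition, many enterprises hope to gain competitive advantage through information technology. Yet having adopted information technology to collect data efficiently, they find that they cannot efficiently discover useful knowledge and rules in the huge volume of data collected. Their focus has therefore shifted to how to use the collected data effectively to obtain useful information, and data mining has gradually drawn attention from both academia and industry. Common definitions of data mining include the following.

Cabena (1997) defines data mining as the process of extracting previously unknown, valid information from large databases and providing it to decision makers as a basis for decisions.

Hall (1995) defines data mining as the analysis of large amounts of data, by fully or semi-automatic means, to find meaningful relationships or rules.

Berry (1997) defines data mining as a combination of many techniques, such as data visualization, machine learning, statistics, and databases, used to extract knowledge expressed as rules or other patterns from large volumes of data.

Data mining can be applied in nearly every industry, for example manufacturing, financial investment, credit card transactions, and services. Fundamentally, data mining is one step in Knowledge Discovery in Databases (KDD). KDD can be roughly divided into five steps [Bruha, 2000]:

1. Understand the application domain, become familiar with the relevant knowledge, and select the data mining technique to be used.

2. Collect the target data.

3. Preprocess the data: handle inconsistencies and missing values in the target data as needed. Because attributes in the target data are of two types, numerical and categorical, numerical data are discretized during preprocessing and converted into categorical data for the convenience of the data mining tool.

4. Extract knowledge from the preprocessed data with a data mining tool.

5. Postprocess the results: experts verify the extracted knowledge, and useful knowledge is incorporated into the existing decision system.

Data mining tasks can be divided into five kinds [Collard, 2001]: description, classification, association discovery, sequential pattern analysis, and regression.

This project focuses on the classification task, which is regarded as an important research topic in data mining. The purpose of classification is to define the characteristics of each class: a model for judging class membership is built from the training data and then used to assign unclassified records to classes. The knowledge extracted by a classification task is usually expressed as IF-THEN rules of the following form.

IF <antecedent> THEN <class>

The antecedent consists of several terms joined by the logical operator AND. Each term consists of three pieces of information: an attribute, a comparison operator, and an attribute value, for example <gender = male>. The class part predicts the class value of records that satisfy the antecedent. From the user's point of view, the IF-THEN form is simple and easy to understand, and is therefore more readily accepted by knowledge users.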
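The rule form described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the project's implementation: the names `RuleSketch`, `Term`, and `Op` are hypothetical, records are assumed to be stored as numeric arrays with categorical values encoded as numeric codes, and the operator set includes the extra comparison operators discussed later in the report.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of an IF-THEN rule whose terms are joined by AND. */
final class RuleSketch {
    enum Op { EQ, NE, LE, GE }

    /** One term: attribute index, comparison operator, attribute value. */
    static final class Term {
        final int attribute;
        final Op op;
        final double value; // categorical codes stored as numbers (assumption)

        Term(int attribute, Op op, double value) {
            this.attribute = attribute;
            this.op = op;
            this.value = value;
        }

        boolean matches(double[] record) {
            double x = record[attribute];
            switch (op) {
                case EQ: return x == value;
                case NE: return x != value;
                case LE: return x <= value;
                default: return x >= value; // GE
            }
        }
    }

    final List<Term> terms;
    final int predictedClass;

    RuleSketch(List<Term> terms, int predictedClass) {
        this.terms = terms;
        this.predictedClass = predictedClass;
    }

    /** A record is covered only when every term of the antecedent holds. */
    boolean covers(double[] record) {
        for (Term t : terms) {
            if (!t.matches(record)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // IF gender = 1 AND age >= 30 THEN class 2 (the encoding is made up)
        RuleSketch rule = new RuleSketch(Arrays.asList(
                new Term(0, Op.EQ, 1), new Term(1, Op.GE, 30)), 2);
        System.out.println(rule.covers(new double[] {1, 42}));
        System.out.println(rule.covers(new double[] {0, 42}));
    }
}
```

Keeping each term as an (attribute, operator, value) triple mirrors the three pieces of information named in the text, so a rule stays readable when printed back as an IF-THEN sentence.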

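In general, most classification methods discretize numerical data into categorical data during data preprocessing and then extract classification rules with a data mining tool. However, any method of discretizing numerical data loses some of the information originally hidden in the data. This study therefore aims to develop a classification method that discretizes numerical data while reducing information loss and then extracts classification rules.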

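Research Objectives

The objective of this project is to use evolutionary algorithms to develop classification methods that handle numerical and categorical data simultaneously, discretize numerical data while reducing information loss, and extract classification rules. The project has two phases. The first phase uses the genetic algorithm (GA) as the main research tool, drawing on its efficiency and flexibility to design a classifier that outperforms the traditional classification tool, the decision tree (DT). The second phase applies the same concept to ant theory (ACO) to design a classifier that handles numerical and categorical data simultaneously, and compares the classification performance of the two evolutionary algorithms.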

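Literature Review

Many machine learning (ML) tools have been applied to classification, such as decision trees, neural networks, and genetic algorithms. Decision trees are currently the most commonly used classification tool. Their advantage is that the rules they produce are easy for people to accept and interpret; their drawback is that they cannot detect or exploit interacting attributes [Clare, 2000]. In addition, a decision tree requires numerical data to be discretized into categorical data with a discretization tool (C4.5 Discretization) during the preprocessing stage of Knowledge Discovery in Databases (KDD), after which the C4.5 algorithm extracts the rules. However, any method of discretizing numerical data loses some of the information originally hidden in the data [Bruha, 2000], so rules should be extracted without discretizing numerical data. Neural networks can handle numerical and categorical data simultaneously, but their computation is regarded as a black box and their output is hard to interpret. Genetic algorithms produce easily interpreted output and have global search capability, but they require longer computation time. This project therefore develops, on the basis of evolutionary algorithms, an efficient classification method that extracts classification rules without discretizing numerical data and handles numerical and categorical data simultaneously.

Many earlier studies have used genetic algorithms to develop classifiers [Congdon, 2000][Bandar, 1999][Lopes and Pozo, 2001][Fu and Mae, 2001][Pozo and Hasse, 2000][Shin and Lee, 2002][Bruha, 2000][Noda et al., 1999][Fidelis et al., 2000], and they report that classification by genetic algorithm is more accurate than classification by decision tree. Table 1 summarizes recent studies that apply genetic algorithms to classification problems, showing how each handles numerical data and which comparison and logical operators it allows in rules. Table 1 shows that early applications of genetic algorithms to classification all discretized numerical data, allowed only the comparison operator equals (=), and allowed only the logical operator AND. More recently, researchers have published genetic algorithms that handle numerical and categorical data simultaneously and allow more comparison operators in rules than before, but the only logical operator allowed is still AND.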

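Table 1. Summary of studies applying genetic algorithms to classification

Author         | Year | Numerical data handling | Comparison operators                  | Logical operators | Remarks
Noda et al.    | 1999 | No numerical data       | =                                     | AND               | --
Bruha et al.   | 2000 | Discretization          | =                                     | AND               | --
Congdon        | 2000 | Discretization          | =                                     | AND               | --
Fidelis et al. | 2000 | No discretization       | numerical: ≧, <; categorical: =, ≠    | AND               | --
Pozo and Hasse | 2000 | No discretization       | numerical: ≦, ≧; categorical: =       | AND               | --
Shin and Lee   | 2002 | No discretization       | numerical: <, ≧; categorical: <, ≧    | AND               | Number of terms limited to 5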

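Table 1 shows that the classification knowledge (IF-THEN rules) generated by genetic algorithms has little variety of form. The first year of the project therefore develops an efficient genetic-algorithm-based classifier with a wider variety of operators that handles numerical and categorical data simultaneously.

The genetic algorithm (GA) was first published by John Holland and colleagues in 1975, but theory and applications developed substantially only after the 1980s. A GA encodes all search parameters as a finite-length representation (a string), applies genetic operators to produce the next generation, and uses an evaluation criterion so that offspring perform better than their parents. To solve an optimization problem with a GA, the objective must be converted into a corresponding function, called the fitness function, which represents how well a system adapts to its environment and serves as the system's performance index. In each generation the GA probabilistically copies high-fitness solutions into the next generation and then applies operations such as reproduction, mutation, and crossover to produce a fitter next generation; repeating this process gradually approaches a near-optimal solution. This resembles survival in nature, where fitter offspring are produced so that the population can persist in its environment.

The algorithm begins by generating an initial set, called the population, containing N chromosomes. Each chromosome consists of many genes, and each gene represents one independent variable, so each chromosome represents one solution. The GA uses the fitness function to evaluate the quality of each solution. In every generation it selects the chromosomes with better fitness into the mating pool for reproduction, then produces the next generation through crossover and mutation. Some parameters must be set in advance: population size, crossover rate, mutation rate, the design of the fitness function, and the stopping conditions. Population size affects the outcome: too small a population converges early and is unlikely to reach the desired result, while too large a population consumes more computation time. Too high a crossover rate requires longer computation time; too low a crossover rate makes evolution converge faster. Too high a mutation rate makes evolution behave like random search; too low a mutation rate means that few new individuals enter the population. The fitness function must be designed for the problem at hand: it must reflect quality differences between chromosomes and be able to eliminate poorly performing ones.

Figure 1 shows the execution flow of a GA applied to an optimization problem [王培珍, 1996]. The steps are:

(1) Define the fitness function and the encoding.

(2) Generate the initial solutions.

(3) Evaluate the objective values.

(4) If the stopping condition is met, stop; otherwise go to step (5).

(5) Apply the genetic operators.

(6) After the new population is generated, go to step (3).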

[Figure: define the fitness function and the encoding; generate the initial population P(t), t = 0; evaluate the fitness of P(t); if the search has converged or the stopping condition is met, end; otherwise apply the genetic operators (reproduction, crossover, mutation), set t = t + 1, and form the new generation.]

Figure 1. Flowchart of the genetic algorithm (王培珍, 1996)

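Ant theory was first published by M. Dorigo in 1996, applied to the Traveling Salesman Problem (TSP). It has since been applied to other NP-hard problems, such as the Asymmetric Traveling Salesman Problem (ATSP), the Quadratic Assignment Problem (QAP), the Job Shop Scheduling Problem (JSP), and the Vehicle Routing Problem (VRP), as well as to data mining (DM).

Ant theory is derived from the way ants in nature search for food. Ants are nearly blind and communicate through pheromone while foraging. As an ant moves it leaves pheromone that other ants use to choose a path. As more ants take the same path, the pheromone on it increases, while the pheromone on other paths gradually evaporates. The more pheromone a path carries, the more likely ants are to choose it and the less likely they are to choose the others, until most ants converge on a single path, which is the shortest path between the nest and the food source. In Figure 2, E is the nest and A is the food location. Ants carry food continuously along path AE (Figure 2(a)). If an obstacle is placed on path AE, the ants must go around one side of it to reach their destination (the food A or the nest E). When the obstacle is first placed, neither side carries any pheromone, so the two sides are equally likely to be chosen (Figure 2(b)). Because path ACE is shorter and takes less time, less of the pheromone left on it evaporates. When later ants pass, path ACE carries more pheromone, so they are more likely to choose it. Over time, most ants choose this shorter path (Figure 2(c)).

Figure 2. Behavior of ants in the real world (Dorigo, 1996)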

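Researchers have so far proposed several different ant algorithms. Apart from the problems they solve, they differ mainly in how solutions are constructed and how pheromone is updated. This section uses the TSP to introduce briefly the Ant System (AS) used in this project. The notation is defined first:

1. $\tau_{ij}(t)$: the amount of pheromone between node $i$ and node $j$ in cycle $t$.

2. $\eta_{ij}$: information other than pheromone that guides an ant's choice of path. In the TSP it can be defined as $1/d_{ij}$, where $d_{ij}$ is the distance between city $i$ and city $j$.

3. $\rho$: the fraction of pheromone that evaporates in each cycle.

4. $p_{ij}^{k}(t)$: the probability that ant $k$, having reached node $i$ in cycle $t$, chooses to move to node $j$.

5. $N_{i}^{k}$: the set of nodes that ant $k$ may choose as its next destination when it is at node $i$.

6. $t$: the cycle index.

7. $m_{ant}$: the number of ants used in each cycle.

8. $\alpha$: the weight given to pheromone when an ant randomly chooses a path.

9. $\beta$: the weight given to the other information when an ant randomly chooses a path.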

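The Ant System (AS) is one of the earliest ant colony algorithms. The probability of moving to a given node is shown in the formula below. When an ant is at node $i$, the probability of moving to node $j$ depends on the pheromone on arc $ij$ and on the other information, through the product $[\tau_{ij}(t)]^{\alpha}[\eta_{ij}]^{\beta}$: the larger this product, the greater the probability of moving to node $j$. Here $\alpha$ and $\beta$ are the weights on $\tau_{ij}$ and $\eta_{ij}$ respectively. A larger $\alpha$ means the path is chosen more according to $\tau_{ij}$; a larger $\beta$ means it is chosen more according to $\eta_{ij}$.

$$p_{ij}^{k}(t) = \frac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in N_{i}^{k}} [\tau_{il}(t)]^{\alpha}\,[\eta_{il}]^{\beta}}, \quad j \in N_{i}^{k}$$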
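The transition rule above can be sketched in Java as follows. This is a minimal illustration rather than the project's code: the class and method names (`AntTransition`, `probabilities`, `chooseNext`) are hypothetical, and the pheromone matrix `tau`, the heuristic matrix `eta` (for example $1/d_{ij}$), and the candidate set `allowed` (standing for $N_i^k$) are assumed to be supplied by the caller.

```java
import java.util.Arrays;

/** Hypothetical sketch of the Ant System transition rule. */
final class AntTransition {
    /**
     * Returns p_ij^k(t) for every candidate j in allowed (the set N_i^k):
     * each weight is tau[i][j]^alpha * eta[i][j]^beta, normalised to sum to 1.
     */
    static double[] probabilities(int i, int[] allowed, double[][] tau,
                                  double[][] eta, double alpha, double beta) {
        double[] p = new double[allowed.length];
        double sum = 0.0;
        for (int n = 0; n < allowed.length; n++) {
            int j = allowed[n];
            p[n] = Math.pow(tau[i][j], alpha) * Math.pow(eta[i][j], beta);
            sum += p[n];
        }
        for (int n = 0; n < allowed.length; n++) {
            p[n] /= sum;
        }
        return p;
    }

    /** Roulette-wheel choice of the next node; u is uniform in [0, 1). */
    static int chooseNext(int[] allowed, double[] p, double u) {
        double cumulative = 0.0;
        for (int n = 0; n < allowed.length; n++) {
            cumulative += p[n];
            if (u < cumulative) return allowed[n];
        }
        return allowed[allowed.length - 1]; // guard against rounding error
    }

    public static void main(String[] args) {
        double[][] tau = {{0, 1, 1}, {1, 0, 1}, {1, 1, 0}};
        double[][] eta = {{0, 1.0, 0.5}, {1.0, 0, 1.0}, {0.5, 1.0, 0}}; // 1/d_ij
        int[] allowed = {1, 2};
        double[] p = probabilities(0, allowed, tau, eta, 1.0, 2.0);
        System.out.println(Arrays.toString(p));
        System.out.println(chooseNext(allowed, p, Math.random()));
    }
}
```

With equal pheromone on both arcs, the nearer city receives the larger probability, and raising `beta` widens that gap, which is the behaviour the text attributes to the weight $\beta$.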

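After each cycle, the Ant System updates the pheromone left on all paths. The shorter the path an ant traveled in the cycle, the more pheromone is added; the longer the path, the less is added. The update is given by the formulas below, where $f^{k}(t)$ is the length of the path traveled by ant $k$ in cycle $t$.

$$\tau_{ij}(t+1) = (1-\rho)\cdot\tau_{ij}(t) + \sum_{k=1}^{m_{ant}} \Delta\tau_{ij}^{k}(t)$$

$$\Delta\tau_{ij}^{k}(t) = \begin{cases} 1/f^{k}(t) & \text{if arc } (i,j) \text{ is used by ant } k \\ 0 & \text{otherwise} \end{cases}$$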
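The pheromone update can be sketched in Java as below. It is a hedged illustration only: `PheromoneUpdate` and its parameters are hypothetical names, each ant's solution is assumed to be a closed tour given as a node sequence with a known length, and the deposit is applied in both directions on the assumption of a symmetric TSP.

```java
/** Hypothetical sketch of the Ant System pheromone update for a symmetric TSP. */
final class PheromoneUpdate {
    /**
     * Updates tau in place:
     * tau_ij(t+1) = (1 - rho) * tau_ij(t) + sum over ants k of delta_ij^k(t),
     * where delta_ij^k(t) = 1 / lengths[k] if ant k used arc (i, j), else 0.
     */
    static void update(double[][] tau, int[][] tours, double[] lengths, double rho) {
        int n = tau.length;
        // Evaporation on every arc.
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                tau[i][j] *= (1.0 - rho);
            }
        }
        // Deposit: shorter tours add more pheromone per arc.
        for (int k = 0; k < tours.length; k++) {
            int[] tour = tours[k];
            double deposit = 1.0 / lengths[k];
            for (int s = 0; s < tour.length; s++) {
                int i = tour[s];
                int j = tour[(s + 1) % tour.length]; // wrap around to close the tour
                tau[i][j] += deposit;
                tau[j][i] += deposit; // symmetric problem (assumption)
            }
        }
    }

    public static void main(String[] args) {
        double[][] tau = {{1, 1, 1}, {1, 1, 1}, {1, 1, 1}};
        int[][] tours = {{0, 1, 2}};
        double[] lengths = {4.0};
        update(tau, tours, lengths, 0.5);
        System.out.println(tau[0][1]);
    }
}
```

Evaporation is applied to every arc before any deposit, so arcs that no ant used simply decay by the factor $(1-\rho)$.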

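Figure 3 shows the execution flow of the Ant System applied to an optimization problem. The steps are:

(1) Initialize the algorithm.

(2) According to the pheromone levels, probabilistically select a move that is not in the tabu list.

(3) If the colony's tabu lists are full, go to step (4); otherwise go to step (2).

(4) Compute the solution quality.

(5) Update the pheromone in proportion to the solution quality.

(6) If the stopping condition is met, stop; otherwise go to step (2).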

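[Figure: initialize the algorithm; according to the pheromone levels, probabilistically select a move not in the tabu list as the ant's next move; check whether the colony's tabu lists are full; compute the solution quality; update the pheromone of every move in the solution in proportion to the solution quality; check whether the algorithm has converged or met the stopping condition; end the algorithm and output the best solution.]

Figure 3. Execution flow of the Ant System applied to an optimization problem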

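Parpinelli et al. were the first to apply ant theory to data classification. However, they discretized numerical data into categorical data with a discretization tool (C4.5 Discretisation) before extracting rules, and allowed only the operators equals (=) and AND. The second year of this project therefore follows the same concept as before and develops an efficient ant-based classifier with a wider variety of operators that handles numerical and categorical data simultaneously.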

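Research Methods

The first year of the project uses the genetic algorithm as the main research tool to develop a genetic-algorithm-based classifier (GA-Classifier). The classifier treats different data types differently. It does not discretize numerical data, so as to preserve the information hidden in the original data. Numerical data may use the comparison operators less than or equal to (≦) and greater than or equal to (≧), while categorical data may use equal to (=) and not equal to (≠), as listed in Table 2; the logical operators available are AND and OR. The wider choice of comparison and logical operators is intended to let several rules be merged into one, reducing the number of rules generated and making them easier for decision makers to use. Table 2 describes the functions of GA-Classifier.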

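Table 2. Functions of GA-Classifier

Algorithm     | Numerical data handling | Comparison operators (numerical data) | Comparison operators (categorical data) | Logical operators
GA-Classifier | No discretization       | ≦, ≧                                  | =, ≠                                    | AND, OR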

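To generate a term for a continuous attribute, GA-Classifier first randomly generates a base point and an operator, then randomly generates an interval within the search domain set in GA-Classifier. The base point plus the interval is the bound value of the term, for example: variable(x) ≦ base point + interval.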
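The construction of a continuous-attribute term can be sketched in Java as follows. This is an assumption-laden illustration, not the GA-Classifier source: the class name `NumericTermSketch` is hypothetical, attribute values are assumed to be normalised to [0, 1], and the interval is assumed to be drawn uniformly from [-searchDomain, searchDomain).

```java
import java.util.Random;

/** Hypothetical sketch of a rule term for a continuous attribute. */
final class NumericTermSketch {
    final boolean lessOrEqual; // true: x <= bound, false: x >= bound
    final double basePoint;
    final double interval;

    NumericTermSketch(boolean lessOrEqual, double basePoint, double interval) {
        this.lessOrEqual = lessOrEqual;
        this.basePoint = basePoint;
        this.interval = interval;
    }

    /** Bound value = base point + interval. */
    double bound() {
        return basePoint + interval;
    }

    boolean matches(double x) {
        return lessOrEqual ? x <= bound() : x >= bound();
    }

    /** Draws a random operator, base point, and interval within the search domain. */
    static NumericTermSketch random(Random rng, double searchDomain) {
        boolean le = rng.nextBoolean();
        double base = rng.nextDouble(); // values assumed normalised to [0, 1)
        double interval = (rng.nextDouble() * 2.0 - 1.0) * searchDomain;
        return new NumericTermSketch(le, base, interval);
    }

    public static void main(String[] args) {
        NumericTermSketch term = random(new Random(), 0.5);
        System.out.println((term.lessOrEqual ? "x <= " : "x >= ") + term.bound());
    }
}
```

Storing the base point and the interval separately, rather than only their sum, lets a later step shrink the search domain or perturb the interval without moving the base point.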

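Figure 4 shows the execution flow of the GA-Classifier we developed. The first step randomly generates an initial generation of IPopSize individuals, each representing one IF-THEN rule. The performance of these 30 individuals is measured, and the next generation is produced according to their performance. Rule pruning is then applied to the individuals of the new generation to remove meaningless terms, reducing the cost of data collection and evaluation. The stopping condition is checked next: if it is met, the current best solution is output; otherwise the algorithm proceeds to the next step. In that step, if the current best solution has not improved after a preset number of generations, an intensification step is performed. During intensification, the number of individuals per generation (PopSize) is reduced, the search domain for continuous data is narrowed, and the mutation probability (PMut) of the genetic algorithm is lowered. The reduction formulas are given below, where $\rho_{red}$ is a user-defined reduction parameter, $\Delta PopSize$ is the number of chromosomes to remove in this generation, $IPMut$ is the initial mutation probability, and $Nb_{red}$ is the number of reductions performed so far.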

$$\text{Search domain}_{new} = \frac{\text{Search domain}_{old}}{\rho_{red}}$$

$$PopSize_{new} = PopSize_{old} - \Delta PopSize$$

$$PMut = IPMut \cdot \exp(-Nb_{red})$$

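[Figure: generate the initial population; evaluate the performance of each individual; produce a new generation (1. selection, 2. crossover, 3. mutation, 4. elitism); according to the performance of each individual, start the local agent to perform rule pruning; if the stopping condition is reached, output the best solution; otherwise, if the current best solution has not improved after the preset number of generations, perform intensification (1. reduce the neighborhood distance, 2. reduce the population size, 3. reduce the mutation probability) before producing the next generation.]

Figure 4. Execution flow of GA-Classifier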

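This project implements GA-Classifier in Java and compares its classification performance with that of the traditional classification tool, the decision tree.

The second year of the project uses the ant colony algorithm as the main research tool to design an Ant-Classifier. This classifier has the same functions as GA-Classifier, as shown in Table 3. Ant-Classifier does not discretize numerical data, so as to preserve the information hidden in the original data. Numerical data may use the comparison operators less than or equal to (≦) and greater than or equal to (≧), while categorical data may use equal to (=) and not equal to (≠), as listed in Table 3; the logical operators available are AND and OR. The wider choice of comparison and logical operators is intended to let several rules be merged into one, reducing the number of rules generated and making them easier for decision makers to use.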

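Table 3. Functions of Ant-Classifier

Algorithm      | Numerical data handling | Comparison operators (numerical data) | Comparison operators (categorical data) | Logical operators
Ant-Classifier | No discretization       | ≦, ≧                                  | =, ≠                                    | AND, OR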

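Figure 5 describes the Ant-Classifier algorithm. The path traveled by each ant represents one IF-THEN rule. To generate a term for a continuous attribute, Ant-Classifier first randomly generates a base point and an operator, then randomly generates an interval within the search domain set in Ant-Classifier; the base point plus the interval is the bound value of the term, for example: variable(x) ≦ base point + interval. Ant-Classifier judges the performance of individuals with the performance index Q, whose components are described in Table 4. Ant-Classifier also performs rule pruning to remove meaningless terms, reducing the cost of data collection and evaluation. In addition, this project adds crossover and mutation operations to Ant-Classifier to increase the diversity of the evolving solutions. During mutation, Ant-Classifier adjusts the bound value of a continuous attribute according to the mutation probability, as shown below. Here $\Delta(T,R)$ is the adjustment to the interval; $R$ is the search domain; $\gamma$ is a random variable in [0, 1]; $T$ is the current generation expressed as a fraction of the preset maximum number of generations (the appendix code uses iteration/100); and $b$ is a system parameter that determines the degree of nonlinearity. Once $\Delta(T,R)$ has been generated, it is added to the original interval to give the new bound value.

$$\Delta(T,R) = \pm R \cdot \left(1 - \gamma^{(1-T)^{b}}\right)$$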
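The mutation step size can be sketched in Java as follows, reading the formula as $\Delta(T,R) = \pm R\,(1 - \gamma^{(1-T)^{b}})$ with $T$ taken as the fraction of generations elapsed. The class and method names are hypothetical, and the equal-probability choice of sign is an assumption; note that the appendix code multiplies $(1-T)$ by $b$ rather than raising it to the power $b$, which coincides with this sketch for $b = 1$.

```java
import java.util.Random;

/** Hypothetical sketch of the mutation step size Delta(T, R). */
final class MutationStep {
    /**
     * Magnitude R * (1 - gamma^((1 - T)^b)).
     * T is the fraction of generations elapsed, in [0, 1]; gamma is in [0, 1].
     */
    static double magnitude(double r, double gamma, double t, double b) {
        return r * (1.0 - Math.pow(gamma, Math.pow(1.0 - t, b)));
    }

    /** Signed step: the sign is chosen with equal probability (assumption). */
    static double step(Random rng, double r, double t, double b) {
        double m = magnitude(r, rng.nextDouble(), t, b);
        return rng.nextBoolean() ? m : -m;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // Early in the run (T = 0.1) steps can be large; late (T = 0.9) they shrink.
        System.out.println(step(rng, 0.5, 0.1, 1.0));
        System.out.println(step(rng, 0.5, 0.9, 1.0));
    }
}
```

Because the exponent $(1-T)^{b}$ falls to zero as $T$ approaches 1, the step magnitude shrinks toward zero late in the run, so the search moves from exploration to fine adjustment.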

Table 4. Components of the performance index

                  | Predicted class: Yes | Predicted class: No
Actual class: Yes | TP: True Positive    | FN: False Negative
Actual class: No  | FP: False Positive   | TN: True Negative

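Ant-Classifier differs from GA-Classifier in two ways: (1) Ant-Classifier has positive feedback: during evolution it raises or lowers the probability of exploring particular regions of the solution space according to the results obtained so far; (2) it assigns a specific number of local ants to prune rules.

This project implements in Java an Ant-Classifier intended to classify more accurately than the traditional decision tree, and compares its classification performance with that of GA-Classifier and the decision tree.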

TrainingSet = {all training cases};
DiscoveredRuleList = [];
WHILE (|TrainingSet| > Max_uncovered_cases)
    t = 1;
    j = 1;
    Create N regions;
    Initialize all regions with the same amount of pheromone;
    REPEAT
        (1) Send "G" global ants for crossover and mutation;
            a. Crossover:
                1. Use normal crossover to exchange each categorical attribute.
                2. Use crossover to exchange the base point of each continuous attribute.
            b. Mutation:
                1. Use general mutation to modify each categorical attribute.
                2. Use mutation to randomly add a value to, or subtract a value from, each continuous attribute with a mutation probability. The mutation step size is defined as follows:
                   $\Delta(T,R) = \pm R \cdot \left(1 - \gamma^{(1-T)^{b}}\right)$
        (2) Assign each newly created child region a trail value lying between the trail values of its parent regions, in proportion to the crossover point;
        (3) Use the G new regions to replace the G weakest regions;
        (4) Select regions according to their normalized trail values and send "L" local ants to do rule pruning;
        (5) Calculate the function value (Q) of each region;
            $Q = \frac{TP}{TP+FN} \cdot \frac{TN}{FP+TN}, \quad Q \in [0,1]$
        (6) Update the trail value;
            $\tau_{i}(t+1) = \tau_{i}(t) + \tau_{i}(t) \cdot Q$
        (7) Pheromone evaporates for each trail;
            $\tau_{i}(t+1) = \tau_{i}(t) \cdot \rho$
        (8) Sort regions according to trail value;
        (9) Choose the best rule Rt among all rules RN constructed by all the ants according to trail value;
        (10) If (Rt is equal to Rt-1) Then
                 j = j + 1;
             Else
                 j = 1;
             End If
        (11) t = t + 1;
    UNTIL (j > No_rules_coverge or t > No_generation)
    Choose the best rule Rbest among all rules Rt constructed by all the ants;
    Add rule Rbest to DiscoveredRuleList;
    TrainingSet = TrainingSet - {set of cases correctly covered by Rbest};
END WHILE

Figure 5. The Ant-Classifier algorithm
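Steps (5) to (7) of the algorithm, the quality index Q and the trail update, can be sketched in Java as below. The names `RuleQuality`, `q`, and `updateTrail` are hypothetical, the zero-denominator guard is an assumption added for safety, and applying reinforcement and evaporation in a single call is a simplification of the two separate steps in the pseudocode.

```java
/** Hypothetical sketch of the quality index Q and the trail update of a region. */
final class RuleQuality {
    /**
     * Q = TP / (TP + FN) * TN / (FP + TN), a value in [0, 1].
     * Returns 0 when either denominator is zero (assumed guard).
     */
    static double q(int tp, int fn, int fp, int tn) {
        if (tp + fn == 0 || fp + tn == 0) return 0.0;
        double sensitivity = (double) tp / (tp + fn);
        double specificity = (double) tn / (fp + tn);
        return sensitivity * specificity;
    }

    /** Step (6): tau + tau * Q, followed by step (7): multiply by rho. */
    static double updateTrail(double tau, double q, double rho) {
        double reinforced = tau + tau * q;
        return reinforced * rho;
    }

    public static void main(String[] args) {
        double quality = q(40, 10, 5, 45);
        System.out.println(quality);
        System.out.println(updateTrail(1.0, quality, 0.9));
    }
}
```

Since Q is the product of sensitivity and specificity, a rule scores well only if it both covers the cases of its predicted class and excludes the cases of the other classes.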

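Results and Discussion

This project implements in Java a GA-Classifier and an Ant-Classifier intended to classify more accurately than the traditional decision tree. To verify their classification ability, the project first pairs four common discretization techniques (C4.5 discretization, Boolean reasoning algorithm, Entropy/MDL algorithm, and equal frequency binning) with decision trees, and tests the classifiers on three standard data sets, Cleveland, Australian, and Iris. Table 5 describes the attributes of the three data sets.

The GA-Classifier parameters are: IPopSize=30, FPopSize=10, ΔPopSize=5, IPMut=0.9, $\rho_{red}$=2. The Ant-Classifier parameters are: N=50, G-Ants=20, L-Ants=5, No_generation=100, No_rules_coverge=10, R=0.5, ρ=0.9.

Each data set is tested with five-fold cross-validation to compare the classifiers, as illustrated in Figure 6. Table 6 lists the average classification error rate of each method. Table 6 shows that different discretization techniques give different classification results once the discretized data are analyzed by a decision tree. Ant-Classifier performs best on the Cleveland data set (41.7%) and on the Australian data set (11.5%), and GA-Classifier performs best on the Iris data set (2.4%). This suggests that pairing different discretization techniques with decision trees loses varying amounts of information, whereas soft computing algorithms retain the information hidden in the original data and reach better classification results. Overall, GA-Classifier and Ant-Classifier outperform the other methods, the one exception being the Australian data set, where equal frequency binning (12.2%) is slightly better than GA-Classifier (12.6%); Ant-Classifier performs slightly better overall than GA-Classifier. As for future work, because the genetic algorithm and the ant colony algorithm require considerable computation time, research will be directed at improving the computational efficiency of GA-Classifier and Ant-Classifier.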
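The five-fold cross-validation protocol can be sketched in Java as follows. It is an illustrative sketch only: the class `FiveFold`, the fixed random seed, and the round-robin dealing of shuffled indices into folds are assumptions, since the report does not state how the folds were formed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of a k-fold split over record indices. */
final class FiveFold {
    /** Shuffles indices 0..n-1 and deals them into k folds of near-equal size. */
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<List<Integer>>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<Integer>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    /** Training indices for one round: every fold except the test fold. */
    static List<Integer> trainingIndices(List<List<Integer>> folds, int testFold) {
        List<Integer> train = new ArrayList<Integer>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != testFold) {
                train.addAll(folds.get(f));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(150, 5, 42L); // e.g. the 150 Iris records
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("round " + f + ": test=" + folds.get(f).size()
                    + " train=" + trainingIndices(folds, f).size());
        }
    }
}
```

In each round one fold (1/5 of the data) serves as testing data and the remaining four folds (4/5) as training data; the error rate reported for a classifier is then the average over the five rounds.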

Table 5. Description of the data sets

Data set   | Categorical attributes | Continuous attributes | Records
Cleveland  | 7                      | 6                     | 303
Australian | 8                      | 6                     | 690
Iris       | 0                      | 4                     | 150

[Figure: each data set is split by 5-fold cross-validation into training data (4/5 of the data set) for the training stage and testing data (1/5 of the data set) for the testing stage. Both stages use the same six classifiers: decision trees with C4.5 discretisation, decision trees with equal frequency binning, decision trees with Entropy/MDL algorithm, decision trees with Boolean reasoning algorithm, GA-Classifier, and Ant-Classifier.]

Figure 6. Experimental framework

Table 6. Classification error rates

Method                                       | Cleveland | Australian | Iris
Ant-Classifier                               | 41.7%     | 11.5%      | 2.6%
GA-Classifier                                | 42.1%     | 12.6%      | 2.4%
Decision Trees with C4.5 discretisation      | 43.9%     | 13.8%      | 6.0%
Decision Trees with Equal frequency binning  | 47.2%     | 12.2%      | 5.3%
Decision Trees with Entropy/MDL algorithm    | 44.2%     | 14.9%      | 6.7%

References

Bandar, Z, H. Al-Attar and D. McLean, “Genetic algorithm based multiple decision tree

induction,” Proceedings of 6th International Conference on Neural Information

Processing, Piscataway, NJ, USA, pp.429-434 (1999).

Berry, M. J. A. and G. Linoff, Data Mining Techniques for Marketing, Sales, and Customer Support, John Wiley and Sons, New York, NY (1997).

Bruha, I., P. Kralik and P. Berka, “Genetic learner: Discretization and fuzzification of

numerical attributes,” Intelligent Data Analysis, vol:4, no:5, pp.445-460 (2000).

Cabena, P., P. O. Hadjinian, R. Stadler, J. Verhees and A. Zanasi, Discovering Data Mining from Concept to Implementation, Prentice Hall, New York, NY (1997).

Collard, M. and D. Francisci, “Evolutionary data mining: an overview of genetic-based algorithms,” Proceedings of 8th International Conference on Emerging Technologies and Factory Automation, Piscataway, NJ, USA, pp.3-9 (2001).

Congdon, C. B., “Classification of epidemiological data: a comparison of genetic algorithm

and decision tree approaches,” Proceedings of the 2000 Congress on Evolutionary

Computation, Piscataway, NJ, USA, pp.442-449 (2000).

Dorigo, M., V. Maniezzo and A. Colorni, “Ant system: optimization by a colony of

cooperating agents,” IEEE Transactions on Systems, Man, and Cybernetics—Part

B:Cybernetics, vol:26, no:1, pp.29-41 (1996).

Fidelis, M. V., H. S. Lopes and A. A. Freitas, “Discovering comprehensible classification

rules with a genetic algorithm,” Proceedings of the 2000 Congress on Evolutionary

Computation, Piscataway, NJ, USA, pp.805-810 (2000).

Fu, Z. and F. Mae, “A computational study of using genetic algorithms to develop intelligent

decision trees,” Proceedings of the 2001 Congress on Evolutionary Computation,

Piscataway, NJ, USA, pp.1382-1387 (2001).

Hall, C., “The devil’s in the details: techniques, tools, and application for database mining

and knowledge discovery part II,” Intelligent Software Strategies, vol:6, no:9, pp.1-16

(1995).

Lopes, F. M. and A. T. R. Pozo, “Genetic algorithm restricted by tabu lists in data mining,”

21st International Conference of the Chilean Computer Science Society, Los Alamitos,

CA, USA, pp.178-185 (2001).

Noda, E, A. A. Freitas and H. S. Lopes, “Discovering interesting prediction rules with a

genetic algorithm,” Proceedings of the 1999 Congress on Evolutionary Computation,

Piscataway, NJ, USA, pp.1322-1329 (1999).

Parpinelli, R. S., H. S. Lopes and A. A. Freitas, “Data mining with an ant colony optimization algorithm,” IEEE Transactions on Evolutionary Computation, vol:6, no:4, pp.321-332 (2002).

Pozo, A. R. and M. Hasse, “A genetic classifier tool,” Proceedings 20th International

Conference of the Chilean Computer Science Society, Los Alamitos, CA, USA, pp.14-23

(2000).

Shin, K. S. and Y. J. Lee, “A genetic algorithm application in bankruptcy prediction

modeling,” Expert Systems with Applications, vol:23, no:3, pp.321-328 (2002).

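王培珍, A Study on Applying Genetic Algorithms and Simulation to Dynamic Scheduling Problems (in Chinese), Master's thesis, Graduate Institute of Industrial Engineering, Chung Yuan Christian University (1996).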

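Self-Evaluation of Project Results

This project implemented in Java a GA-Classifier and an Ant-Classifier that classify more accurately than the traditional decision tree. The two classifiers are built on two soft computing algorithms (Genetic Algorithm and Ant System). Both handle numerical and categorical data simultaneously without discretizing numerical data, and extract useful IF-THEN rules from data to support managers' decisions. The project has been completed and its results have met the expected goals. Both classifiers achieved better classification results than traditional discretization techniques paired with decision trees, and Ant-Classifier performed better overall than GA-Classifier. The project team is satisfied with the results, considers them well suited to publication in an academic journal, and has begun writing a paper. Carrying out this National Science Council project also gave the participants a thorough understanding of data mining and soft computing algorithms, both highly valued in industry, strengthened their future employability, and markedly improved their programming skills.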

Appendix 1: Ant-Classifier Source Code

import java.io.*; import java.util.*; import java.sql.*;

public class ant {

public static void main(String[] args) throws java.io.IOException { /* --- parameter setting --- */ String SourcePath,TestPath,temp,key,sql,sql_TP,sql_FP,sql_FN,sql_TN,sql_cut,sql_update,FieldName,sql_select,temp_term1,temp_term2,sql_opti mal,sql_rules; int i,k=0,k1=0,k2=0,l,AttributeCount,DynamicAttribute,NoRegion=50,j,crossover_point=0,selection_point1=0,selection_point2=0,pheromone_ point=0,b=1,a=0,c=0,d=0,p[]=new int[6],ant_L=5,f,h,g,pruning_point=0,term_count=0,temp_ID=0,remaining_records=0,end_count=0; /*variable p 代表 pruning point*/

int MaxCategory[]; Connection MyConn,MyConn1,MyConn2; Statement s,s2,s3,s_optimal,s_rules,s_cut; Random random; double RandomValue,RandomValue1,CategoryValue,step_size,R=0.5,m_rate_c=0.6,m_rate_n=0.6,global_Q,temp_Q=0,initial_Q; ResultSet rs,rs1,rs2,rs_optimal,rs_rules,rs_cut;

float Q,TP=0,FP=0,FN=0,TN=0,update_p,iteration=1,total_pheromone=0,cdf[]=new float[51],global_solution=0; boolean check=false,check1=false,check2=false;

/* --- */

/* --- read the source data file --- */
InputStreamReader stdin=new InputStreamReader(System.in);
BufferedReader bufin=new BufferedReader(stdin);
System.out.print("Enter the absolute path of the data file: ");
SourcePath=bufin.readLine();
File SourceFile=new File(SourcePath);
/* --- */
/* --- learn the data structure --- */
System.out.print("Enter the number of attributes in the data file: ");
AttributeCount=Integer.parseInt(bufin.readLine());
System.out.print("Enter the number of numerical attributes in the data file: ");
DynamicAttribute=Integer.parseInt(bufin.readLine());
MaxCategory=new int[(AttributeCount+DynamicAttribute+2)]; /* the rule consequent must also vary; whether it should vary is not yet defined */
for (i=((DynamicAttribute*2)+1);i<=(AttributeCount+DynamicAttribute+1);i++) /* indices must match the column numbers of cleveland.candidate */
{
    System.out.print("Enter the maximum category code of categorical attribute "+(i-(DynamicAttribute*2))+" in the data file: ");
    MaxCategory[i]=Integer.parseInt(bufin.readLine());
}
/* --- */

String term[][]=new String[AttributeCount+DynamicAttribute+2][2]; double delta_Q[]=new double[AttributeCount+DynamicAttribute+1]; String optimal_term[]=new String[AttributeCount+DynamicAttribute+1];


/* --- 建立資料庫與原始資料表與結果資料庫 --- */ try { Class.forName("org.gjt.mm.mysql.Driver"); MyConn=DriverManager.getConnection("jdbc:mysql://localhost/MySQL","bmw0818","gilber"); s=MyConn.createStatement();

k=s.executeUpdate("create database cleveland"); sql="create table cleveland.source (";

for (l=1;l<=DynamicAttribute;l++) /*限制兩個以上的數值屬性*/ { FieldName="a"+l+" DOUBLE,"; sql=sql+FieldName; } for (l=DynamicAttribute+1;l<=AttributeCount;l++) { FieldName="a"+l+" TINYINT,"; sql=sql+FieldName; } FieldName="a"+(AttributeCount+1)+" TINYINT"; sql=sql+FieldName; sql=sql+")"; k=s.executeUpdate(sql); System.out.println("原始資料庫建立成功"); /* ---將資料輸入至資料庫--- */ BufferedReader Source=new BufferedReader(new FileReader(SourceFile)); while ((temp=Source.readLine())!=null)

{ sql="";

sql="insert cleveland.source values ("+temp+")"; k=s.executeUpdate(sql); remaining_records=remaining_records+1; /*計算原始資料筆數*/ } /* --- */ /* ---建立規則資料庫--- */ sql="create table cleveland.rules (";

for (l=1;l<=(DynamicAttribute*2);l++) {

FieldName="a"+l+" FLOAT DEFAULT 99,"; sql=sql+FieldName;

}

for (l=(DynamicAttribute*2)+1;l<=(AttributeCount+DynamicAttribute);l++) {

FieldName="a"+l+" TINYINT DEFAULT 99,"; sql=sql+FieldName;

}

FieldName="a"+(AttributeCount+DynamicAttribute+1)+" TINYINT DEFAULT 99,"; sql=sql+FieldName;

sql=sql+"Pheromone FLOAT DEFAULT 99, Q FLOAT DEFAULT 99, ID INT PRIMARY KEY AUTO_INCREMENT, CAL CHAR(1) DEFAULT 'F')";

k=s.executeUpdate(sql);

System.out.println("規則資料庫建立成功");


sql="create table cleveland.candidate (";

for (l=1;l<=(DynamicAttribute*2);l++) /*限制兩個以上的數值屬性*/ {

FieldName="a"+l+" FLOAT DEFAULT 99,"; sql=sql+FieldName;

}

for (l=(DynamicAttribute*2)+1;l<=(AttributeCount+DynamicAttribute);l++) {

FieldName="a"+l+" TINYINT DEFAULT 99,"; sql=sql+FieldName;

}

FieldName="a"+(AttributeCount+DynamicAttribute+1)+" TINYINT DEFAULT 99,"; sql=sql+FieldName;

sql=sql+"Pheromone FLOAT DEFAULT 99, Q FLOAT DEFAULT 99, ID INT PRIMARY KEY AUTO_INCREMENT, CAL CHAR(1) DEFAULT 'F')";

k=s.executeUpdate(sql);

System.out.println("候選件資料庫建立成功");

/* --- */

int runs;

while (remaining_records>10 && end_count<2) /**---Ant-Miner 演算法開始啟動---**/ { check1=false; /* ---將起始解輸入至起始解資料庫--- */ random=new Random(); for (i=1;i<=NoRegion;i++) /*50 個候選解集合*/ {

sql="insert cleveland.candidate values (";

sql_TP="select count(*) from cleveland.source where "; sql_TN="select count(*) from cleveland.source where ("; for (l=1;l<=(DynamicAttribute);l++) { if (random.nextDouble()<0.5) /*以 0.5 機率產生起始解中的條件項*/ { RandomValue=random.nextDouble(); if (RandomValue<=0.5) { RandomValue=random.nextDouble(); sql=sql+RandomValue+",-0.5,";

sql_TP=sql_TP+"a"+l+"<"+RandomValue+" and "+"a"+l+">"+(RandomValue-0.5)+" and "; sql_TN=sql_TN+"a"+l+">"+RandomValue+" or "+"a"+l+"<"+(RandomValue-0.5)+" or "; } else { RandomValue=random.nextDouble(); sql=sql+RandomValue+",+0.5,";

sql_TP=sql_TP+"a"+l+"<"+(RandomValue+0.5)+" and "+"a"+l+">"+RandomValue+" and "; sql_TN=sql_TN+"a"+l+">"+(RandomValue+0.5)+" or "+"a"+l+"<"+RandomValue+" or "; } check1=true; } else { sql=sql+"99,99,"; } } for (l=(DynamicAttribute*2)+1;l<=(AttributeCount+DynamicAttribute);l++) {


if (random.nextDouble()<0.5) /*以 0.5 機率產生起始解中的條件項*/ { RandomValue=random.nextDouble()*(MaxCategory[l]+1); /*限定最大亂數產生值*/ CategoryValue=Math.floor(RandomValue); sql=sql+CategoryValue+","; sql_TP=sql_TP+"a"+(l-DynamicAttribute)+"="+CategoryValue+" and "; if (l==(AttributeCount+DynamicAttribute)) { sql_TN=sql_TN+"a"+(l-DynamicAttribute)+"<>"+CategoryValue+") and "; } else { sql_TN=sql_TN+"a"+(l-DynamicAttribute)+"<>"+CategoryValue+" or "; } check1=true; } else { sql=sql+"99,"; } } if (sql_TN.substring((sql_TN.length()-4),sql_TN.length()).equals(" or ")) { sql_TN=sql_TN.substring(0,(sql_TN.length()-4)); sql_TN=sql_TN+") and "; } RandomValue=random.nextDouble()*(MaxCategory[AttributeCount+DynamicAttribute+1]+1); /*隨機產生預測值 */ CategoryValue=Math.floor(RandomValue); if (check1==false) { Q=0; } else { sql_FP=sql_TP+"a"+(AttributeCount+1)+"<>"+CategoryValue; rs=s.executeQuery(sql_FP); rs.first(); FP=rs.getFloat(1); sql_TP=sql_TP+"a"+(AttributeCount+1)+"="+CategoryValue; rs=s.executeQuery(sql_TP); rs.first(); TP=rs.getFloat(1); sql_FN=sql_TN+"a"+(AttributeCount+1)+"="+CategoryValue; rs=s.executeQuery(sql_FN); rs.first(); FN=rs.getFloat(1); sql_TN=sql_TN+"a"+(AttributeCount+1)+"<>"+CategoryValue; rs=s.executeQuery(sql_TN); rs.first(); TN=rs.getFloat(1);

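if ((TP+FN)==0 || (FP+TN)==0) /* division-by-zero guard; this condition is reconstructed because the original line is missing */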

{ Q=0; } else { Q=(TP/(TP+FN))*(TN/(FP+TN)); } } sql=sql+CategoryValue+","+(1+Q)+","+Q+",null,'T')"; k=s.executeUpdate(sql); } MyConn2=DriverManager.getConnection("jdbc:mysql://localhost/cleveland","bmw0818","gilber"); /*專供寫入規則資料庫之用*/ /*--- */ /*---演化過程開始 --- */ for (iteration=1;iteration<=100;iteration++) { /*---剔除表現較差之解---*/ sql_cut="select ID from cleveland.candidate order by Q limit 20"; /*--前二十小資料*/ rs=s.executeQuery(sql_cut);

sql_cut="delete from cleveland.candidate where ID IN ("; while (rs.next()) { if (rs.isLast()==true) { sql_cut=sql_cut+rs.getString(1); } else { sql_cut=sql_cut+rs.getString(1)+","; } } sql_cut=sql_cut+")"; k=s.executeUpdate(sql_cut); /*---*/ MyConn1=DriverManager.getConnection("jdbc:mysql://localhost/cleveland","bmw0818","gilber"); s=MyConn1.createStatement();

sql_select="select * from candidate"; rs=s.executeQuery(sql_select); rs1=s.executeQuery(sql_select); for (l=1;l<=20;l++) /*產生 20 個新候選解*/ { selection_point1=(int) Math.floor(random.nextDouble()*30)+1; selection_point2=(int) Math.floor(random.nextDouble()*30)+1; while (selection_point2==selection_point1) { selection_point2=(int) (Math.floor(random.nextDouble()*30)+1); } crossover_point=(int) (Math.floor(random.nextDouble()*(AttributeCount-1))+1); pheromone_point=crossover_point; if (crossover_point<=DynamicAttribute) { crossover_point=crossover_point*2;


} else { crossover_point=crossover_point+DynamicAttribute; } rs.moveToInsertRow(); rs1.absolute(selection_point1); update_p=rs1.getFloat((AttributeCount+DynamicAttribute+2));

/*---作 crossover & mutation ---*/ for (j=1;j<=crossover_point;j++)

{

if (j<=DynamicAttribute*2) {

if (j%2==0 && random.nextDouble()<=m_rate_n && rs1.getFloat(j-1)!=99) { step_size=R*(1-Math.pow(random.nextDouble(),(1-(iteration/100))*b)); /*每一世代演化 100 次*/ rs.updateFloat(j,(float) (rs1.getFloat(j)+step_size)); } else { if ((rs1.getFloat(j))!=99) { rs.updateFloat(j,rs1.getFloat(j)); } else { if (random.nextDouble()<=m_rate_n && j%2!=0) { check2=true; rs.updateFloat(j,random.nextFloat()); if (random.nextDouble()<=0.5) { rs.updateFloat(j+1,(-1)*random.nextFloat()); } else { rs.updateFloat(j+1,random.nextFloat()); } } else { if (check2!=true) rs.updateFloat(j,99); } } if (j%2==0) check2=false; } } else { if (random.nextDouble()<=m_rate_c) { CategoryValue=Math.floor(random.nextDouble()*(MaxCategory[j]+1)); /*限定最大亂數產生值*/ while (CategoryValue==rs1.getFloat(j)) {

CategoryValue=Math.floor(random.nextDouble()*(MaxCategory[j]+1));

} rs.updateFloat(j,(float) CategoryValue); } else { if ((rs1.getFloat(j))!=99) { rs.updateFloat(j,rs1.getFloat(j)); } else { rs.updateFloat(j,99); } } } } rs1.absolute(selection_point2); for (j=(crossover_point+1);j<=(AttributeCount+DynamicAttribute+1);j++) { if (j<=DynamicAttribute*2) {

if (j%2==0 && random.nextDouble()<=m_rate_n && rs1.getFloat(j-1)!=99) { step_size=R*(1-Math.pow(random.nextDouble(),(1-(iteration/100))*b)); /*每一世代演化 100 次*/ rs.updateFloat(j,(float) (rs1.getFloat(j)+step_size)); } else { if ((rs1.getFloat(j))!=99) { rs.updateFloat(j,rs1.getFloat(j)); } else { if (random.nextDouble()<=m_rate_n && j%2!=0) { check2=true; rs.updateFloat(j,random.nextFloat()); if (random.nextDouble()<=0.5) { rs.updateFloat(j+1,(-1)*random.nextFloat()); } else { rs.updateFloat(j+1,random.nextFloat()); } } else { if (check2!=true) rs.updateFloat(j,99); } } if (j%2==0) check2=false; } } else


{ if (random.nextDouble()<=m_rate_c) { CategoryValue=Math.floor(random.nextDouble()*(MaxCategory[j]+1)); /*限定最大亂數產生值*/ while (CategoryValue==rs1.getFloat(j)) { CategoryValue=Math.floor(random.nextDouble()*(MaxCategory[j]+1)); } rs.updateFloat(j,(float) CategoryValue); } else { if ((rs1.getFloat(j))!=99) { rs.updateFloat(j,rs1.getFloat(j)); } else { rs.updateFloat(j,99); } } } } update_p=update_p*((float) pheromone_point/(float)

AttributeCount)+rs1.getFloat((AttributeCount+DynamicAttribute+2))*(1-((float) pheromone_point/(float) AttributeCount)); rs.updateFloat((AttributeCount+DynamicAttribute+2),update_p);

rs.updateString((AttributeCount+DynamicAttribute+5),"F"); rs.insertRow();

/* --- crossover & mutation ending --- */ }

/* ---計算新解之 Q 值,並更新所有解之 Pheromone & prunung--- */ rs.close();

s=MyConn.createStatement();

sql_select="select sum(Pheromone) from cleveland.candidate"; rs=s.executeQuery(sql_select);

rs.first();

total_pheromone=rs.getFloat(1);

sql_select="select * from cleveland.candidate"; rs=s.executeQuery(sql_select); s2=MyConn.createStatement(); s3=MyConn.createStatement(); /* ---決定哪一筆資料需進行 pruning--- */ cdf[0]=0; d=1; while (rs.next()) { cdf[d]=cdf[d-1]+rs.getFloat(AttributeCount+DynamicAttribute+2); //System.out.println(d+"-"+cdf[d]); d=d+1; } for (d=1;d<=ant_L;d++) { p[0]=0; f=1; check=false; while (check==false)

{

RandomValue1=random.nextDouble()*cdf[50]; //System.out.println(RandomValue1); for (f=1;f<=(NoRegion-1);f++) {

if (RandomValue1>cdf[f] && RandomValue1<cdf[f+1] && f!=p[d-1]) { p[d]=f; check=true; //System.out.println(f); } } } } /*---決定哪一筆資料需進行 pruning---end----*/ rs.beforeFirst();

for (d=1;d<=NoRegion;d++) /*while (rs.next())*/ {
  rs.next();
  temp_ID=rs.getInt(22);
  // System.out.println(temp_ID);
  if (d==p[1] || d==p[2] || d==p[3] || d==p[4] || d==p[5]) /*doing rule pruning*/ {
    if (rs.getString(AttributeCount+DynamicAttribute+5).equals("F")) /* Q must be computed first; then prune and update the pheromone */
    {
      /* --- Load the row into the term array; term[0][*] is unused --- */
      for (a=1;a<=(AttributeCount+DynamicAttribute+1);a++) {
        if (rs.getFloat(a)!=99) {
          term[a][0]=String.valueOf(rs.getFloat(a));
          term[a][1]="T";
        } else {
          term[a][0]="99";
          term[a][1]="F";
        }
      }
      /* --- Row loaded into the term array --- */
      /* Count the remaining condition terms */
      term_count=0;
      for (g=1;g<DynamicAttribute*2;g=g+2) {
        if (term[g][1].equals("T")) {
          term_count=term_count+1;
        }
      }
      for (g=(DynamicAttribute*2+1);g<=AttributeCount+DynamicAttribute;g++) {
        if (term[g][1].equals("T")) {
          term_count=term_count+1;
        }
      }

      if (term_count>1) {
        /* Note: the repeated pruning procedure should probably start from here */
        sql_TP="select count(*) from cleveland.source where ";
        sql_TN="select count(*) from cleveland.source where (";
        c=0;
        for (a=1;a<(DynamicAttribute*2);a=a+2) {
          if (term[a][1].equals("T")) {
            if (Float.parseFloat(term[a+1][0])<=0) {
              sql_TP=sql_TP+"a"+(a-c)+"<"+term[a][0]+" and "
                  +"a"+(a-c)+">"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" and ";
              sql_TN=sql_TN+"a"+(a-c)+">"+term[a][0]+" or "
                  +"a"+(a-c)+"<"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" or ";
            } else {
              sql_TP=sql_TP+"a"+(a-c)+">"+term[a][0]+" and "
                  +"a"+(a-c)+"<"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" and ";
              sql_TN=sql_TN+"a"+(a-c)+"<"+term[a][0]+" or "
                  +"a"+(a-c)+">"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" or ";
            }
          }
          c=c+1;
        }
        for (a=(DynamicAttribute*2)+1;a<=(AttributeCount+DynamicAttribute);a++) {
          if (term[a][1].equals("T")) {
            sql_TP=sql_TP+"a"+(a-DynamicAttribute)+"="+term[a][0]+" and ";
            sql_TN=sql_TN+"a"+(a-DynamicAttribute)+"<>"+term[a][0]+" or ";
          }
        }
        if (sql_TN.substring((sql_TN.length()-4),sql_TN.length()).equals(" or ")) {
          sql_TN=sql_TN.substring(0,(sql_TN.length()-4));
        }
        sql_TN=sql_TN+") and ";
        sql_FN=sql_TN+"a"+(AttributeCount+1)+"="+term[AttributeCount+DynamicAttribute+1][0];
        sql_TN=sql_TN+"a"+(AttributeCount+1)+"<>"+term[AttributeCount+DynamicAttribute+1][0];
        sql_FP=sql_TP+"a"+(AttributeCount+1)+"<>"+term[AttributeCount+DynamicAttribute+1][0];
        sql_TP=sql_TP+"a"+(AttributeCount+1)+"="+term[AttributeCount+DynamicAttribute+1][0];
        rs2=s2.executeQuery(sql_FP);
        rs2.first();
        FP=rs2.getFloat(1);
        rs2=s2.executeQuery(sql_FN);
        rs2.first();
        FN=rs2.getFloat(1);
        rs2=s2.executeQuery(sql_TP);
        rs2.first();
        TP=rs2.getFloat(1);
        rs2=s2.executeQuery(sql_TN);
        rs2.first();
        TN=rs2.getFloat(1);

        if ((TP==0 && FN==0) || (TN==0 && FP==0)) {
          Q=0;
        } else {
          Q=(TP/(TP+FN))*(TN/(FP+TN));
        }
        // Q=(TP/(TP+FN))*(TN/(FP+TN));
        initial_Q=Q;
        sql_update="update cleveland.candidate set Q="+Q+",CAL='T' where ID="
            +rs.getFloat(AttributeCount+DynamicAttribute+4);
        k1=s3.executeUpdate(sql_update);
        /* Count the remaining condition terms */
        term_count=0;
        for (g=1;g<DynamicAttribute*2;g=g+2) {
          if (term[g][1].equals("T")) {
            term_count=term_count+1;
          }
        }
        for (g=(DynamicAttribute*2+1);g<=AttributeCount+DynamicAttribute;g++) {
          if (term[g][1].equals("T")) {
            term_count=term_count+1;
          }
        }
        if (term_count>=2) {
          temp_Q=Q;
          do {
            global_Q=-100;
            Arrays.fill(delta_Q,0,AttributeCount+DynamicAttribute+1,-2);
            for (h=1;h<=(AttributeCount+DynamicAttribute);h++) {
              temp_term1="";
              temp_term2="";
              if (h<=DynamicAttribute*2) {
                if (h%2==1) {
                  if (term[h][1].equals("T")) {
                    temp_term1=term[h][0];
                    temp_term2=term[h+1][0];
                    term[h][0]="99";
                    term[h][1]="F";
                    term[h+1][0]="99";
                    term[h+1][1]="F";
                    sql_TP="select count(*) from cleveland.source where ";
                    sql_TN="select count(*) from cleveland.source where (";
                    c=0;

                    for (a=1;a<(DynamicAttribute*2);a=a+2) {
                      if (term[a][1].equals("T")) {
                        if (Float.parseFloat(term[a+1][0])<=0) {
                          sql_TP=sql_TP+"a"+(a-c)+"<"+term[a][0]+" and "
                              +"a"+(a-c)+">"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" and ";
                          sql_TN=sql_TN+"a"+(a-c)+">"+term[a][0]+" or "
                              +"a"+(a-c)+"<"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" or ";
                        } else {
                          sql_TP=sql_TP+"a"+(a-c)+">"+term[a][0]+" and "
                              +"a"+(a-c)+"<"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" and ";
                          sql_TN=sql_TN+"a"+(a-c)+"<"+term[a][0]+" or "
                              +"a"+(a-c)+">"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" or ";
                        }
                      }
                      c=c+1;
                    }
                    for (a=(DynamicAttribute*2)+1;a<=(AttributeCount+DynamicAttribute);a++) {
                      if (term[a][1].equals("T")) {
                        sql_TP=sql_TP+"a"+(a-DynamicAttribute)+"="+term[a][0]+" and ";
                        sql_TN=sql_TN+"a"+(a-DynamicAttribute)+"<>"+term[a][0]+" or ";
                      }
                    }
                    if (sql_TN.substring((sql_TN.length()-4),sql_TN.length()).equals(" or ")) {
                      sql_TN=sql_TN.substring(0,(sql_TN.length()-4));
                    }
                    sql_TN=sql_TN+") and ";
                    sql_FN=sql_TN+"a"+(AttributeCount+1)+"="+term[AttributeCount+DynamicAttribute+1][0];
                    sql_TN=sql_TN+"a"+(AttributeCount+1)+"<>"+term[AttributeCount+DynamicAttribute+1][0];
                    sql_FP=sql_TP+"a"+(AttributeCount+1)+"<>"+term[AttributeCount+DynamicAttribute+1][0];
                    sql_TP=sql_TP+"a"+(AttributeCount+1)+"="+term[AttributeCount+DynamicAttribute+1][0];
                    // System.out.println(sql_FP);
                    rs2=s2.executeQuery(sql_FP);
                    rs2.first();
                    FP=rs2.getFloat(1);
                    // System.out.println(sql_FN);
                    rs2=s2.executeQuery(sql_FN);
                    rs2.first();
                    FN=rs2.getFloat(1);
                    // System.out.println(sql_TP);
                    rs2=s2.executeQuery(sql_TP);
                    rs2.first();
                    TP=rs2.getFloat(1);
                    // System.out.println(sql_TN);
                    rs2=s2.executeQuery(sql_TN);
                    rs2.first();
                    TN=rs2.getFloat(1);
                    if ((TP==0 && FN==0) || (TN==0 && FP==0)) {

                      Q=0; /* degenerate counts, handled as in the initial Q computation */
                    } else {
                      Q=(TP/(TP+FN))*(TN/(FP+TN));
                    }
                    term[h][0]=temp_term1;
                    term[h+1][0]=temp_term2;
                    term[h][1]="T";
                    term[h+1][1]="T";
                    delta_Q[h]=Q-temp_Q;
                    if (delta_Q[h]>global_Q) {
                      global_Q=delta_Q[h];
                    }
                  }
                }
              } else {
                if (term[h][1].equals("T")) {
                  temp_term1=term[h][0];
                  term[h][0]="99";
                  term[h][1]="F";
                  sql_TP="select count(*) from cleveland.source where ";
                  sql_TN="select count(*) from cleveland.source where (";
                  c=0;
                  for (a=1;a<(DynamicAttribute*2);a=a+2) {
                    if (term[a][1].equals("T")) {
                      if (Float.parseFloat(term[a+1][0])<=0) {
                        sql_TP=sql_TP+"a"+(a-c)+"<"+term[a][0]+" and "
                            +"a"+(a-c)+">"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" and ";
                        sql_TN=sql_TN+"a"+(a-c)+">"+term[a][0]+" or "
                            +"a"+(a-c)+"<"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" or ";
                      } else {
                        sql_TP=sql_TP+"a"+(a-c)+">"+term[a][0]+" and "
                            +"a"+(a-c)+"<"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" and ";
                        sql_TN=sql_TN+"a"+(a-c)+"<"+term[a][0]+" or "
                            +"a"+(a-c)+">"+(Float.parseFloat(term[a][0])+Float.parseFloat(term[a+1][0]))+" or ";
                      }
                    }
                    c=c+1;
                  }
                  for (a=(DynamicAttribute*2)+1;a<=(AttributeCount+DynamicAttribute);a++) {
                    if (term[a][1].equals("T")) {
                      sql_TP=sql_TP+"a"+(a-DynamicAttribute)+"="+term[a][0]+" and ";
                      sql_TN=sql_TN+"a"+(a-DynamicAttribute)+"<>"+term[a][0]+" or ";
                    }
                  }
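The listing above recomputes the rule quality Q = (TP/(TP+FN)) × (TN/(FP+TN)) inline each time a term is tentatively dropped, and records the change in `delta_Q`. The sketch below isolates that arithmetic so it can be checked without a database. The class `RuleQuality` and both method names are hypothetical and do not appear in the original program. The rule of pruning only when the gain is strictly positive is also an assumption, because the loop condition of the original `do` block is not visible in this excerpt.

```java
/** Hypothetical helper; not part of the original Ant-Classifier listing. */
final class RuleQuality {
    private RuleQuality() {}

    /**
     * Q = sensitivity * specificity, returning 0 when either class has no
     * examples, which mirrors the zero guard used in the listing.
     */
    static double quality(double tp, double fn, double tn, double fp) {
        if ((tp == 0 && fn == 0) || (tn == 0 && fp == 0)) {
            return 0.0;
        }
        return (tp / (tp + fn)) * (tn / (fp + tn));
    }

    /**
     * Given the Q obtained after removing each term in turn, returns the index
     * of the term whose removal gives the largest strictly positive gain over
     * currentQ, or -1 if no removal improves Q (assumed stopping rule).
     */
    static int bestTermToPrune(double currentQ, double[] qWithoutTerm) {
        int best = -1;
        double bestGain = 0.0;
        for (int i = 0; i < qWithoutTerm.length; i++) {
            double gain = qWithoutTerm[i] - currentQ;
            if (gain > bestGain) {
                bestGain = gain;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative counts only; they are not taken from the report's experiments.
        double q = quality(40, 10, 45, 5);
        System.out.println("Q = " + q);
        System.out.println("term to prune = "
                + bestTermToPrune(q, new double[] {0.70, 0.80, 0.75}));
    }
}
```

Keeping the computation in one method would also remove the need to repeat the zero guard at every place the listing evaluates Q.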
