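A Study of New Methods for Estimating Null Values in Relational Database Systems Based on Fuzzy Set Theory and Automatic Clustering Algorithms (Year 2)

National Science Council, Executive Yuan: Research Project Report

Research Report (Full Version)

Project type: Individual

Project number: NSC 95-2221-E-011-117-MY2

Project period: August 1, 2007 to July 31, 2008

Institution: Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Principal investigator: 陳錫明 (S. M. Chen)

Project participants: 李庭魁 (master's student, part-time assistant); 簡誌堯 (master's student, part-time assistant); 李立偉 (doctoral student, part-time assistant); 白世銘 (doctoral student, part-time assistant)

Disposition: This project report is open to public inquiry.

September 5, 2008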


Estimating Null Values in Relational Database Systems Based on the Fuzzy Set Theory and Automatic Clustering Techniques (2/2)

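Project number: NSC 95-2221-E-011-117-MY2. Project period: August 1, 2007 to July 31, 2008. Principal investigator: 陳錫明 (S. M. Chen), Professor, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology.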

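1. Chinese Abstract

This project is the second year of a two-year research project that aims to propose new methods for estimating null values in relational database systems. The method proposed in the second year estimates null values from the clustering results produced by the automatic clustering algorithm that we proposed in the first year. It only needs to process one of the resulting clusters rather than all of the data in the database system. We use an employee database to illustrate the method, which achieves a higher average estimation accuracy rate than the existing methods. In addition, in the second year we also propose a new method for estimating null values in relational databases whose attributes have negative dependency relationships. We use a Benz secondhand car database to illustrate this method, which achieves a higher average estimation accuracy rate than the existing methods for databases with negative dependency relationships between attributes.

Keywords: automatic clustering algorithm, cluster labels, fuzzy sets, fuzzification, linguistic terms, membership functions, null values, relational database systems.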

Abstract

This is the second-year project of a two-year project whose purpose is to present new methods for estimating null values in relational database systems. In the second year, we propose a new method for estimating null values in relational database systems based on the automatic clustering algorithm proposed in the first year of the project. The proposed method only needs to process one particular cluster instead of the whole database. Its average estimation accuracy rate is higher than that of the existing methods for estimating null values in relational database systems. Moreover, in the second year we also propose a new method for estimating null values in relational database systems that have negative dependency relationships between attributes. We use the "Benz secondhand car database" to illustrate how the proposed method estimates null values. The average estimation accuracy rate of this method is higher than that of the existing methods for estimating null values in relational database systems having negative dependency relationships between attributes.

Keywords: Automatic Clustering Algorithm, Cluster Labels, Fuzzy Sets, Fuzzification, Linguistic Terms, Membership Functions, Null Values, Relational Database Systems.

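2. Background and Objectives

This is a two-year research project that aims to propose new, more accurate methods for estimating null values in relational database systems. In the first year, we proposed a new automatic clustering algorithm for clustering the data in a relational database system. The algorithm needs neither a predefined number of clusters nor presorted data, which makes clustering more flexible. In the second year, we propose a new method for estimating null values in relational database systems based on that algorithm. The method first clusters the data in the database by similarity and then estimates the null values. Most existing clustering methods require the user to define the number of clusters in advance and require all of the data to be sorted; that is, they are sensitive to the order of the input data, so the same clustering algorithm may produce very different clustering results when the data are supplied in different orders. The automatic clustering algorithm proposed in this project overcomes these drawbacks. The estimation method proposed in the second year works from the clustering results of the first-year algorithm, so it only needs to process one of the resulting clusters rather than all of the data in the database system. We use an employee database to illustrate the method, which achieves a higher average estimation accuracy rate than the existing methods. In addition, in the second year we also propose a new method for estimating null values in relational databases whose attributes have negative dependency relationships. We use a Benz secondhand car database to illustrate this method, which achieves a higher average estimation accuracy rate than the existing methods for databases with negative dependency relationships between attributes.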

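3. Research Methods and Results

In the second year of this project, we cluster the data in a relational database system with the automatic clustering algorithm proposed in the first year and then estimate null values with a new method. Table 1 shows a salary database with three attributes: Degree, Experience, and Salary. In this database, an employee's salary (Salary) is determined by the employee's degree (Degree) and years of experience (Experience); that is, Degree and Experience are independent variables and Salary is the dependent variable. To estimate null values, we use the concept of multiple regression analysis from statistics [2], [34]. Multiple regression analysis studies the influence of two or more independent variables on a dependent variable. It is essentially an extension of simple regression analysis; because it takes into account several independent variables that affect the dependent variable, it predicts more accurately than simple regression analysis [34]. The case of two independent variables can be generalized to several independent variables in order to determine the relationships among two or more variables. In other words, we apply here the regression coefficients, which express the explanatory power of each variable (this report also refers to them as coefficients of determination). The multiple regression formula is defined [34] as follows: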

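Definition 1: Let $X_1, X_2, \ldots, X_k$ be independent variables and let $Y$ be the dependent variable. The multiple regression formula is

$$Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k, \tag{1}$$

where $\alpha$ is the Y-axis intercept of the line and $\beta_1, \beta_2, \ldots, \beta_k$ are the regression coefficients.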

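Definition 2: Let $X$ and $Z$ be two independent variables and let $Y$ be the dependent variable, where $Y = \alpha + \beta_1 X + \beta_2 Z$. Then

$$\beta_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}, \tag{2}$$

$$\beta_2 = \frac{\sum_i (Z_i - \bar{Z})(Y_i - \bar{Y})}{\sum_i (Z_i - \bar{Z})^2}, \tag{3}$$

where $X_i$ is the $i$-th value of the independent variable $X$, $\bar{X}$ is the mean of $X$, $Z_i$ is the $i$-th value of the independent variable $Z$, $\bar{Z}$ is the mean of $Z$, $Y_i$ is the $i$-th value of the dependent variable $Y$, $\bar{Y}$ is the mean of $Y$, and $\beta_1$ and $\beta_2$ are the regression coefficients.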
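As a minimal sketch of Eqs. (2) and (3), the function below computes one coefficient at a time about the means, exactly as the definition is written. The function name `regression_coefficient` and the sample values are hypothetical and are not taken from the report.

```python
from statistics import fmean

def regression_coefficient(xs, ys):
    """Eqs. (2)/(3): coefficient of ys on xs, computed about the two means."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Hypothetical data in which y = 2x + 1 and z = 2x (so y = z + 1).
xs = [1, 2, 3, 4]
zs = [2, 4, 6, 8]
ys = [3, 5, 7, 9]

beta1 = regression_coefficient(xs, ys)  # 2.0
beta2 = regression_coefficient(zs, ys)  # 1.0
```

Each coefficient depends only on its own variable and the dependent variable, so the two calls are independent of each other.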

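Based on these concepts, we redefine the method for estimating null values. First, we design the estimation formula:

$$\mathrm{Salary}_{\mathrm{estimated}} = \mathrm{Sal}_{\mathrm{center}_i} + W_d \, \beta_1 \, (\mathit{deg} - \mathrm{Deg}_{\mathrm{center}_i}) + W_e \, \beta_2 \, (\mathit{exp} - \mathrm{Exp}_{\mathrm{center}_i}), \tag{4}$$

where $\mathrm{Salary}_{\mathrm{estimated}}$ is the estimated salary; $\mathrm{Sal}_{\mathrm{center}_i}$, $\mathrm{Deg}_{\mathrm{center}_i}$, and $\mathrm{Exp}_{\mathrm{center}_i}$ are the values of the attributes Salary, Degree, and Experience at the cluster center $C_i$ of the $i$-th cluster $\mathrm{Cluster}_i$; and $W_d$ and $W_e$ are the weights obtained by normalizing $\beta_1$ and $\beta_2$, with $W_d + W_e = 1$ and

$$W_d = \frac{\beta_1}{\beta_1 + \beta_2}, \tag{5}$$

$$W_e = \frac{\beta_2}{\beta_1 + \beta_2}. \tag{6}$$

Next, we design the formulas for $\beta_1$ and $\beta_2$ in Eq. (4). Starting from Eqs. (2) and (3), we replace the mean values with the values of the attribute centers. Although the mean is the most useful measure for describing a data set, it is not the only one; when the gap between the maximum and minimum values in the data is large, the mean makes it hard to see the actual distribution of the data. To overcome this drawback, we replace the mean with the center, so Eqs. (2) and (3) become Eqs. (7) and (8):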

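$$\beta_1 = \frac{\sum_i (X_i - X_{\mathrm{center}})(Y_i - Y_{\mathrm{center}})}{\sum_i (X_i - X_{\mathrm{center}})^2}, \tag{7}$$

$$\beta_2 = \frac{\sum_i (Z_i - Z_{\mathrm{center}})(Y_i - Y_{\mathrm{center}})}{\sum_i (Z_i - Z_{\mathrm{center}})^2}, \tag{8}$$

where $X_i$ is the $i$-th value of the independent variable $X$, $Z_i$ is the $i$-th value of the independent variable $Z$, $Y_i$ is the $i$-th value of the dependent variable $Y$, $X_{\mathrm{center}}$ is the center value of attribute $X$, $Y_{\mathrm{center}}$ is the center value of attribute $Y$, and $Z_{\mathrm{center}}$ is the center value of attribute $Z$. The formulas for $\beta_1$ and $\beta_2$ in Eq. (4) are therefore: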

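$$\beta_1 = \frac{\sum_i (\mathit{deg}_i - \mathrm{Deg}_{\mathrm{center}})(\mathit{sal}_i - \mathrm{Sal}_{\mathrm{center}})}{\sum_i (\mathit{deg}_i - \mathrm{Deg}_{\mathrm{center}})^2}, \tag{9}$$

$$\beta_2 = \frac{\sum_i (\mathit{exp}_i - \mathrm{Exp}_{\mathrm{center}})(\mathit{sal}_i - \mathrm{Sal}_{\mathrm{center}})}{\sum_i (\mathit{exp}_i - \mathrm{Exp}_{\mathrm{center}})^2}. \tag{10}$$

Based on these formulas, we propose a new method for estimating null values, as follows: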

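Step 1: Cluster the $n$ known tuples on the values of their $m$ attributes with our new automatic clustering algorithm. Assume that $k$ clusters $\mathrm{Cluster}_1, \mathrm{Cluster}_2, \ldots, \mathrm{Cluster}_k$ are produced:

$$\begin{aligned}
\mathrm{Cluster}_1 &= \{(\mathrm{Deg}_{1,1}, \mathrm{Exp}_{1,1}, \mathrm{Sal}_{1,1}), (\mathrm{Deg}_{1,2}, \mathrm{Exp}_{1,2}, \mathrm{Sal}_{1,2}), \ldots, (\mathrm{Deg}_{1,p_1}, \mathrm{Exp}_{1,p_1}, \mathrm{Sal}_{1,p_1})\},\\
\mathrm{Cluster}_2 &= \{(\mathrm{Deg}_{2,1}, \mathrm{Exp}_{2,1}, \mathrm{Sal}_{2,1}), (\mathrm{Deg}_{2,2}, \mathrm{Exp}_{2,2}, \mathrm{Sal}_{2,2}), \ldots, (\mathrm{Deg}_{2,p_2}, \mathrm{Exp}_{2,p_2}, \mathrm{Sal}_{2,p_2})\},\\
&\;\;\vdots\\
\mathrm{Cluster}_k &= \{(\mathrm{Deg}_{k,1}, \mathrm{Exp}_{k,1}, \mathrm{Sal}_{k,1}), (\mathrm{Deg}_{k,2}, \mathrm{Exp}_{k,2}, \mathrm{Sal}_{k,2}), \ldots, (\mathrm{Deg}_{k,p_k}, \mathrm{Exp}_{k,p_k}, \mathrm{Sal}_{k,p_k})\},
\end{aligned}$$

where $1 \le k \le n$, $(\mathrm{Deg}_{i,j}, \mathrm{Exp}_{i,j}, \mathrm{Sal}_{i,j})$ denotes the $j$-th tuple of the $i$-th cluster $\mathrm{Cluster}_i$, and $p_i$ denotes the number of tuples in $\mathrm{Cluster}_i$.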

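Step 2: Obtain the cluster center of each cluster. For the attribute Degree, which is not included in the distance calculation, the cluster center is taken to be the mean value.

Step 3: According to Eqs. (9) and (10), compute for each cluster the coefficient of determination of the attribute Degree with respect to the attribute Salary, which contains the null value, and the coefficient of determination of the attribute Experience with respect to Salary. Then normalize $\beta_1$ and $\beta_2$ according to Eqs. (5) and (6) to obtain the weights $W_d$ and $W_e$.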

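Step 4: Let $(\mathit{deg}, \mathit{exp})$ be the Degree-Experience coordinates of the employee tuple whose Salary is null. Compute the Euclidean distance between $(\mathit{deg}, \mathit{exp})$ and the Degree-Experience coordinates of each cluster center to obtain the distance $\mathrm{Dist}_i$ to the cluster center $C_i$ of the $i$-th cluster $\mathrm{Cluster}_i$:

$$\mathrm{Dist}_i = \sqrt{(\mathrm{Deg}_{i\_\mathrm{center}} - \mathit{deg})^2 + (\mathrm{Exp}_{i\_\mathrm{center}} - \mathit{exp})^2}, \tag{11}$$

where $\mathrm{Deg}_{i\_\mathrm{center}}$ and $\mathrm{Exp}_{i\_\mathrm{center}}$ are the values of the attributes Degree and Experience at the cluster center of $\mathrm{Cluster}_i$, and $1 \le i \le k$. After computing the distances between $(\mathit{deg}, \mathit{exp})$ and all cluster centers, we assign the employee tuple with the null value to the cluster with the smallest distance; that is, we assume that this employee's tuple is a member of that cluster. Let $\mathrm{Cluster}_i$, where $1 \le i \le k$, be the cluster nearest to $(\mathit{deg}, \mathit{exp})$.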

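Step 5: According to Eq. (4), use the Degree and Experience values of the tuples in the $i$-th cluster $\mathrm{Cluster}_i$ to estimate the null Salary value:

$$\mathrm{Salary}_{\mathrm{estimated}} = \mathrm{Sal}_{\mathrm{center}_i} + W_d \, \beta_1 \, (\mathit{deg} - \mathrm{Deg}_{\mathrm{center}_i}) + W_e \, \beta_2 \, (\mathit{exp} - \mathrm{Exp}_{\mathrm{center}_i}),$$

where $\mathrm{Sal}_{\mathrm{center}_i}$, $\mathrm{Deg}_{\mathrm{center}_i}$, and $\mathrm{Exp}_{\mathrm{center}_i}$ are the values of the attributes Salary, Degree, and Experience at the cluster center $C_i$ of the $i$-th cluster $\mathrm{Cluster}_i$.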
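Steps 3 to 5 can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: the clusters are taken as given (the first-year clustering algorithm is not reproduced), the cluster center is approximated by the attribute means, Degree is assumed to be already numeric, and the weights use the plain normalization $\beta_j / (\beta_1 + \beta_2)$. The names `slope_about_center`, `estimate_salary`, and `mean_center`, as well as the data, are hypothetical.

```python
import math

def slope_about_center(xs, ys, x_center, y_center):
    """Eqs. (9)/(10): coefficient computed about the cluster center."""
    num = sum((x - x_center) * (y - y_center) for x, y in zip(xs, ys))
    den = sum((x - x_center) ** 2 for x in xs)
    # Assumption: an attribute that never leaves its center contributes nothing.
    return num / den if den else 0.0

def mean_center(rows):
    """Stand-in for the cluster center: the mean of every attribute."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def estimate_salary(clusters, deg, exp):
    """clusters: list of dicts with 'rows' [(deg, exp, sal), ...] and 'center'."""
    # Step 4, Eq. (11): nearest center in the Degree-Experience plane.
    nearest = min(
        clusters,
        key=lambda c: math.hypot(c["center"][0] - deg, c["center"][1] - exp),
    )
    deg_c, exp_c, sal_c = nearest["center"]
    degs, exps, sals = zip(*nearest["rows"])
    # Step 3: coefficients (Eqs. 9-10) and normalized weights (Eqs. 5-6).
    b1 = slope_about_center(degs, sals, deg_c, sal_c)
    b2 = slope_about_center(exps, sals, exp_c, sal_c)
    total = b1 + b2
    # Assumption: fall back to equal weights when b1 + b2 is zero.
    w_d, w_e = (b1 / total, b2 / total) if total else (0.5, 0.5)
    # Step 5, Eq. (4).
    return sal_c + w_d * b1 * (deg - deg_c) + w_e * b2 * (exp - exp_c)

# Hypothetical clusters; Degree is coded 1, 2, 3 purely for illustration.
rows_a = [(1, 1.0, 20000), (2, 2.0, 30000), (3, 3.0, 40000)]
rows_b = [(3, 8.0, 64000), (3, 10.0, 68000)]
clusters = [
    {"rows": rows_a, "center": mean_center(rows_a)},
    {"rows": rows_b, "center": mean_center(rows_b)},
]
print(estimate_salary(clusters, 2, 2.5))  # 32500.0
```

In this toy case the query falls in the first cluster, both coefficients equal 10000, both weights equal 0.5, and only the half-year difference in Experience moves the estimate away from the center salary of 30000.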

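Once the employee's salary has been estimated with the method above, the estimated error is computed with Eq. (12):

$$\text{Estimated error} = \frac{\text{estimated value} - \text{actual value}}{\text{actual value}} \times 100\%. \tag{12}$$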
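Eq. (12) is a signed relative error expressed in percent. A minimal sketch follows, using the actual salary (63,000) and the proposed method's estimate (63,400) listed for EMP-ID 1 in Table 1; the function name `estimated_error` is hypothetical.

```python
def estimated_error(estimated, actual):
    """Eq. (12): signed relative error of an estimate, in percent."""
    return (estimated - actual) / actual * 100.0

# EMP-ID 1 in Table 1: actual salary 63,000, proposed estimate 63,400.
print(round(estimated_error(63400.0, 63000.0), 3))  # 0.635
```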

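We use the proposed method to estimate the Salary of each employee in the employee database shown in Table 1 and compare the estimates with those of the methods presented in [8], [11], and [21]. Table 1 lists the estimated salary of each employee and its estimated error for every method. Table 1 shows that the method proposed in the second year of this project has a higher average estimation accuracy rate for estimating null values in relational database systems than the methods presented in [8], [11], and [21].

In the second year of this project, we also propose, based on the automatic clustering algorithm from the first year, a new method for estimating null values in relational databases that have negative dependency relationships between attributes. We use the Benz secondhand car database [24] shown in Table 2 as an example. Table 2 shows that the attributes Year and Price are negatively dependent, because the older a car is, the lower its price.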

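Table 1. Comparison of the estimation results of Chen-and-Chen's method [8], Chen-and-Yeh's method [11], Chen-and-Hsiao's method [21], and the proposed method

| EMP-ID | Degree | Experience | Salary | [8] Salary (estimated) | [8] Estimated error (%) | [11] Salary (estimated) | [11] Estimated error (%) | [21] Salary (estimated) | [21] Estimated error (%) | Proposed: Salary (estimated) | Proposed: Estimated error (fraction) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ph.D. | 7.2 | 63,000 | 63000 | +0.00 | 65000 | +3.17 | 63000.00 | +0.000 | 63400.000 | +0.006 |
| 2 | Master | 2 | 37,000 | 33711 | -8.89 | 30704 | -17.02 | 37002.256 | -0.006 | 37000.000 | +0.000 |
| 3 | Bachelor | 7 | 40,000 | 46648 | +16.62 | 35000 | -12.50 | 39665.621 | -0.836 | 39672.475 | -0.008 |
| 4 | Ph.D. | 1.2 | 47,000 | 36216 | -22.94 | 46000 | -2.13 | 47000.00 | +0.000 | 47000.000 | +0.000 |
| 5 | Master | 7.5 | 53,000 | 56200 | +6.04 | 54500 | +2.83 | 52431.645 | -1.072 | 53000.000 | +0.000 |
| 6 | Bachelor | 1.5 | 26,000 | 27179 | +4.53 | 26346 | +1.33 | 26000.000 | +0.000 | 26000.000 | +0.000 |
| 7 | Bachelor | 2.3 | 29,000 | 29195 | +0.67 | 28500 | -1.72 | 29000.000 | +0.000 | 29000.000 | +0.000 |
| 8 | Ph.D. | 2 | 50,000 | 39861 | -20.28 | 50000 | +0.00 | 53105.080 | +6.210 | 50433.043 | +0.009 |
| 9 | Ph.D. | 3.8 | 54,000 | 48061 | -11.00 | 55000 | +1.85 | 53881.920 | -0.219 | 54000.000 | +0.000 |
| 10 | Bachelor | 3.5 | 35,000 | 32219 | +7.95 | 31538 | -9.89 | 35624.436 | +1.784 | 35000.000 | +0.000 |
| 11 | Master | 3.5 | 40,000 | 40544 | +1.36 | 41590 | +3.98 | 40673.849 | +1.685 | 40163.991 | +0.004 |
| 12 | Master | 3.6 | 41,000 | 41000 | +0.00 | 45159 | +10.14 | 40660.005 | -0.829 | 40164.005 | -0.020 |
| 13 | Master | 10 | 68,000 | 64533 | -5.10 | 65000 | -4.41 | 68000.000 | +0.000 | 68000.000 | +0.000 |
| 14 | Ph.D. | 5 | 57,000 | 55666 | -2.34 | 55000 | -3.51 | 54399.814 | -4.562 | 57000.000 | +0.000 |
| 15 | Bachelor | 5 | 36,000 | 35999 | -0.00 | 35000 | -2.78 | 35374.435 | -1.738 | 36000.000 | +0.000 |
| 16 | Master | 6.2 | 50,000 | 51866 | +3.73 | 48600 | -2.80 | 51870.594 | +3.741 | 51090.526 | +0.022 |
| 17 | Bachelor | 0.5 | 23,000 | 24659 | +7.21 | 25000 | +8.70 | 23000.000 | +0.000 | 23000.000 | +0.000 |
| 18 | Master | 7.2 | 55,000 | 55200 | +0.36 | 52400 | -4.73 | 52302.172 | -4.905 | 55000.00 | +0.000 |
| 19 | Master | 6.5 | 51,000 | 52866 | +3.66 | 49500 | -2.94 | 52000.067 | +1.961 | 51113.226 | +0.002 |
| 20 | Ph.D. | 7.8 | 65,000 | 65000 | +0.00 | 65000 | +0.00 | 64200.000 | -1.231 | 63800.000 | -0.019 |
| 21 | Master | 8.1 | 64,000 | 58200 | -9.06 | 58700 | -8.28 | 64800.000 | +1.250 | 64000.000 | +0.000 |
| 22 | Ph.D. | 8.5 | 70,000 | 67333 | -3.81 | 65000 | -7.14 | 70000.000 | +0.000 | 70000.000 | +0.000 |
| Average estimated error (%) | | | | | 6.16% | | 5.08% | | 1.456% | | 0.41% |

The estimated errors of the three reference methods are listed in percent. The errors listed for the proposed method are fractions of the actual salary (for example, +0.006 for EMP-ID 1 is about +0.6%), which agrees with the reported average of 0.41%.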
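The 0.41% average for the proposed method can be checked directly from the salaries and estimates listed in Table 1. The sketch below assumes that the average is taken over the absolute values of the Eq. (12) errors; that assumption reproduces the reported figure. The variable names are hypothetical.

```python
# (actual salary, estimate by the proposed method) for the eight rows of
# Table 1 whose estimate differs from the actual salary.
inexact = [
    (63000, 63400.000), (40000, 39672.475), (50000, 50433.043),
    (40000, 40163.991), (41000, 40164.005), (50000, 51090.526),
    (51000, 51113.226), (65000, 63800.000),
]
n_employees = 22  # the other 14 estimates in Table 1 equal the actual salary

# Assumption: the average is the mean of the absolute percentage errors.
total_abs_error = sum(abs(est - act) / act * 100.0 for act, est in inexact)
average_error = total_abs_error / n_employees
print(f"{average_error:.2f}%")  # 0.41%
```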

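Table 2. The Benz secondhand car database [24]

| Car-ID | Style | Year | C.C. | Price | Car-ID | Style | Year | C.C. | Price | Car-ID | Style | Year | C.C. | Price | Car-ID | Style | Year | C.C. | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C | 9 | 1800 | 590000 | 51 | C | 5 | 2400 | 965000 | 101 | E | 9 | 2200 | 660000 | 151 | E | 5 | 3200 | 1420000 |
| 2 | C | 9 | 1800 | 580000 | 52 | C | 5 | 2400 | 1070000 | 102 | E | 9 | 2200 | 680000 | 152 | E | 4 | 2000 | 1190000 |
| 3 | C | 9 | 1800 | 600000 | 53 | C | 5 | 2400 | 1100000 | 103 | E | 9 | 2800 | 780000 | 153 | E | 4 | 2000 | 1080000 |
| 4 | C | 9 | 2200 | 586000 | 54 | C | 5 | 2400 | 1160000 | 104 | E | 9 | 2800 | 710000 | 154 | E | 4 | 2800 | 1580000 |
| 5 | C | 9 | 2200 | 586000 | 55 | C | 5 | 2400 | 990000 | 105 | E | 9 | 2800 | 720000 | 155 | E | 4 | 2800 | 1780000 |
| 6 | C | 9 | 2200 | 590000 | 56 | C | 5 | 2600 | 550000 | 106 | E | 9 | 3200 | 650000 | 156 | E | 4 | 3200 | 680000 |
| 7 | C | 9 | 2200 | 638000 | 57 | C | 5 | 2800 | 1050000 | 107 | E | 9 | 3200 | 658000 | 157 | E | 3 | 2000 | 1350000 |
| 8 | C | 9 | 2200 | 635000 | 58 | C | 5 | 2800 | 1260000 | 108 | E | 9 | 3200 | 650000 | 158 | E | 3 | 2000 | 1180000 |
| 9 | C | 9 | 2200 | 608000 | 59 | C | 5 | 2800 | 1150000 | 109 | E | 9 | 3200 | 760000 | 159 | E | 3 | 2400 | 1480000 |
| 10 | C | 9 | 2200 | 620000 | 60 | C | 5 | 2800 | 1150000 | 110 | E | 9 | 3200 | 720000 | 160 | E | 3 | 2400 | 1480000 |
| 11 | C | 9 | 2200 | 620000 | 61 | C | 4 | 2300 | 1150000 | 111 | E | 8 | 2200 | 750000 | 161 | E | 3 | 2400 | 1630000 |
| 12 | C | 9 | 2200 | 650000 | 62 | C | 4 | 2400 | 1210000 | 112 | E | 8 | 2200 | 760000 | 162 | E | 3 | 2400 | 1380000 |
| 13 | C | 9 | 2200 | 690000 | 63 | C | 4 | 2400 | 1080000 | 113 | E | 8 | 2800 | 725000 | 163 | E | 3 | 2400 | 1380000 |
| 14 | C | 9 | 2200 | 720000 | 64 | C | 4 | 2800 | 1190000 | 114 | E | 7 | 2300 | 920000 | 164 | E | 2 | 2400 | 760000 |
| 15 | C | 9 | 2800 | 728000 | 65 | C | 4 | 2800 | 1260000 | 115 | E | 7 | 2300 | 980000 | 165 | E | 2 | 2400 | 760000 |
| 16 | C | 9 | 2800 | 690000 | 66 | C | 3 | 1800 | 1060000 | 116 | E | 7 | 2300 | 999000 | 166 | E | 2 | 2600 | 1680000 |
| 17 | C | 9 | 2800 | 800000 | 67 | C | 3 | 1800 | 1100000 | 117 | E | 7 | 2300 | 1080000 | 167 | E | 2 | 2600 | 1760000 |
| 18 | C | 9 | 2800 | 835000 | 68 | C | 3 | 2000 | 614000 | 118 | E | 7 | 3200 | 1080000 | 168 | E | 2 | 2600 | 1730000 |
| 19 | C | 8 | 2200 | 780000 | 69 | C | 3 | 2000 | 510000 | 119 | E | 7 | 3200 | 1280000 | 169 | E | 2 | 2800 | 1690000 |
| 20 | C | 8 | 1800 | 660000 | 70 | C | 3 | 2000 | 880000 | 120 | E | 7 | 3200 | 1030000 | 170 | E | 2 | 2800 | 1750000 |
| 21 | C | 8 | 1800 | 630000 | 71 | C | 3 | 2000 | 728000 | 121 | E | 7 | 3200 | 1280000 | 171 | E | 2 | 2800 | 2080000 |
| 22 | C | 8 | 2200 | 720000 | 72 | C | 3 | 2000 | 550000 | 122 | E | 7 | 3200 | 1150000 | 172 | E | 2 | 2800 | 1580000 |
| 23 | C | 8 | 2200 | 738000 | 73 | C | 3 | 2300 | 1150000 | 123 | E | 7 | 3200 | 1280000 | 173 | E | 2 | 3200 | 930000 |
| 24 | C | 8 | 2200 | 780000 | 74 | C | 3 | 2300 | 1350000 | 124 | E | 7 | 3200 | 1180000 | 174 | E | 2 | 3200 | 930000 |
| 25 | C | 8 | 2200 | 730000 | 75 | C | 3 | 2300 | 1250000 | 125 | E | 7 | 3200 | 1250000 | 175 | E | 1 | 2000 | 1680000 |
| 26 | C | 8 | 2200 | 720000 | 76 | C | 3 | 2400 | 1180000 | 126 | E | 7 | 3200 | 1150000 | 176 | E | 1 | 2600 | 1900000 |
| 27 | C | 8 | 2200 | 700000 | 77 | C | 3 | 2800 | 1290000 | 127 | E | 6 | 2300 | 1090000 | 177 | E | 1 | 2800 | 2050000 |
| 28 | C | 7 | 1800 | 690000 | 78 | C | 3 | 2800 | 1290000 | 128 | E | 6 | 2300 | 1250000 | 178 | E | 1 | 3200 | 2330000 |
| 29 | C | 7 | 2200 | 790000 | 79 | C | 3 | 3200 | 1900000 | 129 | E | 6 | 2300 | 1120000 | 179 | S | 1 | 3200 | 900000 |
| 30 | C | 7 | 1800 | 620000 | 80 | C | 2 | 2500 | 1500000 | 130 | E | 6 | 2300 | 980000 | 180 | S | 11 | 3200 | 820000 |
| 31 | C | 7 | 2200 | 850000 | 81 | C | 2 | 2600 | 1280000 | 131 | E | 6 | 2300 | 1050000 | 181 | S | 10 | 3200 | 788000 |
| 32 | C | 7 | 2200 | 830000 | 82 | C | 2 | 2600 | 1360000 | 132 | E | 6 | 2300 | 990000 | 182 | S | 10 | 3200 | 698000 |
| 33 | C | 7 | 2200 | 860000 | 83 | C | 2 | 2600 | 1490000 | 133 | E | 6 | 2300 | 1290000 | 183 | S | 10 | 3200 | 790000 |
| 34 | C | 7 | 2200 | 830000 | 84 | C | 2 | 2600 | 1500000 | 134 | E | 6 | 2300 | 1150000 | 184 | S | 10 | 3200 | 790000 |
| 35 | C | 7 | 2200 | 760000 | 85 | C | 2 | 2600 | 1580000 | 135 | E | 6 | 2300 | 1180000 | 185 | S | 10 | 3200 | 870000 |
| 36 | C | 7 | 2200 | 750000 | 86 | C | 2 | 2600 | 1280000 | 136 | E | 6 | 2300 | 1250000 | 186 | S | 10 | 3200 | 800000 |
| 37 | C | 7 | 2800 | 930000 | 87 | C | 2 | 3200 | 1650000 | 137 | E | 6 | 2300 | 1200000 | 187 | S | 10 | 3200 | 900000 |
| 38 | C | 7 | 2800 | 880000 | 88 | C | 2 | 3200 | 1650000 | 138 | E | 6 | 2800 | 1060000 | 188 | S | 9 | 3200 | 960000 |
| 39 | C | 7 | 2800 | 920000 | 89 | C | 2 | 3200 | 1650000 | 139 | E | 6 | 2800 | 1090000 | 189 | S | 9 | 3200 | 898000 |
| 40 | C | 7 | 2800 | 920000 | 90 | C | 2 | 3200 | 1650000 | 140 | E | 6 | 2800 | 1100000 | 190 | S | 9 | 3200 | 968000 |
| 41 | C | 7 | 2800 | 880000 | 91 | C | 2 | 3200 | 1980000 | 141 | E | 6 | 2800 | 1250000 | 191 | S | 9 | 3200 | 898000 |
| 42 | C | 6 | 2300 | 880000 | 92 | C | 2 | 3200 | 1440000 | 142 | E | 6 | 2800 | 1180000 | 192 | S | 9 | 3200 | 880000 |
| 43 | C | 6 | 2300 | 960000 | 93 | C | 1 | 3200 | 1880000 | 143 | E | 5 | 3200 | 1280000 | 193 | S | 8 | 3200 | 1280000 |
| 44 | C | 6 | 2300 | 960000 | 94 | C | 1 | 3200 | 2050000 | 144 | E | 5 | 2300 | 1150000 | 194 | S | 8 | 3200 | 1400000 |
| 45 | C | 6 | 2300 | 980000 | 95 | C | 1 | 3200 | 1800000 | 145 | E | 5 | 2300 | 1280000 | 195 | S | 7 | 3200 | 1400000 |
| 46 | C | 6 | 2300 | 885000 | 96 | C | 1 | 2600 | 1650000 | 146 | E | 5 | 2800 | 1300000 | 196 | S | 7 | 3200 | 1500000 |
| 47 | C | 6 | 2800 | 1050000 | 97 | E | 10 | 2200 | 560000 | 147 | E | 5 | 2800 | 1400000 | 197 | S | 6 | 3200 | 1520000 |
| 48 | C | 5 | 2400 | 960000 | 98 | E | 10 | 3200 | 580000 | 148 | E | 5 | 2800 | 1380000 | 198 | S | 6 | 3200 | 1580000 |
| 49 | C | 5 | 2400 | 1160000 | 99 | E | 10 | 3200 | 530000 | 149 | E | 5 | 3200 | 1480000 | 199 | S | 6 | 3200 | 1520000 |
| 50 | C | 5 | 2400 | 1020000 | 100 | E | 9 | 2200 | 630000 | 150 | E | 5 | 3200 | 1380000 | 200 | S | 5 | 3200 | 1780000 |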

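From the values of the attribute Price in the Benz secondhand car database shown in Table 2, we construct the initial membership functions shown in Figure 1.

We then apply the automatic clustering method proposed in the first year to the Benz secondhand car database shown in Table 2 in order to estimate null values. The final clustering result is shown in Table 3.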

[Figure 1. Linguistic membership functions of the Benz secondhand car prices. Horizontal axis: Price (10⁵ dollars), with tick marks at 3.5, 5, 8, 11, 14, 17, 20, 23, and 24; vertical axis: membership grade (maximum 1.0); linguistic terms: VVL, VL, L, M, H, VH, VVH.]

Table 3. The final clustering result

| Cluster | Tuples (by Car-ID) |
|---|---|
| Cluster 1 | 7, 8, 9, 10, 11, 12, 20, 21, 30, 68, 100, 101, 106, 107, 108 |
| Cluster 2 | 1, 2, 3, 4, 5, 6, 97, 98 |
| Cluster 3 | 17, 19, 24, 29, 32, 34, 35, 103, 109, 112, 164, 165, 180, 181, 183, 184, 186 |
| Cluster 4 | 13, 14, 15, 16, 22, 23, 25, 26, 27, 28, 36, 71, 102, 104, 105, 110, 111, 113, 156, 182 |
| Cluster 5 | 33, 38, 41, 42, 46, 70, 179, 185, 187, 189, 191, 192 |
| Cluster 6 | 47, 52, 53, 57, 63, 66, 67, 117, 118, 120, 127, 129, 131, 138, 139, 140, 153 |
| Cluster 7 | 49, 54, 59, 60, 61, 62, 64, 73, 76, 122, 124, 126, 134, 135, 137, 142, 144, 152, 158 |
| Cluster 8 | 43, 44, 45, 48, 51, 115, 130, 188, 190 |
| Cluster 9 | 58, 65, 75, 77, 78, 81, 86, 119, 121, 123, 125, 128, 133, 136, 141, 143, 145, 146, 193 |
| Cluster 10 | 82, 147, 148, 150, 162, 163, 194, 195 |
| Cluster 11 | 80, 83, 84, 149, 159, 160, 196, 197, 199 |
| Cluster 12 | 87, 88, 89, 90, 96, 161 |
| Cluster 13 | 155, 167, 170, 200 |
| Cluster 14 | 166, 169, 175 |
| Cluster 15 | 79, 93, 176 |
| Cluster 16 | 94, 171, 177 |
| Cluster 17 | 178 |
| Cluster 18 | 69 |
| Cluster 19 | 99 |
| Cluster 20 | 56, 72 |
| Cluster 21 | 18, 31 |
| Cluster 22 | 37, 39, 40, 114, 173, 174 |
| Cluster 23 | 55, 116, 132 |
| Cluster 24 | 50 |
| Cluster 25 | 74, 157 |
| Cluster 26 | 151 |
| Cluster 27 | 92 |
| Cluster 28 | 85, 154, 172, 198 |
| Cluster 29 | 168 |
| Cluster 30 | 95 |
| Cluster 31 | 91 |

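Next, we use the Benz secondhand car database shown in Table 2 as an example of estimating null values with the proposed method. First, the values of the attribute Style must be converted to numbers: we assign 1.00 to "C", 2.00 to "E", and 3.00 to "S". Because the Benz secondhand car database has three independent variables, Definition 1 gives

$$Y = \alpha + \beta_1 X + \beta_2 W + \beta_3 Z. \tag{13}$$

Likewise, by Definition 2, with the independent variables $X$, $W$, and $Z$ and the dependent variable $Y$, the coefficients $\beta_1$, $\beta_2$, and $\beta_3$ are computed as follows: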

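$$\beta_1 = \frac{\sum_i (X_i - X_{\mathrm{center}})(Y_i - Y_{\mathrm{center}})}{\sum_i (X_i - X_{\mathrm{center}})^2}, \tag{14}$$

$$\beta_2 = \frac{\sum_i (W_i - W_{\mathrm{center}})(Y_i - Y_{\mathrm{center}})}{\sum_i (W_i - W_{\mathrm{center}})^2}, \tag{15}$$

$$\beta_3 = \frac{\sum_i (Z_i - Z_{\mathrm{center}})(Y_i - Y_{\mathrm{center}})}{\sum_i (Z_i - Z_{\mathrm{center}})^2}, \tag{16}$$

where $X_i$, $Y_i$, $W_i$, and $Z_i$ are the $i$-th values of the attributes $X$, $Y$, $W$, and $Z$, and $X_{\mathrm{center}}$, $Y_{\mathrm{center}}$, $W_{\mathrm{center}}$, and $Z_{\mathrm{center}}$ are the center values of the attributes $X$, $Y$, $W$, and $Z$. Let the tuple with the null value be denoted by $(\mathit{sty}, \mathit{year}, \mathit{c.c.})$. The formula for estimating the null value of the attribute Price is then: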

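$$\begin{aligned}
\mathrm{Price}_{\mathrm{estimated}} = \mathrm{Price}_{i,\mathrm{center}} &+ Ws_i \, \beta_{i1} \, (\mathit{sty} - \mathrm{Style}_{i,\mathrm{center}}) \\
&+ Wy_i \, \beta_{i2} \, (\mathit{year} - \mathrm{Year}_{i,\mathrm{center}}) \\
&+ Wcc_i \, \beta_{i3} \, (\mathit{c.c.} - \mathrm{C.C.}_{i,\mathrm{center}}),
\end{aligned} \tag{17}$$

where $Ws_i$, $Wy_i$, and $Wcc_i$ are the weights obtained by normalizing $\beta_{i1}$, $\beta_{i2}$, and $\beta_{i3}$, respectively, and $Ws_i + Wy_i + Wcc_i = 1$.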

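The new method proposed in the second year of this project for estimating null values in the Benz secondhand car database shown in Table 2, in which the attributes Year and Price are negatively dependent, is as follows:

Step 1: Cluster the $n$ tuples whose Price is known on the values of their $m$ attributes with our automatic clustering algorithm. Assume that $k$ clusters $\mathrm{Cluster}_1, \mathrm{Cluster}_2, \ldots, \mathrm{Cluster}_k$ are produced:

$$\begin{aligned}
\mathrm{Cluster}_1 &= \{(\mathrm{Sty}_{1,1}, \mathrm{Year}_{1,1}, \mathrm{C.C.}_{1,1}, \mathrm{Price}_{1,1}), \ldots, (\mathrm{Sty}_{1,p_1}, \mathrm{Year}_{1,p_1}, \mathrm{C.C.}_{1,p_1}, \mathrm{Price}_{1,p_1})\},\\
\mathrm{Cluster}_2 &= \{(\mathrm{Sty}_{2,1}, \mathrm{Year}_{2,1}, \mathrm{C.C.}_{2,1}, \mathrm{Price}_{2,1}), \ldots, (\mathrm{Sty}_{2,p_2}, \mathrm{Year}_{2,p_2}, \mathrm{C.C.}_{2,p_2}, \mathrm{Price}_{2,p_2})\},\\
&\;\;\vdots\\
\mathrm{Cluster}_k &= \{(\mathrm{Sty}_{k,1}, \mathrm{Year}_{k,1}, \mathrm{C.C.}_{k,1}, \mathrm{Price}_{k,1}), \ldots, (\mathrm{Sty}_{k,p_k}, \mathrm{Year}_{k,p_k}, \mathrm{C.C.}_{k,p_k}, \mathrm{Price}_{k,p_k})\},
\end{aligned}$$

where $1 \le k \le n$, $(\mathrm{Sty}_{i,j}, \mathrm{Year}_{i,j}, \mathrm{C.C.}_{i,j}, \mathrm{Price}_{i,j})$ denotes the $j$-th element of the $i$-th cluster $\mathrm{Cluster}_i$, $p_i$ denotes the number of elements in $\mathrm{Cluster}_i$, and $1 \le i \le k$.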

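Step 2: Obtain the cluster center of each cluster. For the attribute Style, which is not included in the distance calculation, the cluster center is taken to be the mean value.

Step 3: According to Eqs. (14), (15), and (16), compute for each cluster the coefficient of determination $\beta_1$ of the attribute Style with respect to the attribute Price, the coefficient of determination $\beta_2$ of the attribute Year with respect to Price, and the coefficient of determination $\beta_3$ of the attribute C.C. with respect to Price. Then normalize them to obtain the weights $Ws_i$, $Wy_i$, and $Wcc_i$, as follows: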

$$Ws_i = \frac{\beta_1}{\beta_1 + \beta_2 + \beta_3}, \tag{18}$$

$$Wy_i = \frac{\beta_2}{\beta_1 + \beta_2 + \beta_3}, \tag{19}$$

$$Wcc_i = \frac{\beta_3}{\beta_1 + \beta_2 + \beta_3}. \tag{20}$$

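Step 4: Let $(\mathit{sty}, \mathit{year}, \mathit{c.c.})$ be the Style-Year-C.C. coordinates of the Car-ID tuple whose Price is null. Compute the Euclidean distance between $(\mathit{sty}, \mathit{year}, \mathit{c.c.})$ and the cluster center $(\mathrm{Sty}_{i\_\mathrm{center}}, \mathrm{Year}_{i\_\mathrm{center}}, \mathrm{C.C.}_{i\_\mathrm{center}})$ of each cluster $\mathrm{Cluster}_i$ to obtain the distance $\mathrm{Dist}_i$:

$$\mathrm{Dist}_i = \sqrt{(\mathrm{Sty}_{i\_\mathrm{center}} - \mathit{sty})^2 + (\mathrm{Year}_{i\_\mathrm{center}} - \mathit{year})^2 + (\mathrm{C.C.}_{i\_\mathrm{center}} - \mathit{c.c.})^2}, \tag{21}$$

where $1 \le i \le k$. We then assign the Car-ID tuple with the null value to the cluster whose center is nearest to $(\mathit{sty}, \mathit{year}, \mathit{c.c.})$; that is, we assume that this tuple is an element of that cluster. Let $\mathrm{Cluster}_i$, where $1 \le i \le k$, be the cluster nearest to $(\mathit{sty}, \mathit{year}, \mathit{c.c.})$.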

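Step 5: According to Eq. (17), use the tuples in the $i$-th cluster $\mathrm{Cluster}_i$ and their values of the attributes Style, Year, and C.C. to compute the estimate $\mathrm{Price}_{\mathrm{estimated}}$ of the attribute Price:

$$\begin{aligned}
\mathrm{Price}_{\mathrm{estimated}} = \mathrm{Price}_{i\_\mathrm{center}} &+ Ws_i \, \beta_{i1} \, (\mathit{sty} - \mathrm{Style}_{i\_\mathrm{center}}) \\
&+ Wy_i \, \beta_{i2} \, (\mathit{year} - \mathrm{Year}_{i\_\mathrm{center}}) \\
&+ Wcc_i \, \beta_{i3} \, (\mathit{c.c.} - \mathrm{C.C.}_{i\_\mathrm{center}}),
\end{aligned}$$

where $\mathrm{Price}_{i\_\mathrm{center}}$, $\mathrm{Style}_{i\_\mathrm{center}}$, $\mathrm{Year}_{i\_\mathrm{center}}$, and $\mathrm{C.C.}_{i\_\mathrm{center}}$ are the values of the attributes Price, Style, Year, and C.C. at the cluster center of the $i$-th cluster $\mathrm{Cluster}_i$, and $1 \le i \le k$.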
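The five steps for the car database can be sketched for any number of independent attributes. The code below is an illustration under stated assumptions: the clusters are supplied by the caller, the cluster center is approximated by the attribute means, and the weights are the coefficients divided by their plain sum. The function name `estimate_price` and the two toy clusters are hypothetical and are not rows of Table 2; only the Style coding (C = 1.00, E = 2.00, S = 3.00) comes from the report.

```python
import math

STYLE_CODE = {"C": 1.0, "E": 2.0, "S": 3.0}  # numeric coding stated in the report

def estimate_price(clusters, query):
    """clusters: list of (rows, center); rows and center are (sty, year, cc, price).

    query is the (sty, year, cc) tuple of the car whose Price is null.
    """
    # Step 4, Eq. (21): nearest cluster center (math.dist needs Python 3.8+).
    rows, center = min(clusters, key=lambda c: math.dist(c[1][:-1], query))
    price_c = center[-1]
    # Step 3, Eqs. (14)-(16): one coefficient per independent attribute.
    betas = []
    for j in range(len(query)):
        num = sum((r[j] - center[j]) * (r[-1] - price_c) for r in rows)
        den = sum((r[j] - center[j]) ** 2 for r in rows)
        # Assumption: a constant attribute gets a zero coefficient.
        betas.append(num / den if den else 0.0)
    # Eqs. (18)-(20); assumption: equal weights if the coefficients sum to zero.
    total = sum(betas)
    weights = [b / total for b in betas] if total else [1 / len(betas)] * len(betas)
    # Step 5, Eq. (17).
    return price_c + sum(
        w * b * (q - c) for w, b, q, c in zip(weights, betas, query, center)
    )

def with_center(rows):
    """Pair rows with a mean-based center (a stand-in for the real center)."""
    return rows, tuple(sum(col) / len(rows) for col in zip(*rows))

# Hypothetical E-style cars: price falls by 100000 per year of age.
cluster_a = with_center([
    (STYLE_CODE["E"], 3, 2800, 1500000),
    (STYLE_CODE["E"], 5, 2800, 1300000),
    (STYLE_CODE["E"], 7, 2800, 1100000),
])
cluster_b = with_center([
    (STYLE_CODE["C"], 9, 2200, 600000),
    (STYLE_CODE["C"], 9, 2200, 640000),
])
print(estimate_price([cluster_a, cluster_b], (STYLE_CODE["E"], 4, 2800)))  # 1400000.0
```

In the toy cluster only Year varies, so its coefficient is negative (-100000) and carries the whole weight: a car one year younger than the cluster center is estimated 100000 above the center price, which is the negative dependency the method is meant to capture.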

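Once the Price of a Car-ID whose Price is null has been estimated with the method above, its estimated error is computed with Eq. (12).

We use the proposed method to estimate the Price of each Car-ID shown in Table 2. Table 4 compares the average estimated errors of the methods. Table 4 shows that the method proposed in the second year of this project has a higher average estimation accuracy rate for estimating null values in relational database systems than the methods presented in [9] and [24].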

Table 4. Comparison of the average estimated errors of the methods

| Method | Average estimated error |
|---|---|
| Chen-and-Huang's method [9] | 9.93% (population size = 30) |
| Huang-and-Chen's method [24] | 8.76% (population size = 30) |
| The proposed method | 7.582% |

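4. Results and Discussion

This is a two-year research project that aims to propose new methods for estimating null values in relational database systems. In the first year, we proposed a new automatic clustering algorithm for clustering the data in a relational database system. The algorithm needs neither a predefined number of clusters nor presorted data, which makes clustering more flexible. In the second year, we propose a new method for automatically estimating null values in relational database systems based on that algorithm. The method first clusters the data in the database by similarity and then estimates the null values. Most existing clustering methods require the user to define the number of clusters in advance and require all of the data to be sorted; that is, they are sensitive to the order of the input data, so the same clustering algorithm may produce very different clustering results when the data are supplied in different orders. The estimation method proposed in the second year works from the clustering results of the first-year algorithm, so it only needs to process one of the resulting clusters rather than all of the data in the database system. We use an employee database to illustrate the method, which has a higher average estimation accuracy rate than the methods presented in [8], [11], and [21]. In the second year, we also propose a new method for estimating null values in relational databases whose attributes have negative dependency relationships. We use a Benz secondhand car database to illustrate this method, which has a higher average estimation accuracy rate than the methods presented in [9] and [24].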

5. Self-Evaluation of Project Results

This project has high value in both theory and practical application. Its research content matches the original proposal 100%, and it achieved 100% of the expected goals.

With the financial support of this project, the following journal papers and conference papers have been accepted or published. We gratefully acknowledge this support.

[1] S. T. Chang and S. M. Chen, “Estimating null values in relational database systems using automatic clustering and multiple regression techniques,” accepted and to appear in Expert Systems with Applications, 2009. (SCI and EI)

[2] S. M. Chen and S. T. Chang, “Estimating null values in relational database systems having negative dependency relationships between attributes,” accepted and to appear in Cybernetics and Systems, 2009. (SCI and EI)

[3] S. T. Chang and S. M. Chen, “A new approach for estimating null values in relational database systems using automatic clustering and multiple regression techniques,” Proceedings of the 11th Conference on Artificial Intelligence and Applications, Kaohsiung, Taiwan, Republic of China, pp. 1364-1371, 2006.

[4] S. M. Chen and S. T. Chang, “A new method to estimate null values in relational databases having negative dependency relationships between attributes,” Proceedings of 2008 Workshop on Consumer Electronics, Taipei County, Taiwan, Republic of China, 2008.

6. References

[1] A. Arslan and M. Kaya, “Determination of fuzzy logic membership functions using genetic algorithms,” Fuzzy Sets and Systems, vol. 118, no. 2, pp. 297-306, 2001.

[2] M. L. Bernson, D. M. Levine, and M. Goldstein, Intermediate Statistical Methods and Applications. New Jersey: Prentice-Hall, 1983.

[3] S. K. Bhatia and J. S. Deogun, “Conceptual clustering in information retrieval,” IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, vol. 28, no. 3, pp. 427-436, 1998.

[4] P. Bosc, D. Dubois, and H. Prade, “Fuzzy functional dependencies and redundancy elimination,” Journal of the American Society for Information Science, vol. 49, no. 3, pp. 217-235, 1998.

[5] P. Bosc and L. Lietard, “Functional dependencies and the design of relational database extended to imprecise data,” Proceedings of the Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU’98), Paris, France, pp. 412-419, 1998.

[6] D. G. Burkhardt and P. P. Bonissone, “Automated fuzzy knowledge base generation and tuning,” Proceedings of the 1992 IEEE International Conference on Fuzzy Systems, San Diego, California, pp. 179-188, 1992.

[7] M. S. Chen and S. W. Wang, “Fuzzy clustering analysis for optimizing fuzzy membership functions,” Fuzzy Sets and Systems, vol. 103, no. 2, pp. 239-254, 1999.

[8] S. M. Chen and H. H. Chen, “Estimating null values in the distributed relational databases environments,” Cybernetics and Systems, vol. 31, no. 8, pp. 851-871, 2000.

[9] S. M. Chen and C. M. Huang, “Generating weighted fuzzy rules from relational database systems for estimating null values using genetic algorithms,” IEEE Transactions on Fuzzy Systems, vol. 11, no. 4, pp. 495-506, 2003.

[10] S. M. Chen and S. W. Lee, “A new method to generate fuzzy rules from relational database systems for estimating null values,” Cybernetics and Systems, vol. 34, no. 1, pp. 33-57, 2003.

[11] S. M. Chen and M. S. Yeh, “Generation fuzzy rules from relational database systems for estimating null values,” Cybernetics and Systems, vol. 29, no. 6, pp. 363-376, 1998.

[12] S. M. Chen and Y. C. Chen, “Automated constructing membership functions and generating fuzzy rules using genetic algorithms,” Cybernetics and Systems, vol. 33, no. 8, pp. 841-862, 2002.

[13] S. M. Chen, S. H. Lee, and C. H. Lee, “A new method for generating fuzzy rules from numerical data for handling classification problems,” Applied Artificial Intelligence, vol. 15, no. 7, pp. 645-664, 2001.

[14] Y. C. Chen and S. M. Chen, “Constructing membership functions and generating fuzzy rules using genetic algorithms,” Proceedings of the 2001 Ninth National Conference on Fuzzy Theory and Its Applications, Chungli, Taoyuan, Taiwan, Republic of China, pp. 195-200, 2001.

[15] S. M. Chen and W. T. Jong, “Fuzzy query translation for relational database systems,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 27, no. 4, pp. 714-721, 1997.

[16] D. A. Chiang, L. R. Cow, and N. C. Hsien, “Fuzzy information in extended fuzzy relational databases,” Fuzzy Sets and Systems, vol. 92, no. 1, pp. 1-20, 1997.

[17] J. H. Chiang, “Support vector learning mechanism for fuzzy rule-based modeling: A new approach,” IEEE Transactions on Fuzzy Systems, vol. 12, no. 1, pp. 1-12, 2004.

[18] J. C. Cubero and M. A. Vila, “A new definition of fuzzy functional dependency in fuzzy relational databases,” International Journal of Intelligent Systems, vol. 9, no. 5, pp. 441-448, 1994.

[19] J. Grant, “Null values in a relational data base,” Information Processing Letters, vol. 6, no. 5, pp. 156-157, 1977.

[20] K. Honda and H. Ichihashi, “Linear fuzzy clustering techniques with missing values and their application to local principal component analysis,” IEEE Transactions on Fuzzy Systems, vol. 12, no. 2, pp. 183-193, 2004.

[21] S. M. Chen and H. R. Hsiao, “A new method to estimate null values in relational database systems based on automatic clustering techniques,” Information Sciences, vol. 69, no. 1-2, pp. 47-69, 2005.

[22] B. Gabrys and A. Bargiela, “General fuzzy min-max neural network for clustering and classification,” IEEE
