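Research and Development of Sensitive Data Protection Techniques for Data Mining

National Science Council, Executive Yuan: Research Project Report

Research and Development of Sensitive Data Protection Techniques for Data Mining: Research Report (Condensed Version)

Project type: Individual

Project number: NSC 97-2221-E-011-092-

Project period: August 1, ROC year 97 to July 31, ROC year 98

Executing institution: Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Principal investigator: 戴碧如 (Bi-Ru Dai)

Project participants (master's students serving as part-time assistants): 林柏佑, 姜弘霖, 林楊澤, 鍾至衡, 姜禮翔

Report attachments: Report on attending an international conference, and the published paper

Disclosure: This project involves patents or other intellectual property rights; the report becomes publicly searchable after 2 years

October 28, ROC year 98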

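National Science Council, Executive Yuan, Subsidized Research Project

■ Final report □ Interim progress report

Research and Development of Sensitive Data Protection Techniques for Data Mining

Project type: ■ Individual project □ Integrated project

Project number: NSC 97-2221-E-011-092-

Project period: August 1, ROC year 97 to July 31, ROC year 98

Principal investigator: 戴碧如 (Bi-Ru Dai)

Co-principal investigator:

Project participants: 林柏佑, 姜弘霖, 林楊澤, 姜禮翔, 鍾至衡

Report type (submitted according to the approved budget list): ■ Condensed report □ Full report

This report includes the following required attachments:

□ One report on an overseas business trip or study visit

□ One report on a business trip or study visit to mainland China

■ One report on attending an international academic conference, and one copy of the published paper

□ One overseas research report for an international cooperative research project

Disclosure: Except for industry-academia cooperative research projects, projects for upgrading industrial technology and cultivating talent, controlled projects, and the cases below, the report may be made publicly searchable immediately

■ Involves patents or other intellectual property rights; publicly searchable after □ one year □ two years

Executing institution:

July 31, ROC year 98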

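(1) Project Abstract (Chinese)

With the rapid advance and wide adoption of computer and network technologies, people generate and obtain data ever faster and more conveniently. Techniques for analyzing large volumes of data therefore receive more attention, which has driven research on many data mining techniques. Among them, frequent itemset mining can be applied in many domains and is commonly used to analyze transaction databases. As the techniques advance, however, the privacy issues they raise attract growing attention. Previous techniques for protecting sensitive information in transaction databases use a single minimum support or a privacy coefficient as the protection criterion, and try to balance "protecting confidential information" against "preserving non-confidential information". In practice, however, different itemsets may need different minimum supports to decide whether they are frequent, and under such flexible minimum support settings, methods designed around the properties of frequent itemsets under a single threshold no longer apply. In this project we therefore introduce multiple thresholds, consider the properties of frequent itemsets under multiple thresholds, and propose a new sanitization algorithm to protect confidential information. The protection of sensitive information then fits its own characteristics more closely, and the modified database retains more non-confidential information, that is, information loss is reduced. Finally, we study how to optimize the proposed method in terms of efficiency, scalability, and minimal impact on the database.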

(2) Project Abstract (English)

Frequent pattern mining is one of the popular research topics in data mining. With the advance of these techniques, privacy issues have attracted more and more attention in recent years. In this field, previous works hide sensitive information based on a uniform support threshold or a privacy disclosure parameter. The challenge is to balance information preservation against the protection of sensitive information. However, in practical applications, we probably need to apply different support thresholds to different itemsets to reflect their respective significance. In this project, we design new hiding strategies to hide sensitive patterns with multiple support thresholds. With flexible user-defined multiple support thresholds, the hiding result is expected to be closer to user requirements and real applications. In addition, after hiding sensitive patterns and rules, the revised dataset is expected to preserve most of the information in the original dataset. Furthermore, in this project we also try to extend our method to optimize efficiency, scalability, and the preservation of information while hiding sensitive patterns.

Table of Contents

1. Introduction

2. Research Objectives

3. Literature Review

4. Research Method

Sanitization Algorithm

Sensitive Pattern Table

Template Table

Choosing Strategy and Updating Process

5. Results and Discussion

References

Self-Evaluation of Project Results

Appendix 1. Published international conference paper:

Ya-Ping Kuo, Pai-Yu Lin and Bi-Ru Dai, "Hiding Frequent Patterns under Multiple Sensitive Thresholds," Proceedings of the 19th International Conference on Database and Expert Systems Applications (DEXA 2008), Turin, Italy, September 1-5, 2008. (Lecture Notes in Computer Science 5181 Springer 2008, ISBN 978-3-540-85653-5) (EI)

Appendix 2. Data sheet of research results available for promotion

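1. Introduction

In recent years, with the spread of the Internet and the maturing of database technology, data has become easier to generate and collect. Organizations and companies have each accumulated large amounts of data, and individual users have also become contributors of all kinds of data, so data volumes grow rapidly. Such unprocessed data, however, is usually not suitable for direct use in business decisions or in medical, stock market, environmental, and similar applications. Data owners therefore increasingly need techniques for further analysis and processing, which has drawn more attention to data mining [1]. Scholars have given many different definitions of data mining; it is also called knowledge discovery in databases, abbreviated KDD (Knowledge Discovery in Databases). In general, data mining refers to extracting suitable data from a large volume of data and then processing and mining it to find information that is unknown to experts or that is interesting, useful, or latent, as a reference for decision making.

As data mining techniques improve, we can discover more useful information from disorganized data, but we also face an issue that deserves attention: privacy. Information extracted from a database may be an organization's confidential information or may touch on personal privacy, so while searching for hidden useful information we must also ensure that confidential or sensitive information is not leaked. However, if confidential information is protected simply by deleting it from the database and releasing the modified database, it may well be possible to derive the confidential information from the non-confidential information by inference or mining. A data protection mechanism must therefore not only prevent confidential information from being obtained through data mining, but also ensure that it cannot be inferred from any of the non-confidential information. In addition, the purpose of data mining is to obtain useful information that supports decision making and other uses, so while protecting confidential information we must also preserve as much of the original information as possible and avoid producing additional spurious information, in order to maintain the value of the original database.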

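2. Research Objectives

In this project we focus on the problem of hiding sensitive itemsets in a transaction database. The motivation for and importance of hiding sensitive itemsets are explained in full in [2], as follows. Suppose most customers who buy milk also buy paper from the paper company Green Paper. If another paper company, Dedtrees Paper, mines this rule from a supermarket's database and launches a promotion, "if you buy Dedtrees Paper's paper and then buy milk, you get a discount of 50 dollars on the milk", then Green Paper's sales will fall because of this commercial strategy. For this reason Green Paper will not want to offer the supermarket a lower price. Dedtrees Paper, on the other hand, has already achieved its goal and will no longer be willing to offer the supermarket a lower price either. The supermarket then suffers serious losses, so sensitive data must be sanitized before the database is released.

Most previous sanitization algorithms use a single user-defined minimum support threshold and do not consider the following issues. First, applying a single support threshold to every itemset is not reasonable in real applications. For example, the support of expensive or newly launched products, such as computers, is inherently lower than that of general, common products such as water, but this does not mean the latter are more important than the former. If we lower the supports of all sensitive itemsets below the same minimum support threshold, some itemsets may be overprotected and others insufficiently protected. Furthermore, if the support threshold a competitor uses for mining is smaller than the one we used for hiding, the released database will disclose all the sensitive information. Conversely, if the threshold used for hiding is too small, the released database may lose too much information and become useless, so that subsequent mining cannot proceed. In addition, if general items and special items appear in the database with very similar frequencies, a competitor will suspect that some sensitive information has been hidden.

Therefore, considering both privacy protection and information preservation, it is important to give each itemset its own threshold. For these reasons, [3] proposes an algorithm that uses disclosure thresholds. It lowers the support of sensitive itemsets according to their distribution in the database, and uses a disclosure threshold to control directly the balance between privacy and information preservation. However, this method does not consider the characteristics of different sensitive itemsets in different applications or the individual needs of different users; it relies entirely on the distribution of the database to sanitize infrequent and frequent itemsets alike.

In this project we introduce multiple thresholds, consider the properties of frequent itemsets under multiple thresholds, and propose a new sanitization algorithm to protect confidential information. The protection of sensitive information then fits its own characteristics more closely, and the modified database retains more non-confidential information, that is, information loss is reduced.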

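3. Literature Review

Hiding frequent itemsets and association rules was first proposed in [4]. The authors proved that finding an optimal sanitization is an NP-hard problem, and proposed a heuristic that hides sensitive frequent itemsets by deleting items from transactions in the database. In recent years more and more researchers have turned to privacy issues. Later methods fall roughly into two strategies: data modification and data reconstruction.

Data modification: The idea of this group of algorithms is to alter the original database so that sensitive information cannot be mined from the new database. These algorithms select some items as victim items and remove them from, or insert them into, some transactions [5]. In [3], the authors present a method that uses a disclosure parameter in place of a support threshold to control directly the balance between privacy requirements and information preservation. Based on the disclosure parameter, the support of every sensitive itemset is reduced by the same proportion. The IGA algorithm [4] first groups the sensitive itemsets and then selects victim items so as to reduce side effects on the database. The paper [6] proposes the border-based concept to evaluate efficiently the impact of any modification on the database. With this greedy selection aimed at reducing side effects, the quality of the database and the related frequent itemsets can be well maintained. In [4], the authors propose an algorithm that resists the forward inference attack. It multiplies by the original database matrix and then performs the sanitization in a single pass, which gives better performance. Most recent research focuses on minimizing side effects on the database [7][8]. These methods limit the side effects by reducing the modifications to transactions and to items, respectively.

Data reconstruction: This class of algorithms is motivated by the observation that the earlier modification-based methods spend too much time scanning the database and cannot directly control the information in the released data. Data reconstruction algorithms therefore take an information-based approach: they directly rebuild a new database that contains the information the database owner wants to keep, and then release it. In general, the frequent itemsets in the original database are regarded as a kind of knowledge. The inverse frequent set mining problem was first proposed in [9] and was proved to be NP-hard. Most researchers have therefore applied this concept to privacy issues and as a basis for algorithms [10]. In [11], the authors propose a constraint inverse itemset lattice mining technique to automatically generate simple, releasable, and shareable datasets. They point out that if a feasible set of supports exists, a new database containing some frequent itemsets can be produced through a one-to-one mapping. In [12], the author proposes an FP-tree-based method for inverse frequent set mining, and the newly generated database fully satisfies all the given constraints. However, this method does not provide complete and sound hiding. It only keeps the support counts of non-sensitive itemsets identical to those in the original database, while the frequencies do not satisfy the original constraints. For this line of work, the most important open problem is how to find a feasible support set and map it to a suitable dataset.

In addition, the concept of multiple support thresholds was first proposed in [13], based on the observation that the supports of different itemsets are not all the same in practice. We use some existing schemes for multiple support thresholds as the baseline for evaluating our algorithm; they are introduced in the next chapter.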

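4. Research Method

Sanitization Algorithm

Figure 1: Framework of the sanitization algorithm

The framework of our sanitization algorithm is shown in Figure 1. It consists mainly of three components: the sensitive pattern table, the template table, and the action table. First, we scan the database to obtain the supports of the sensitive itemsets and their sensitive transactions, and then build a sensitive pattern table, based on the minimum thresholds, that stores the amount by which the support of each sensitive itemset must be reduced. Second, we generate the corresponding templates for each sensitive itemset; these templates cover all victim items that could be chosen to hide that itemset. We then select one template from the template table according to the hiding strategy of reducing side effects on the database. Next we find the corresponding transactions, enough of them to hide the sensitive itemsets covered by this template, and put the resulting records into the action table. The information in all components is then updated. The two steps of selecting and updating repeat until all sensitive itemsets are hidden. Finally, following the action table, we locate the corresponding transactions and remove the designated victim items to complete the sanitization. The whole algorithm needs to scan the database only twice.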

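Sensitive Pattern Table

Table 1: The sensitive pattern table (a) and the corresponding template table (b)

The sensitive pattern table has two columns: the sensitive itemset (SP) and the number of removals it requires (Count), as shown in Table 1(a). The Count of a sensitive itemset is the minimum amount by which its support must be reduced to make the itemset infrequent.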

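Template Table

The initial template table is built from the sensitive pattern table, as depicted in Table 1. A template has the form <TPID, victim, UCP, SPC, MC>, where TPID is the template's unique identifier and victim is the item chosen for removal from the corresponding transactions. For a sensitive itemset of length k, k items can serve as the victim, so we can generate k templates, one per item. For {1, 3, 4}, for example, templates are generated for the items {1}, {3}, and {4}. The UCP of a template must be contained in the transactions selected for deletion; it is also the union of all sensitive itemsets that this template can sanitize. This means that when we pick a sensitive transaction containing the UCP and delete the victim item from it, the support of every corresponding sensitive itemset decreases. For example, in Table 1(b), TP1 hides the sensitive itemset {3} by deleting {3} from transactions that contain {3}. TP3 hides the sensitive itemset {2, 3} by deleting {3} from transactions that contain {2, 3}. SPC is the number of sensitive itemsets that the template can sanitize. For example, TP3 in Table 1(b) can hide the two sensitive itemsets {2, 3} and {3} at once, so its SPC is 2. MC is the minimum amount by which support must be reduced to hide the sensitive itemsets corresponding to the template. The MC of a template is therefore the largest Count among all sensitive itemsets corresponding to the template. For example, in Table 1(b) the MC of TP3 is max{3, 5} = 5.

Beyond the templates introduced above, we also generate joint templates to cover more sensitive itemsets. If two templates have the same victim item and neither UCP contains the other, we can join them into a new template. The SPC and MC of the new template are computed from the corresponding sensitive itemsets. As shown in Table 1, TP3 and TP5 can be combined to produce TP7, whose UCP {1, 2, 3, 4} is the union of {2, 3} and {1, 3, 4}. The SPC of TP7 is 3, because removing the item {3} from transactions that contain {1, 2, 3, 4} lowers the support of three sensitive itemsets at once: {3}, {2, 3}, and {1, 3, 4}. The MC of TP7 is max{3, 5, 6} = 6. TP7 is therefore a better choice than TP3 and TP5, because its SPC is larger than theirs and it can hide more sensitive itemsets at once.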
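The template construction can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_templates`, the dictionary layout, and the restriction to a single round of pairwise joins are assumptions made for illustration. The input is the example sensitive pattern table ({3}: 3, {2, 3}: 5, {1, 3, 4}: 6).

```python
from itertools import combinations

def build_templates(counts):
    """counts maps each sensitive itemset (a frozenset) to its Count."""
    # One base template per (victim, sensitive itemset) pair.
    base = {(victim, sp) for sp in counts for victim in sp}
    # Join two templates that share a victim when neither UCP contains the other
    # (assumption: only one round of pairwise joins is performed here).
    joint = set()
    for (v1, u1), (v2, u2) in combinations(base, 2):
        if v1 == v2 and not (u1 <= u2 or u2 <= u1):
            joint.add((v1, u1 | u2))
    templates = []
    for victim, ucp in base | joint:
        # Sensitive itemsets whose support drops when the victim is removed
        # from a transaction that contains the UCP.
        covered = [sp for sp in counts if victim in sp and sp <= ucp]
        templates.append({
            "victim": victim,
            "ucp": ucp,
            "spc": len(covered),
            "mc": max(counts[sp] for sp in covered),
        })
    return templates

counts = {frozenset({3}): 3, frozenset({2, 3}): 5, frozenset({1, 3, 4}): 6}
templates = build_templates(counts)
```

On this input the sketch yields seven templates, including the joint one with victim 3 and UCP {1, 2, 3, 4}, which matches the worked example in the text.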

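Choosing Strategy and Updating Process

The essence of our hiding strategy is "minimizing side effects". In each round we choose the template with the largest SPC. If more than one template has the same SPC, we choose the one with the smallest MC. If more than one still has the same MC, we choose the template whose victim item has the lowest support in the whole database. Finally, if none of the above resolves the tie, we pick at random. After choosing a template, we use the transaction index built at the start to find all corresponding sensitive transactions. If the number of sensitive transactions found is larger than the MC of the chosen template, we move the MC shortest transactions to the action table for sanitization; otherwise all corresponding transactions are sanitized. Then, according to the number of transactions sanitized, Count is updated and MC is recomputed. If some sensitive itemsets have been hidden by this template, SPC and UCP also change. When the SPC of a template becomes zero, we remove the template from the template table.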
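The tie-breaking order (largest SPC, then smallest MC, then lowest victim support, then random) can be sketched as a single sort key. This is an illustrative sketch only: `choose_template`, the dictionary fields, and the `item_support` values are hypothetical, and the update of Count, MC, SPC, and UCP after each round is not shown.

```python
import random

def choose_template(templates, item_support, rng=random):
    """Pick a template by: max SPC, then min MC, then min victim support, then random."""
    def key(t):
        # Negating SPC lets min() prefer the largest SPC.
        return (-t["spc"], t["mc"], item_support[t["victim"]])
    best = min(key(t) for t in templates)
    tied = [t for t in templates if key(t) == best]
    return rng.choice(tied)

# Hypothetical item supports, for illustration only.
item_support = {1: 40, 2: 30, 3: 50, 4: 20}
example = [
    {"victim": 3, "ucp": frozenset({2, 3}), "spc": 2, "mc": 5},
    {"victim": 3, "ucp": frozenset({1, 3, 4}), "spc": 2, "mc": 6},
    {"victim": 3, "ucp": frozenset({1, 2, 3, 4}), "spc": 3, "mc": 6},
]
chosen = choose_template(example, item_support)
```

Here the joint template with SPC 3 wins outright; without it, the template with UCP {2, 3} would be chosen because its MC is smaller.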

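To improve the performance of the algorithm, we propose a revised border itemsets concept to cut unnecessary work. Because of the monotonic property of frequent itemsets, hiding a sensitive itemset also hides its supersets, so during sanitization we only need to hide the sensitive itemsets that have no sensitive subsets. Such an itemset is called a border itemset. For example, if {1, 2, 3}, {1, 3}, and {1} are three sensitive itemsets, {1} is the border itemset. Once the border itemsets among the sensitive itemsets are hidden, the information of all sensitive itemsets is protected. However, this technique cannot be applied as is when hiding sensitive itemsets under multiple minimum thresholds: when the threshold set for a superset of a sensitive itemset is smaller than the itemset's own threshold, hiding only the border itemsets does not guarantee that all sensitive frequent itemsets are protected. In the example above, if the minimum thresholds for {1, 2, 3}, {1, 3}, and {1} are 2, 4, and 3 respectively, then {1} and {1, 2, 3} are the border itemsets that need to be modified. If these border itemsets are hidden, all sensitive information is protected.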
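One way to read the revised border rule is sketched below. The rule itself is an assumption inferred from the example, not a statement of the authors' exact procedure: a sensitive itemset is treated as redundant only when some sensitive proper subset has a threshold no larger than its own, because hiding that subset then forces the superset below its threshold too. The name `revised_border` is hypothetical.

```python
def revised_border(thresholds):
    """thresholds maps each sensitive itemset (a frozenset) to its sensitive threshold."""
    border = []
    for x, st_x in thresholds.items():
        # x is implied if a sensitive proper subset y has st(y) <= st(x):
        # sup(x) <= sup(y) < st(y) <= st(x) once y is hidden.
        implied = any(y < x and st_y <= st_x for y, st_y in thresholds.items())
        if not implied:
            border.append(x)
    return border

multi = {frozenset({1, 2, 3}): 2, frozenset({1, 3}): 4, frozenset({1}): 3}
uniform = {frozenset({1, 2, 3}): 3, frozenset({1, 3}): 3, frozenset({1}): 3}
border_multi = revised_border(multi)
border_uniform = revised_border(uniform)
```

With the thresholds 2, 4, and 3 from the example, the sketch keeps {1} and {1, 2, 3}; with a uniform threshold it keeps only {1}.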

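5. Results and Discussion

In this part we demonstrate the performance and scalability of the proposed sanitization algorithm. We use the support constraint [14] and the maximum constraint [15] to assign a different minimum threshold to each sensitive itemset. In addition, to compare with the Item Grouping Algorithm (IGA) [16], we let our algorithm use the same disclosure thresholds as IGA to show the performance under a single minimum threshold. These constraints are explained below.

Disclosure threshold: This setting lets our method be compared with IGA under a single support threshold. The minimum threshold is therefore set as follows, the same as for IGA:

$$st(X) = sup(X) \times \alpha$$

where α is set to the same value as the disclosure threshold used by IGA.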

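Support constraint: Similar to what [14] proposes, we first partition the range of supports in the database into several intervals (bins), each holding the same number of items, so that bin B_i contains all the items in the i-th interval. The support constraints in this scheme are then generated from all possible combinations of bins. The support threshold θ_k of a support constraint SC_k(B_1, ..., B_r) ≤ θ_k is defined as:

$$\theta_k = \min\{\, r^{k-1} \times S(B_1) \times \cdots \times S(B_r),\ 1 \,\}$$

where S(B_i) is the smallest support of the items in B_i, and r is a positive integer greater than 1. A larger r slows the rate at which S(B_1) × ... × S(B_r) decreases, and we can vary r to generate different support constraints.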

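Maximum constraint: We use the same formula as [17] to assign a different support threshold to each item.

$$st(X) = \begin{cases} sup(X) \times \rho, & \text{if } sup(X) \times \rho > minsup \\ minsup, & \text{otherwise} \end{cases}$$

where 0 ≤ ρ ≤ 1 and sup(i) denotes the support of item i in the dataset. If ρ is set to zero, all items have the same minimum support threshold, and this case is the same as the single minimum threshold case.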
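The two simpler threshold rules, the disclosure threshold st(X) = sup(X) × α and the maximum constraint, can be written as small functions. This is a sketch for illustration; the function names are hypothetical, and supports are treated as plain numbers (either counts or frequencies, as long as minsup uses the same unit).

```python
def disclosure_threshold(sup_x, alpha):
    """Sensitive threshold under a disclosure threshold alpha: st(X) = sup(X) * alpha."""
    return sup_x * alpha

def maximum_constraint_threshold(sup_x, rho, minsup):
    """Sensitive threshold under the maximum constraint.

    Returns sup(X) * rho when that exceeds minsup, and minsup otherwise.
    """
    scaled = sup_x * rho
    return scaled if scaled > minsup else minsup
```

For instance, with sup(X) = 100 and minsup = 20, ρ = 0.5 gives a threshold of 50, while ρ = 0.1 or ρ = 0 falls back to minsup = 20, which is the single-threshold case described in the text.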

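We use two datasets with different characteristics, accidents [18] and kosarak, and compare our method with IGA using the disclosure threshold setting. In addition, considering the time complexity of mining itemsets under multiple thresholds, and without loss of generality, two smaller real datasets, chess and mushroom [19], are also used to evaluate our hiding method.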

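We adopt two metrics, information loss (IL) and hiding failure (HF), to evaluate the performance of our deletion strategy. Information loss (IL) is the percentage of non-sensitive itemsets that are hidden during sanitization. The two metrics are given by:

$$IL = \frac{(|FP| - |P_s|) - (|FP'| - |P'_s|)}{|FP| - |P_s|}$$

$$HF = \frac{|P'_s|}{|P_s|}$$

Here the primed sets FP' and P'_s are the counterparts of FP and P_s obtained from the sanitized database. Under a single minimum threshold, P_s contains every superset of any sensitive itemset. Under multiple minimum thresholds, however, a subset of a frequent itemset is not necessarily frequent.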
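The two metrics can be computed directly from the pattern sets. The sketch below assumes the reading IL = ((|FP| − |P_s|) − (|FP'| − |P'_s|)) / (|FP| − |P_s|) and HF = |P'_s| / |P_s|, with primed sets taken from the sanitized database; the function names and the toy sets are hypothetical.

```python
def information_loss(fp, ps, fp_new, ps_new):
    """Fraction of non-sensitive frequent patterns lost by sanitization."""
    before = len(fp) - len(ps)          # non-sensitive patterns before hiding
    after = len(fp_new) - len(ps_new)   # non-sensitive patterns after hiding
    return (before - after) / before

def hiding_failure(ps, ps_new):
    """Fraction of sensitive patterns that can still be mined after sanitization."""
    return len(ps_new) / len(ps)

# Toy pattern identifiers, for illustration only.
fp = set(range(10))        # 10 frequent patterns in the original database
ps = {0, 1}                # 2 of them are sensitive
fp_new = {2, 3, 4, 5, 6, 7}  # patterns still frequent after sanitization
ps_new = set()             # no sensitive pattern remains frequent
il = information_loss(fp, ps, fp_new, ps_new)
hf = hiding_failure(ps, ps_new)
```

In this toy case 2 of the 8 non-sensitive patterns are lost, so IL is 0.25, and HF is 0 because no sensitive pattern survives.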

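First, under a single minimum threshold, we evaluate the performance of our algorithm and of IGA. We set the disclosure threshold to 0.2 and 0.002 to hide the sensitive itemsets in the accidents and kosarak datasets, respectively. The sensitive itemsets are picked at random from the frequent itemsets, with supports above 20% and 0.2% respectively. We then mine frequent itemsets from the new databases with support thresholds from 10% to 90% and from 0.1% to 0.28%, respectively. The information loss results are shown in Figure 2. The trend in information loss depends strongly on the characteristics of the dataset under test. We observe that our method achieves better data protection. Most hiding failures are zero, except that when the minimum support threshold for accidents is set to 10%, the hiding failure of IGA is 0.0057% and that of our method is 0.325%.

Figure 2: Information loss versus disclosure threshold

We compare execution time and scalability with IGA on the same database and the same sensitive itemsets. The disclosure threshold parameter α of both methods is set to zero, which means complete hiding. We hide 6 mutually exclusive sensitive itemsets of lengths 2 to 6, while varying the dataset size from 100,000 to 900,000 transactions. The results are shown in Figure 3(a). We then vary the number of sensitive itemsets to hide from 1 to 10, with the itemsets picked at random. The results are shown in Figure 3(b). We observe that execution time is linear in both the number of transactions in the database and the number of sensitive itemsets. Our method thus achieves good scalability, as IGA does, while providing better information protection and the ability to hide sensitive itemsets under multiple minimum thresholds. The information loss results are shown in Figures 4(a) and 4(b). Because the sensitive itemsets selected from the chess dataset have higher supports, hiding an itemset requires deleting more items, which raises the information loss. All hiding failures are zero. We observe that information loss rises as the r value of the maximum constraint decreases and as the number of sensitive itemsets to hide increases.

Figure 3: Comparison with IGA on the kosarak dataset

Figure 4: Information loss under the maximum constraint

We introduced the concept of hiding frequent itemsets under multiple minimum thresholds and proposed a new hiding algorithm for that setting, which is also applicable to practical applications. We verified the performance and scalability of our method through a series of experiments. The experimental results show that our method is effective and improves clearly on IGA under a single support threshold. In addition, we can hide sensitive information in a database under multiple minimum thresholds.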

References

[1] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006

[2] Clifton, C., Marks, D.: Security and Privacy Implication of Data Mining. In: ACM SIGMOD Workshop on Data Mining and Knowledge Discovery, pp. 15–19 (1996)

[3] Oliveira, S.R.M., Zaïane, O.R.: A Unified Framework for Protecting Sensitive Association Rules in Business Collaboration. Int. J. of Business Intelligence and Data Mining 1(3), 247–287 (2006)

[4] Atallah, M., Bertino, E., Elmagarmid, A., Ibrahim, M., Verykios, V.: Disclosure Limitation of Sensitive Rules. In: Proc. of the IEEE Knowledge and Data Exchange Workshop, pp. 45–52 (1999)

[5] Verykios, V.S., Elmagarmid, A., Bertino, E., Saygin, Y., Dasseni, E.: Association Rule Hiding. IEEE Transactions on Knowledge and Data Engineering 16(4), 434–447 (2004)

[6] Xingzhi, S., Yu, P.S.: A Border-Based Approach for Hiding Sensitive Frequent Itemsets. In: Proc. of 5th IEEE Int. Conf. on Data Mining, pp. 426–433 (2005)

[7] Wu, Y.H., Chiang, C.M., Chen, A.L.P.: Hiding Sensitive Association Rules with Limited Side Effects. IEEE Transactions on Knowledge and Data Engineering 19(1), 29–42 (2007)

[8] Gkoulalas-Divanis, A., Verykios, V.S.: An Integer Programming Approach for Frequent Itemset Hiding. In: Proc. of Int. Conf. on Information and Knowledge Management, pp. 748–757 (2006)

[9] Mielikainen, T.: On Inverse Frequent Set Mining. Proc. of the 2nd IEEE ICDM Workshop on Privacy Preserving Data Mining (2003)

[10] Wu, X., Wu, Y., Wang, Y., Li, Y.: Privacy-Aware Market Basket Data Set Generation: A Feasible Approach for Inverse Frequent Set Mining. Proc. 5th SIAM Int. Conf. on Data Mining (2005)

[11] Chen, X., Orlowska, M., Li, X.: A New Framework of Privacy Preserving Data Sharing. Proc. of IEEE 4th Int. Workshop on Privacy and Security Aspects of Data Mining (2004) 47–56

[12] Guo, Y.: Reconstruction-Based Association Rule Hiding. Proc. of SIGMOD 2007 Ph.D. Workshop on Innovative Database Research (2007)

[13] Liu, B., Hsu, W., Ma, Y.: Mining Association Rules with Multiple Minimum Supports. Proc. of the 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (1999) 337–341

[14] Wang, K., He, Y., Han, J.: Pushing Support Constraints into Association Rules Mining. IEEE Transactions on Knowledge and Data Engineering 15(3) (2003) 642–658

[15] Lee, Y.C., Hong, T.P., Lin, W.Y.: Mining Association Rules with Multiple Minimum Supports Using Maximum Constraints. Int. Journal of Approximate Reasoning on Data Mining and Granular Computing 40(1–2) (2005) 44–54

[16] Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Dept. of Inf. and CS. (1998)

[17] Liu, B., Hsu, W., Ma, Y.: Mining Association Rules with Multiple Minimum Supports. Proc. of the 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (1999) 337–341

[18] Geurts, K., Wets, G., Brijs, T., Vanhoof, K.: Profiling High-Frequency Accident Locations Using Association Rules. Proc. of the 82nd Annual Transportation Research Board (2003) 18

[19] Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Dept. of Inf. and CS. (1998)

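Self-Evaluation of Project Results

The project's progress is broadly consistent with the first-year schedule in the original proposal. We designed a new sanitization algorithm for multiple support thresholds to protect confidential frequent itemsets. The protection of sensitive information then fits its own characteristics more closely, and the modified database retains more non-confidential information, that is, information loss is reduced. Most previous sanitization algorithms for sensitive itemsets lower all sensitive itemsets below a single threshold, which is not suitable for practical applications. Our sanitization algorithm, by contrast, can lower the support of each sensitive itemset, according to its distribution in the database, below its own specified threshold. The result can be implemented on databases and applied when a database is released, to produce a privacy-protected database. Part of the results has been published at an international conference: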

Ya-Ping Kuo, Pai-Yu Lin and Bi-Ru Dai, "Hiding Frequent Patterns under Multiple Sensitive Thresholds," Proceedings of the 19th International Conference on Database and Expert Systems Applications (DEXA 2008), Turin, Italy, September 1-5, 2008.

(Lecture Notes in Computer Science 5181 Springer 2008, ISBN 978-3-540-85653-5) (EI)


Appendix 1

Hiding Frequent Patterns under Multiple Sensitive Thresholds

Ya-Ping Kuo, Pai-Yu Lin, and Bi-Ru Dai

Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan. R.O.C.

{m9515063,m9615082}@mail.ntust.edu.tw, [email protected]

Abstract. Frequent pattern mining is a popular topic in data mining. With the advance of this technique, privacy issues have attracted more and more attention in recent years. In this field, previous works hide sensitive information based on a uniform support threshold or a disclosure threshold. However, in practical applications, we probably need to apply different support thresholds to different itemsets to reflect their significance. In this paper, we propose a new hiding strategy to protect sensitive frequent patterns with multiple sensitive thresholds. Based on different sensitive thresholds, the sanitized dataset can better fulfill user requirements in real applications while preserving more information of the original dataset. Empirical studies show that our approach protects sensitive knowledge well not only under multiple thresholds but also under a uniform threshold. Moreover, the quality of the sanitized dataset can be maintained.

Keywords: privacy, frequent pattern hiding, multiple threshold, sensitive knowledge, security, data sanitization.

1 Introduction

Frequent pattern and association rule mining play important roles in data mining [1]. With this technique, we can discover interesting but hidden information from a database. It has been applied to many domains, such as market basket analysis, medical management, stock, environment, and business, and brings great advantages. However, most database owners are unwilling to supply their datasets for analysis, since some sensitive information or private commercial strategies are at risk of being disclosed by the mining result. Therefore, although this technique provides many benefits, it poses new threats to privacy and security. For this reason, the database should be processed before release so that it retains as much of the original non-sensitive knowledge and as little of the sensitive information as possible.

Intuitively, the database owner can permit only partial access to the dataset for analysis, or directly remove all sensitive information from the mining result. However, the adversary may still be able to infer sensitive itemsets or high-level items from non-sensitive patterns or low-level items. For example, suppose that {1} is the sensitive pattern and the set of all frequent patterns is {{1}, {1, 2}, {2, 3}}. If we directly remove {1} and release {{1, 2}, {2, 3}}, the adversary may still be able to infer that {1} is frequent. This is because of the monotonic property of frequent patterns: all non-empty subsets of a frequent pattern must be frequent. Hence the challenge is how to protect sensitive information from inference attacks.
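The inference in this example can be made concrete with a short sketch: by the monotonic property, every non-empty subset of a released frequent pattern is itself frequent, so an adversary can enumerate those subsets. The function name `inferable_patterns` is hypothetical and the code is only an illustration of the attack under a uniform threshold.

```python
from itertools import combinations

def inferable_patterns(released):
    """All non-empty subsets of the released frequent patterns."""
    inferred = set()
    for pattern in released:
        items = sorted(pattern)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                inferred.add(frozenset(subset))
    return inferred

# {1} was removed from the released result, yet it is still inferable from {1, 2}.
released = [frozenset({1, 2}), frozenset({2, 3})]
inferred = inferable_patterns(released)
```

The result contains {1}, so deleting the sensitive pattern from the mining result alone does not protect it.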

1.1 Motivations

In this paper, we focus on the problem of hiding sensitive frequent itemsets from a transaction database. The motivation and the importance of hiding sensitive itemsets have been well explained in [2], as stated below. Suppose that most people who purchase milk usually also purchase Green paper. If the Dedtrees paper company mines this rule from the database of a supermarket and issues a coupon, "if you buy the Dedtrees paper, you will get 50 cents off one milk," then the sales of Green paper will be reduced by this commercial strategy. For this reason, the Green paper company would not like to provide a lower price to the supermarket. On the other hand, the Dedtrees paper company has already achieved its goal and is no longer willing to provide a lower price to the supermarket. The supermarket will then suffer serious losses. Hence the database should be sanitized of such sensitive information before release.

Most previous sanitization algorithms use only one user-predefined support threshold without considering the following issues. First, using a uniform support threshold for different patterns is not always reasonable in real life. For instance, the supports of high-priced or newly released products, such as computers, are intrinsically lower than those of general or common products, such as water, but this does not imply that the latter are more significant than the former. If we decrease the supports of all sensitive itemsets below the same threshold, some itemsets may be overprotected and some may not be protected sufficiently. Furthermore, if the support threshold used by the adversary in mining is smaller than the one used in hiding, the released database will disclose all sensitive information. Conversely, if the support threshold used for hiding is too small, the released database may lose too much information and become useless for subsequent mining. In addition, if the general items and the particular items have similar frequencies in the database, the adversary will infer that some sensitive knowledge has been hidden. Therefore, considering both privacy protection and information preservation, it is important to assign each itemset a particular threshold. For the above reasons, an algorithm using a disclosure threshold has been proposed [17]. It decreases the supports of sensitive patterns according to their distribution in the database and uses a disclosure threshold to control directly the balance between privacy and knowledge discovery. However, the method does not consider the characteristics of different sensitive itemsets in different applications or the personalized requirements of different users. It relies entirely on the distribution of the database and applies the same degree of sanitization to infrequent sensitive patterns and to more frequent sensitive patterns.


1.2 Contributions

In this paper, we propose a new strategy, which combines the sanitization algorithm with the concept of multiple support thresholds [14], to solve the problem described above. Before sanitizing, the database owner can specify a support threshold, called the sensitive threshold, for each sensitive pattern based on his or her domain knowledge. Our algorithm then decreases the support of each sensitive pattern below its own sensitive threshold. Under multiple thresholds, the database owner can directly decide the degree of sanitization for different patterns, so the protected database can better satisfy the owner's demands. Consequently, the proposed strategy is able to reduce the probability of a privacy breach and preserve as much information as possible.

The main contributions of this paper are as follows: (1) a new hiding strategy with multiple sensitive thresholds, which is more applicable in reality, is suggested; (2) the proposed algorithm can achieve better privacy protection and information preservation; (3) new metrics are presented to measure the performance of hiding frequent patterns under multiple sensitive thresholds, because the multi-threshold and uniform-threshold settings differ even when the user-specified sets of patterns to hide are the same.

The rest of this paper is organized as follows. The preliminary knowledge is stated in Section 2. In Section 3, we introduce our sanitization framework, the whole sanitization process, and some techniques that improve performance and efficiency. Some related hiding algorithms are reviewed in Section 4. The new metrics, experimental results, and discussion are presented in Section 5. Finally, Section 6 presents our conclusions.

2 Preliminaries

Before presenting our hiding strategy and framework, we introduce the preliminaries of frequent patterns, the transaction database, and the related concepts of privacy and multiple thresholds briefly.

Frequent pattern and transaction database. Let I = {1, . . . , n} be a non-empty set of items. Each non-empty subset X ⊆ I is called a pattern or an itemset. A transaction is a pair (Ti, t) of an itemset t ⊆ I and a unique identifier Ti, called the transaction identifier or TID. A transaction database D = {T1, . . . , TN} is a set of transactions, and its size is |D| = N. We assume that the itemsets and the transactions are ordered lexicographically. A transaction t supports X if X ⊆ t. Given a database D, the support of an itemset X, denoted sup(X), is the number of transactions in D that support X. The frequency of X is sup(X)/|D|. An itemset X is said to be frequent if sup(X)/|D| is larger than the user-predefined minimum support, denoted minsup, with 0 ≤ minsup ≤ 1.
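These definitions translate directly into code. The sketch below is illustrative: the dictionary representation of D (TID mapped to an item set), the function names, and the toy transactions are assumptions, and the frequency test uses the strict "larger than" comparison stated above.

```python
def support(itemset, database):
    """Number of transactions t in D with itemset ⊆ t."""
    return sum(1 for items in database.values() if itemset <= items)

def is_frequent(itemset, database, minsup):
    """True if sup(X)/|D| is larger than minsup (a fraction between 0 and 1)."""
    return support(itemset, database) / len(database) > minsup

# Toy database: TID -> set of items (hypothetical data).
database = {
    "T1": {1, 2, 3},
    "T2": {1, 3},
    "T3": {2, 3},
    "T4": {1, 3, 4},
}
sup_13 = support({1, 3}, database)
```

Here sup({1, 3}) is 3, so {1, 3} has frequency 0.75 and is frequent for minsup = 0.5 but not for minsup = 0.75.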

Sensitive itemset, sensitive threshold, and sensitive transaction. Let D be a transaction database, FP be the set of frequent patterns, and {sp1, . . . , spi} ∈ SP be a set of patterns that need to be hidden based on some security requirements. The set of frequent patterns from which any pattern in SP can be inferred, denoted Ps, is said to be sensitive. ∼Ps is the set of non-sensitive frequent patterns, such that Ps ∪ ∼Ps = FP. A transaction that supports any sensitive itemset is said to be sensitive, denoted Ts, and the set of sensitive transactions of X is denoted Ts(X). The support threshold used for hiding is called the sensitive threshold. The sensitive threshold of a sensitive pattern X is denoted st(X).

Notation              Meaning
I                     The set of all items
X                     An itemset, X ⊆ I
(Ti, t)               The transaction itemset t with its TID Ti
sup(X)                The support of X
minsup                The user-predefined minimum support threshold
sp1, . . . , spi ∈ SP   The set of patterns that need to be hidden
Ps                    The set of sensitive patterns which can infer any patterns in SP
Ts(X)                 The set of sensitive transactions of X
st(X)                 The sensitive threshold of X
TPk                   The unique identifier of a template
SPC                   The number of sensitive patterns covered by a template
MC                    The minimal count of the transactions that need to be modified

Table 1: The summarization of notations used in the paper

3 The Template-Based Sanitization Process

The main goal of this work is to hide sensitive information in a database so that the frequent itemset mining result of the newly released dataset will not disclose any sensitive patterns. The challenge is to find a balanced solution between the privacy requirement and information preservation. We suggest assigning a different sensitive threshold to each sensitive itemset to minimize the side effects on the dataset. Formally, the problem is stated as follows:

Frequent pattern hiding with multiple sensitive thresholds. Given a database D and the set of patterns to be hidden, {sp1, . . . , spi} ∈ SP, with their sensitive thresholds st(sp1), . . . , st(spi), the problem is how to transform D into D′ such that Ps cannot be mined from D′ while ∼Ps is still contained in D′. Finally, D′ can be released without violating the privacy concern.

In our sanitization process, we remove some items of each sensitive itemset from its corresponding sensitive transactions. Since a uniform sensitive threshold is usually not suitable for real cases, we apply the concept of multiple sensitive thresholds for hiding sensitive itemsets, so that itemsets with higher occurrences in reality can preserve more information and itemsets with lower occurrences in reality can reach better protection. Note that we do not focus on the determination of sensitive thresholds, since they largely depend on applications and user requirements. Database owners can decide the sensitive thresholds based on existing schemes [8][14][18] or any preferred settings.

[Figure: components Original Database, Transaction Index (inverted file), Sensitive Pattern, Sensitive Threshold, Template Table, Action Table, New Database; operations Mining, Calculate, Build/Update, Retrieve, Modify & Output]

Fig. 1: The sanitization framework

In this section, our framework and the whole sanitization process of hiding frequent patterns are presented. We propose a template-based framework which is similar to [5] but uses a different strategy for choosing an optimal hiding action with minimal side effects. The proposed method hides a sensitive itemset by decreasing its support. We apply templates to evaluate the impact of choosing different items as victims and of different hiding orders of the sensitive itemsets.

To reach minimal side effects, we would like to choose the optimal modification, that is, a template which hides the most sensitive itemsets while sanitizing the fewest sensitive transactions. In addition, to improve efficiency, we suggest a revised border-based method to reduce redundant hiding work, and we rely on the inverted file and pattern index to speed up the update of each component in our sanitization process. The notations used in this paper are summarized in Table 1.

3.1 The Sanitization Framework

The framework of our sanitization process is illustrated in Fig. 1. It mainly consists of three components: the sensitive pattern table, the template table, and the action table. First, the database is scanned to find the supports and sensitive transactions of all sensitive itemsets, and then the sensitive pattern table is built; it stores, for each sensitive itemset, the amount by which its support should be decreased based on its sensitive threshold. Second, we generate the corresponding templates for each sensitive itemset; they contain all possible choices of victim items for hiding this itemset. Next, a template is selected from the template table according to the hiding strategy of minimizing side effects on the original database. We then find enough corresponding sensitive transactions to hide all sensitive patterns covered by this template, and put all pairs (victim item, TID) into the action table. Then the information of all components is updated. The choosing and updating process repeats until all sensitive itemsets are hidden. Finally, we remove each victim item from its paired transaction in the action table. Note that the whole sanitization framework only needs to scan the database twice.
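The final step, removing each victim item from its paired transaction, can be sketched as follows. The representation of the action table as a list of (victim item, TID) pairs follows the text; the function name `apply_actions`, the dictionary layout of the database, and the toy data are assumptions for illustration.

```python
def apply_actions(database, actions):
    """Return a sanitized copy of the database after applying the action table.

    database: dict mapping TID -> set of items
    actions:  iterable of (victim_item, tid) pairs
    """
    sanitized = {tid: set(items) for tid, items in database.items()}
    for victim, tid in actions:
        # discard() is a no-op if the item was already removed by an earlier action.
        sanitized[tid].discard(victim)
    return sanitized

database = {"T1": {1, 2, 3}, "T2": {2, 3}, "T3": {1, 3, 4}}
actions = [(3, "T1"), (3, "T2")]
new_database = apply_actions(database, actions)
```

Working on a copy keeps the original database intact, which is convenient when the sanitized and original mining results are compared afterwards.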

Sensitive Pattern Table

SP          Count
{3}         3
{2, 3}      5
{1, 3, 4}   6

Template Table

TPID   victim   UCP       SPC   MC
TP1    3        3         1     3
TP2    2        2,3       1     5
TP3    3        2,3       2     5
TP4    1        1,3,4     1     6
TP5    3        1,3,4     2     6
TP6    4        1,3,4     1     6
TP7    3        1,2,3,4   3     6

Table 2: The sensitive pattern table and the corresponding template table

3.2 Sensitive Pattern Table

The sensitive pattern table has two attributes, the sensitive itemset and its Count, as shown in Table 2. The Count of a sensitive pattern is the minimal amount by which its support must be decreased to make the pattern infrequent. Based on multiple sensitive thresholds, we state the lemma for Count as follows:

Lemma 1. Given a sensitive pattern spi, the minimal number of transactions that should be sanitized for hiding this pattern is computed as spi.Count = ⌊sup(spi) − st(spi) + 1⌋.

Proof. To hide a sensitive pattern spi, its support sup(spi) must be decreased below its sensitive threshold st(spi). Removing victim items contained in spi from the corresponding sensitive transactions makes spi infrequent. Let spi.Count be the minimal number of sanitized transactions such that the resulting support is below st(spi). Then sup(spi) − spi.Count < st(spi), so spi.Count > sup(spi) − st(spi). Therefore spi.Count is the integer part of (sup(spi) − st(spi)) + 1, that is, ⌊sup(spi) − st(spi) + 1⌋. □
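Lemma 1 can be checked numerically with a small sketch. The function name `removal_count` is hypothetical, and clamping the result at zero for patterns whose support is already below the threshold is an added assumption that the lemma itself does not state.

```python
import math

def removal_count(sup, st):
    """Minimal number of transactions to sanitize so that sup - Count < st.

    Follows Count = floor(sup - st + 1); the max(0, ...) clamp is an
    assumption for patterns that are already below their threshold.
    """
    return max(0, math.floor(sup - st + 1))

count_integer = removal_count(10, 7)      # integer threshold
count_fraction = removal_count(10, 7.5)   # fractional threshold
```

With sup = 10 and st = 7, four removals are needed (10 − 4 = 6 < 7, while 10 − 3 = 7 is not below 7); with st = 7.5, three removals suffice (10 − 3 = 7 < 7.5).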

3.3 Template Table

The initial template table is built from the sensitive pattern table, as depicted in Table 2. A template is represented in the form <TPID, victim, UCP, SPC, MC>, where TPID is the unique identifier of the template and victim is the item chosen to be removed from the corresponding sensitive transactions. A sensitive itemset of length k has k items that can serve as victims, so we can generate k templates with different victims. Taking {1, 3, 4} as an example, three templates with the victims {1}, {3}, and {4} are produced. The UCP of a template is the itemset that must be contained in the corresponding transactions; it is the union of all corresponding sensitive patterns that can be sanitized by this template. That is, if the victim is deleted from the sensitive transactions that contain the UCP, the support of each corresponding sensitive pattern decreases. For instance, in Table 2, TP1 for hiding {3} deletes {3} from the transactions containing {3}; TP3 for hiding {2, 3} and {3} deletes {3} from the transactions containing {2, 3}; and so on. The SPC is the number of sensitive patterns that can be sanitized by this template. For example, TP3 in Table 2 can hide the two sensitive patterns {3} and {2, 3} at the same time, so its SPC is 2. The MC is the minimal amount by which the support must be decreased so that all corresponding sensitive patterns of this template are hidden. Hence the MC is the maximum Count among all corresponding sensitive patterns of this template. For instance, the MC of TP3 in Table 2 is max{3, 5} = 5.
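The construction of the basic templates can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the names `covered` and `basic_templates` are hypothetical, and the Count values 3, 5, and 6 are taken from the max expressions in the text, with their assignment to the individual patterns {3}, {2, 3}, and {1, 3, 4} being an assumption.

```python
def covered(victim, ucp, counts):
    """Sensitive patterns whose support drops when `victim` is deleted
    from transactions containing `ucp`: they contain the victim and
    are subsets of the UCP."""
    return [sp for sp in counts if victim in sp and sp <= ucp]

def basic_templates(counts):
    """One template per (sensitive pattern, victim item) pair."""
    templates = []
    for sp in counts:
        for victim in sorted(sp):
            hit = covered(victim, sp, counts)
            templates.append({
                "victim": victim,
                "UCP": sp,
                "SPC": len(hit),                   # number of patterns sanitized
                "MC": max(counts[p] for p in hit)  # largest Count among them
            })
    return templates

# Assumed Count per sensitive pattern (see the lead-in above).
counts = {
    frozenset({3}): 3,
    frozenset({2, 3}): 5,
    frozenset({1, 3, 4}): 6,
}
templates = basic_templates(counts)
```

With these inputs the third generated template has victim 3 and UCP {2, 3}, and its SPC and MC evaluate to 2 and 5, which agrees with the description of TP3.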

In addition to the templates introduced above, we generate joint templates to cover more sensitive patterns. If two templates have the same victim and neither UCP contains the other, we can join them into a new template. The UCP of the joint template is the union of the UCPs of the combined templates, and the SPC and the MC are computed over all corresponding sensitive patterns. As shown in Table 2, TP3 and TP5 can be combined to generate TP7. Its UCP, the union of {2, 3} and {1, 3, 4}, is {1, 2, 3, 4}. The SPC of TP7 is 3 because removing {3} from a transaction containing {1, 2, 3, 4} decreases the supports of three sensitive patterns, {3}, {2, 3}, and {1, 3, 4}. The MC of TP7 is max{3, 5, 6} = 6. Consequently, TP7 is a better choice than TP3 and TP5 because its SPC is larger, so it can hide more patterns at the same time. We use a hash table to avoid generating a template that duplicates an existing one, converting the pattern index from binary to decimal to obtain the hash key. The SPC and the MC of all templates are computed by referring to the sensitive pattern table, and the computation algorithm is similar to that in [5].
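The join step and the duplicate check can be sketched as below. Several points are assumptions made for illustration: the pattern index is read as a bit vector with one bit per sensitive pattern covered by the template, the victim is stored together with the decimal key so that templates with different victims never collide, and the names `pattern_key` and `join_templates` are hypothetical. The pattern index of [5] may differ in detail.

```python
def pattern_key(indices):
    """Read the covered pattern indices as a binary number and
    return its decimal value (bit i set means pattern i is covered)."""
    return sum(1 << i for i in indices)

def join_templates(t1, t2, patterns, counts):
    """Join two templates with the same victim whose UCPs do not
    contain each other; return None if they cannot be joined."""
    if t1["victim"] != t2["victim"]:
        return None
    a, b = t1["UCP"], t2["UCP"]
    if a <= b or b <= a:
        return None
    victim, ucp = t1["victim"], a | b
    hit = [i for i, sp in enumerate(patterns) if victim in sp and sp <= ucp]
    return {
        "victim": victim,
        "UCP": ucp,
        "SPC": len(hit),
        "MC": max(counts[i] for i in hit),
        "key": pattern_key(hit),
    }

# Sensitive patterns and assumed Counts, in index order 0, 1, 2.
patterns = [frozenset({3}), frozenset({2, 3}), frozenset({1, 3, 4})]
counts = [3, 5, 6]
tp1 = {"victim": 3, "UCP": frozenset({3})}
tp3 = {"victim": 3, "UCP": frozenset({2, 3})}
tp5 = {"victim": 3, "UCP": frozenset({1, 3, 4})}

tp7 = join_templates(tp3, tp5, patterns, counts)

# Hash table of generated templates; a repeated (victim, key) is skipped.
seen = {}
seen.setdefault((tp7["victim"], tp7["key"]), tp7)
```

Here the joint template covers all three patterns, so its key is binary 111, that is, decimal 7.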

3.4 Choosing Strategy and Updating Process

Based on the essence of our hiding strategy, “minimizing side effects”, we choose the template with the largest SPC in each round. If more than one template has the same SPC, the one with the smallest MC is selected. If a tie remains, we choose the template whose victim has the lowest support in the database. Finally, if the tie is still not resolved, a template is picked at random. After a template is chosen, the corresponding sensitive transactions are located through the transaction index. If the number of sensitive transactions is larger than the MC of the chosen template, we move the MC shortest transactions to the action table for sanitizing; otherwise all corresponding sensitive transactions are sanitized. The Count and the MC are then recomputed according to the number of sanitized transactions. If some patterns are hidden by this template, the SPC and the UCP are updated accordingly. When the SPC of a template becomes zero, we remove this template from the template table. Lastly, the TIDs of the sanitized transactions are removed from the transaction index of the corresponding sensitive patterns of this template.

In order to achieve better hiding performance, the number of the victim items in one transaction is not restricted.
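The selection rules of this section can be sketched as follows. This is a simplified illustration under stated assumptions: templates are plain dictionaries, `item_support` maps each item to its support in the database, `db` maps each TID to its set of items, and the function names are hypothetical. The updating of Count, MC, SPC, and UCP after sanitizing is omitted.

```python
import random

def choose_template(templates, item_support, rng=random):
    """Largest SPC first, then smallest MC, then the victim with the
    lowest support; any remaining tie is broken at random."""
    def rank(t):
        return (-t["SPC"], t["MC"], item_support[t["victim"]])
    best = min(rank(t) for t in templates)
    ties = [t for t in templates if rank(t) == best]
    return rng.choice(ties)

def pick_transactions(tids, db, mc):
    """Return all sensitive transactions if there are at most MC of
    them, otherwise the MC shortest ones."""
    if len(tids) <= mc:
        return list(tids)
    return sorted(tids, key=lambda tid: len(db[tid]))[:mc]

templates = [
    {"victim": 3, "SPC": 2, "MC": 5},
    {"victim": 3, "SPC": 3, "MC": 6},
    {"victim": 1, "SPC": 3, "MC": 6},
]
item_support = {1: 4, 3: 9}
chosen = choose_template(templates, item_support)

db = {10: {1, 2, 3, 4, 5, 6}, 11: {1, 2, 3, 4}, 12: {1, 2, 3, 4, 5}}
selected = pick_transactions([10, 11, 12], db, 2)
```

In this example the last two templates tie on SPC and MC, and the one with victim 1 is chosen because item 1 has the lower support.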


3.5 Performance and Efficiency Improvement

To improve the performance and efficiency of our framework, we propose the concept of revised border itemsets for reducing redundant work. Because of the monotonic property of frequent patterns, hiding a sensitive pattern also hides all supersets of this pattern. Hence, in the sanitization process, we only need to hide the sensitive patterns that have no sensitive subsets. Such an itemset is called a border itemset. For instance, if {1, 2, 3}, {1, 3}, and {1} are three sensitive patterns, then {1} is a border itemset but {1, 2, 3} and {1, 3} are not. Once the border itemsets of the sensitive patterns have been hidden, all sensitive information is protected. However, this technique cannot be applied when hiding frequent patterns with multiple sensitive thresholds. When the sensitive threshold of a super-itemset is smaller than that of the itemset, hiding only the border itemsets cannot guarantee the protection of all sensitive frequent patterns. Therefore, we define the revised border itemsets as all the itemsets that have no sensitive subsets with lower support thresholds than their own. In the above example, if the sensitive thresholds of {1, 2, 3}, {1, 3}, and {1} are 2, 4, and 3, respectively, then {1} and {1, 2, 3} are revised border itemsets. If the revised border itemsets are hidden, all the sensitive knowledge is protected. In addition, the techniques of pattern index [5] and inverted file are also applied in our framework to increase efficiency.
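The definition of revised border itemsets can be checked with a short sketch. The function name `revised_border` is hypothetical, and the comparison follows the wording of the text literally: an itemset is excluded only when some sensitive proper subset has a strictly lower threshold.

```python
def revised_border(thresholds):
    """Sensitive itemsets that have no sensitive proper subset with a
    lower sensitive threshold than their own."""
    border = []
    for sp, st in thresholds.items():
        dominated = any(
            other < sp and other_st < st  # proper subset with lower threshold
            for other, other_st in thresholds.items()
        )
        if not dominated:
            border.append(sp)
    return border

# Sensitive thresholds from the example in the text.
thresholds = {
    frozenset({1, 2, 3}): 2,
    frozenset({1, 3}): 4,
    frozenset({1}): 3,
}
border = revised_border(thresholds)
```

For the example thresholds, {1, 3} is dropped because its subset {1} has the lower threshold 3, while {1, 2, 3} is kept because neither of its sensitive subsets has a threshold below 2.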

4 Related Works

The problem of hiding frequent patterns and association rules was first proposed in [9]. The authors proved that finding an optimal sanitization for hiding frequent patterns is an NP-hard problem and proposed a heuristic approach that deletes items from transactions in the database to hide sensitive frequent patterns. In recent years, more and more researchers have started paying attention to privacy issues. Subsequent approaches can be classified into two categories: data modification and data reconstruction.

Data modification. The main idea of this group is to alter the original database such that the sensitive information cannot be mined from the new database. To decrease the support or confidence of sensitive rules below the user-predefined threshold, these algorithms choose some items as victims and delete them from or insert them into some transactions [6]. In [17], the authors present a novel approach that uses a disclosure parameter instead of the support threshold to directly control the balance between privacy requirements and information preservation. Based on the disclosure parameter, the support of each sensitive pattern is decreased by the same proportion. The proposed IGA algorithm groups sensitive patterns first, and then chooses the victim items that cause minimal side effects on the database. In [7], the border-based concept was proposed to efficiently evaluate the impact of any modification on the database. The quality of the database and the relative frequency of each frequent itemset can be well maintained by greedily selecting the modifications with minimal side effects. In [4], the authors propose