An Intelligent Query Assistance for a Multidimensional Association Mining System

Academic year: 2021


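National University of Kaohsiung, Department of Electrical Engineering (Graduate Program, Computer Division), Master's Thesis.

An Intelligent Query Assistance for a Multidimensional Association Mining System (一個用於多維度關聯規則探勘系統的智慧型查詢助理)

Graduate student: Chang-Long Jiang (江長隆)
Advisor: Dr. Wen-Yang Lin (林文揚)

July 2010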


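Acknowledgements

First, I thank my advisor, Dr. Wen-Yang Lin (林文揚). Throughout my time under his supervision, in study, in research, and in personal conduct alike, I benefited deeply from his patient and unstinting guidance, which built up from nothing the vision and abilities a graduate student should have. He also showed me great care and help in daily life, and my gratitude to him is beyond words. I am equally indebted to my senior labmate 吳錦昂, the other driving force behind this thesis, whose meticulous advice on concepts and revisions to my writing made it possible for me to complete it.

I also thank Prof. 曾守正, who served as convener of the oral defense committee, and the committee members Prof. 曾新穆 and Prof. 洪宗貝, for taking the time to attend my defense and for their valuable advice and encouragement on my research and its further development. During this period, taking part in Prof. 洪宗貝's group meetings also broadened my knowledge and my ways of thinking considerably.

My family has been the driving force behind my progress. I thank my mother, 吳尚美, for her thoughtfulness and encouragement, which allowed me to persist in my studies and realize my goals. I also thank my seniors and classmates in the graduate program: 明正, 家輝, and 俊豪, whose analyses of and insights into my presentations sparked many new ideas; and 科瑋, 和益, 佑恩, 子翔, 彥合, and 為中, whose company made this period of research fulfilling, enjoyable, and full of energy. Finally, I offer my deepest thanks to all the friends who helped and encouraged me along the way.

Chang-Long Jiang, 2010/7/14, at CIL LAB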

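An Intelligent Query Assistance for a Multidimensional Association Mining System

Advisor: Dr. (Professor) Wen-Yang Lin, Graduate Institute of Computer Science and Information Engineering, National University of Kaohsiung
Student: Chang-Long Jiang, Graduate Institute of Electrical Engineering, National University of Kaohsiung

Abstract (Chinese version)

Association mining over large databases can uncover interesting associations among items. The mined association rules are very useful to decision-making processes in business, science, medicine, and many other fields. Multidimensional association mining allows item associations to be explored across different attributes (dimensions). Users can delimit the mining data more precisely by setting the mining attributes of interest, the data granularity, and any other available filtering conditions, which brings the mined rules closer to what the users need. However, for an inexperienced user, formulating a correct and effective query, and especially setting reasonable thresholds, is a challenge. When mining association rules, the Apriori algorithm requires the user to specify a minimum support that decides whether an itemset is frequent, so the minimum support strongly influences the mining results. Unfortunately, setting the support threshold is subjective and has no clear standard, so users usually proceed by repeated trial and error until the results converge to something satisfactory.

This thesis has two research topics. First, we implement a user interface that exposes the intelligent assistance functions embedded in the OntoWM system, a system framework for multidimensional association mining that incorporates ontologies. Second, we build a mechanism that generates a minimum support suited to the user's mining intention and suggests it to the user. The method uses the mining history formed by past query records as a resource in the style of case-based reasoning. The system finds the K nearest query records similar to the user's mining intention, aggregates them, and derives a favorable support range for the user's reference. We also present experiments that use statistical data to show how query formulation differs with and without the intelligent assistance. The results show that the mining process is more efficient with the intelligent assistance.

Keywords: association rule mining, data mining query language, intelligent assistant, ontology.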

An Intelligent Query Assistance for a Multidimensional Association Mining System

Advisor: Dr. (Professor) Wen-Yang Lin, Department of Computer Science and Information Engineering, National University of Kaohsiung
Student: Chang-Long Jiang, Department of Electrical Engineering, National University of Kaohsiung

ABSTRACT

Association mining discovers interesting associations among items in large data sets. The mined association rules can help decision-making processes in business, science, medicine, and many other fields. Multidimensional association mining allows the exploration of associations among items in different attributes (dimensions). Users can specify the mining data more precisely by setting the mining attributes of interest, the data granularity, and optional filtering conditions, so the mined rules tend to be closer to what the users want. Yet it is a challenge for an inexperienced user to formulate a correct and effective query, especially when setting reasonable thresholds. The Apriori algorithm for mining association rules requires the user to specify a minimum support that determines whether an itemset is frequent, so the minimum support strongly influences the mining results. Unfortunately, setting the support threshold is subjective and has no clear standard. Users usually resort to repeated trial and error until the mining process converges to satisfactory results.

This thesis has two research focuses. First, we implement a user interface that realizes the intelligent assistance functions embedded in the OntoWM system, a system framework for multidimensional association mining that incorporates ontologies. Second, we develop a mechanism that generates a minimum support suited to the user's mining intention and suggests it to the user. The method uses the query log of the mining history as a resource in the style of case-based reasoning. The system finds in the query log the K nearest neighbors of the user's mining intention, aggregates them, and obtains a favorable support range for the user's reference. We also report experiments with statistical data drawn from the mining process with and without intelligent assistance in query formulation. The results show that the mining process is more efficient with intelligent assistance.

Keywords: multidimensional association mining, data mining query formulation, intelligent query assistance, ontology.

Contents

Oral Defense Committee Approval Form
Acknowledgements
Chinese Abstract
ABSTRACT
Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Thesis Organization
Chapter 2 Background Knowledge and Related Work
  2.1 Multidimensional Data Model
    2.1.1 N-dimensional Data Schema
    2.1.2 Concept Hierarchies
  2.2 Association Mining Algorithms
    2.2.1 The Apriori Algorithm
    2.2.2 The FP-Growth Algorithm
  2.3 Ontology
  2.4 Multidimensional Association Mining
  2.5 Related Work
Chapter 3 The OntoWM System Framework
Chapter 4 The Method of Minimum Support Suggestion
  4.1 The Affection of Hierarchical Relationships
  4.2 The Method of Minimum Support Calculation
    4.2.1 Matching Transaction ID (t_G)
    4.2.2 Matching Interested Mining Attributes (t_M)
    4.2.3 Matching "where" Condition (wc)
    4.2.4 Aggregating Query Similarity and Suggesting the Minimum Support Range
Chapter 5 Experimental Evaluation for Minimum Support Suggestion
  5.1 Experimental Environment
  5.2 Experimental Results
Chapter 6 Implementation of Minimum Support Suggestion in OntoWM
  6.1 Intelligent Query Assistance in OntoWM
  6.2 Intelligent Support Suggestion in OntoWM
Chapter 7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work
References

List of Figures

Figure 1. The structure of a star schema
Figure 2. Concept hierarchical and lattice structures in dimensions
Figure 3. An illustration of the search strategy of the Apriori algorithm
Figure 4. The concept of the Frequent-Pattern Growth algorithm
Figure 5. The system framework of the multidimensional association mining system with intelligent assistance in query formulation
Figure 6. Schema ontology
Figure 7. A schema constraint ontology
Figure 8. The domain ontology example of 3C production
Figure 9. An example of user preference ontology
Figure 10. Query log
Figure 11. Concept hierarchies of the time dimension
Figure 12. Data distribution with original data granularity
Figure 13. Data distribution with granularity by date
Figure 14. Data distribution with granularity by month
Figure 15. Data distribution with t_M = {Product, Product.Subcategory}
Figure 16. The average time with and without using the proposed suggestion mechanism
Figure 17. The average elapsed time spent on obtaining satisfied association rules with and without using the proposed suggestion mechanism
Figure 18. The flowchart of a query formulation and assistance process
Figure 19. The incorporation of syntax with the query formulation interface
Figure 20. User interface for query formulation
Figure 21. Operation interface of the "where" condition
Figure 22. Schema constraint: mining only attribute
Figure 23. Schema constraint: grouping only attribute
Figure 24. Schema constraint: exclusive attributes
Figure 25. Schema constraint: concept hierarchy
Figure 26. Support suggestion method feeding back the minimum support range

List of Tables

Table 1. An example mining data set
Table 2. Part of multi-dimensional transactions
Table 3. Multi-dimensional association rules discovered from Table 2
Table 4. All cases of μ and ℓ
Table 5. Data characteristics of FoodMart2000
Table 6. Test query 1 (easy)
Table 7. Test query 2 (medium)
Table 8. Test query 3 (hard)

Chapter 1 Introduction

1.1 Motivation

In the information era, using data efficiently to increase business competitiveness is an irreversible trend. Information technologies have provided many applicable solutions, but many decision analyses still depend heavily on users' manual judgment. We therefore expect a good system to provide efficient, convenient, and intelligent support mechanisms that meet users' professional demands. In other words, developing a system environment in which users can repeatedly and efficiently discover important rules or patterns is an issue that researchers in the knowledge mining field cannot avoid.

Data mining has become a useful technology for discovering previously unknown knowledge in large data sets. Many different types of data patterns can be discovered, through classification, prediction, clustering, and association mining. Association mining discovers interesting associations among items in large data sets. The mined association rules can help decision-making processes in business, science, medicine, and many other fields. Market basket analysis is a typical example of association rule mining: it explores relationships among the products that customers purchase, which reflect the customers' buying habits. When association rule mining was first introduced, it was typically applied to retailers' basket data; that is, it focused on purchase analysis, which discovers associations solely within the product dimension. Later, multidimensional association mining from data cubes or multidimensional data warehouses was proposed [1, 2].

Market basket analysis involves items in the single attribute "Product". Multidimensional association mining, in contrast, allows the exploration of associations among items in different attributes. Users can specify the mining data more precisely by setting the mining attributes of interest, the data granularity, and optional filtering conditions, so the mined rules tend to be closer to what the users want. Yet it is a challenge for an inexperienced user to formulate a correct and effective query, especially when setting reasonable thresholds. A data warehouse is usually very large, with many dimensions and attributes. While formulating a query, a user might encounter the following challenges: (1) improper selection of attributes or conditional constraints due to unfamiliarity with the data warehouse, which might produce redundant rules or too few rules; and (2) repetitive tuning of the minimum support or minimum confidence. The Apriori algorithm for mining association rules requires the user to specify a minimum support that determines whether an itemset is frequent, so the minimum support strongly influences the mining results. Unfortunately, setting the support threshold is subjective and has no clear standard. Users usually resort to repeated trial and error until the mining process converges to satisfactory results. If intelligent assistance is provided while the user is formulating a query, both the efficiency and the effectiveness of mining can be improved.

In [19, 20], our research group proposed a system framework called OntoWM that incorporates ontologies to provide intelligent support for query formulation in mining multidimensional association rules from data warehouses. In this thesis we implement a user interface that realizes the intelligent assistance provided by this system framework, and we provide a method that generates a minimum support suited to the user's mining intention and suggests it to the user. The method uses the query log of the mining history as a resource in the style of case-based reasoning. The system finds in the query log the K nearest neighbors of the user's mining intention, aggregates them, and obtains a favorable support range for the user's reference. We also report experiments with statistical data drawn from the mining process with and without intelligent assistance in query formulation. The data suggest that the mining process is more efficient with intelligent assistance.

1.2 Contributions

In this thesis, we consider the computation of query similarity. The similarity functions are: matching transaction ID (t_G), matching interested mining attributes (t_M), and matching the "where" condition (wc). With these functions we can determine the top-N most similar query log records, which contain minimum support information, and from this knowledge the minimum support suggestion mechanism generates a suggested minimum support range.

We also implement the intelligent query assistant based on the system framework proposed by our research group. It provides syntax checking, semantic checking, and parameter suggestion. This assistance substantially reduces the effort users put into the mining process. According to the experiment we conducted, the back-and-forth interaction needed to converge to satisfactory results is shortened. The experimental results show that our support suggestion method significantly reduces both the number of times the user tunes the minimum support and the total mining time needed to find satisfactory mining results. Thus the efficiency of the mining process is improved.
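The suggestion idea (find the most similar logged queries, then aggregate their minimum supports into a range) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity here is a plain Jaccard overlap on the mining attributes, which stands in for the t_G, t_M, and wc matching functions the thesis defines in Chapter 4, and the record layout (`attrs`, `ms`) and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two attribute collections (assumed stand-in similarity)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def suggest_support_range(query_attrs, query_log, k=3):
    """Return (low, high) minimum support taken from the k most similar logged queries.

    query_log is a list of dicts with hypothetical keys:
      "attrs": the mining attributes of a past query
      "ms":    the minimum support that past query ended up using
    """
    if not query_log:
        raise ValueError("query log is empty; nothing to learn from")
    ranked = sorted(
        query_log,
        key=lambda rec: jaccard(query_attrs, rec["attrs"]),
        reverse=True,
    )
    nearest = [rec["ms"] for rec in ranked[:k]]
    return min(nearest), max(nearest)


log = [
    {"attrs": ["ProdName", "CustAge"], "ms": 0.05},
    {"attrs": ["ProdName"], "ms": 0.10},
    {"attrs": ["ProdName", "CustAge", "City"], "ms": 0.03},
    {"attrs": ["City"], "ms": 0.40},
]
print(suggest_support_range(["ProdName", "CustAge"], log, k=2))  # (0.03, 0.05)
```

Reporting a range rather than a single value mirrors the text above, which speaks of a "favorable support range"; how the range is actually aggregated in the thesis is not shown in this chapter.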

1.3 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 introduces background knowledge and related work. Chapter 3 presents the multidimensional association rule mining system with intelligent assistance. Chapter 4 describes the method of minimum support suggestion, and Chapter 5 reports experiments, conducted with and without intelligent assistance, that evaluate how well the assistance supports query formulation. Chapter 6 describes the implementation of the intelligent assistance in the OntoWM system framework. Finally, Chapter 7 gives the conclusions and future work.

Chapter 2 Background Knowledge and Related Work

In this chapter, we describe the background knowledge related to our research, which includes the multidimensional data model, association mining algorithms, and multidimensional association rule mining.

2.1 Multidimensional Data Model

The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities or objects and the relationships between them. Such a data model is appropriate for online transaction processing. Data warehouses, however, require a concise, subject-oriented schema that facilitates online data analysis. Data warehouses and OLAP tools are based on a multidimensional data model. In this section, we review how to model an N-dimensional data schema and introduce the concept hierarchies among data.

2.1.1 N-dimensional Data Schema

The multidimensional data model is a popular model for data warehouse systems; its forms include the star schema, the snowflake schema, and the fact constellation schema [15]. The star schema is the basis of the other schemas. It contains a large central table (the fact table) and a set of smaller attendant tables (the dimension tables), one for each dimension. The fact table consists of numeric measures and foreign keys that connect it to the dimension tables. Each dimension table contains a set of attributes that represent the user's business perspectives and are often organized as a hierarchy. Figure 1 shows the structure of a star schema.

[Figure 1: a central Fact Table (Dim1 ID, Dim2 ID, Dim3 ID, Measure 1, Measure 2) linked to the dimension tables Dim 1 (Dim1 ID, Att 1.1, Att 1.2, ...), Dim 2 (Dim2 ID, Att 2.1, ...), and Dim 3 (Dim3 ID, Att 3.1, Att 3.2, ...)]

Figure 1 The structure of a star schema

2.1.2 Concept Hierarchies

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. For example, consider the concept hierarchy for the location dimension. Each city in the location dimension is mapped to the province or state that contains it. The provinces and states can in turn be mapped to the country that contains them. These mappings form the concept hierarchy of the location dimension. In the multidimensional model, data are organized into multiple dimensions, and some of the dimensions contain multiple levels of abstraction defined by concept hierarchies. This information gives users the flexibility to view data from different perspectives. The concept hierarchy described above is illustrated in Figure 2.

[Figure 2: levels shown per dimension. Dim Product: Product_family, Product_category, Product_subcategory, Product_name. Dim Time: Year, Quarter of year, Week of year, Month of Quarter, Date. Dim Location: Country, State / Province, City.]

Figure 2 Concept hierarchical and lattice structures in dimensions

2.2 Association Mining Algorithms

In recent years, there have been many studies on efficient association rule mining [7, 25, 32]. An association rule has the form A → B, where A and B are sets of items. It indicates that a transaction that contains A tends to also contain B. The percentage of transactions containing A that also contain B is called the confidence of the rule. The support of the rule is the percentage of all transactions that contain both A and B. The support and the confidence of an association rule must satisfy the user-specified minimum support and minimum confidence, respectively.

In general, the process of association rule mining can be decomposed into two phases: (1) frequent itemset generation: find all frequent itemsets, that is, those that satisfy the minimum support; and (2) rule construction: use the frequent itemsets to generate association rules whose confidence exceeds the minimum confidence. For example, the following association rule is obtained from Table 1:

IBM60GB (A) → IBM TP (B) (sup = 40%, conf = 60%)

The rule says that 40% of the transactions (6 of 15) contain both "IBM60GB" and "IBM TP", and that 60% of the transactions that contain "IBM60GB" (6 of 10) also contain "IBM TP".

Table 1 An example mining data set

TID   List of Items
1     A B C
2     A C D
3     B C
4     A C D
5     A B C
6     C D E
7     B E
8     A D
9     A B C
10    B C D
11    A B C D
12    A B D
13    A C D
14    B C D
15    A B D

* A: IBM60GB; B: IBM TP; C: RAM 512MB; D: Ink Cartridge; E: Hard Disk
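The support and confidence figures quoted for the rule A → B can be checked directly against Table 1. The following is a minimal sketch, not code from the thesis; the function names `count`, `support`, and `confidence` are our own.

```python
# Transactions from Table 1, one string of item codes per TID (1..15).
TRANSACTIONS = [frozenset(items) for items in (
    "ABC", "ACD", "BC", "ACD", "ABC", "CDE", "BE", "AD",
    "ABC", "BCD", "ABCD", "ABD", "ACD", "BCD", "ABD",
)]


def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for items in transactions if itemset <= items)


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(body, head, transactions=TRANSACTIONS):
    """Fraction of the transactions containing `body` that also contain `head`."""
    both = frozenset(body) | frozenset(head)
    return count(both, transactions) / count(body, transactions)


print(support("AB"))         # 0.4  (6 of 15 transactions)
print(confidence("A", "B"))  # 0.6  (6 of the 10 transactions containing A)
```

Counting first and dividing once keeps the arithmetic exact enough that the printed values match the 40% and 60% quoted for the rule.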

In this section, we introduce the two most influential algorithms, the Apriori algorithm and the FP-Growth algorithm.

2.2.1 The Apriori Algorithm

The Apriori algorithm proposed by Agrawal and Srikant [25] is a well-known algorithm for mining association rules. It is based on the anti-monotone property: if a k-itemset is not frequent, then none of its supersets is frequent either. The Apriori algorithm adopts a bottom-up, level-wise search strategy for discovering all frequent itemsets, which means that discovering all frequent k-itemsets takes k iterations. Each iteration is composed of two phases: (1) generate new candidate itemsets, and (2) count the supports of the candidates to obtain the frequent patterns. Figure 3 illustrates the search strategy performed by the Apriori algorithm over the example data set in Table 1.

[Figure 3: itemset lattice over Table 1 with Minimum Support = 0.4 = 6 / 15. A number after the colon is the support count of an itemset that was counted.
5-itemsets: <A, B, C, D, E>
4-itemsets: <A, B, C, D>, <A, B, C, E>, <A, B, D, E>, <A, C, D, E>, <B, C, D, E>
3-itemsets: <A, B, C : 4>, <A, B, D>, <A, B, E>, <A, C, D : 4>, <A, C, E>, <A, D, E>, <B, C, D>, <B, C, E>, <B, D, E>, <C, D, E>
2-itemsets: <A, B : 6>, <A, C : 7>, <A, D : 7>, <A, E>, <B, C : 7>, <B, D : 5>, <B, E>, <C, D : 7>, <C, E>, <D, E>
1-itemsets: <A : 10>, <B : 10>, <C : 11>, <D : 10>, <E : 2>]

Figure 3 An illustration of the search strategy of the Apriori algorithm
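The level-wise search can be sketched as follows. This is a compact illustration of the candidate-generation and counting phases described above, run on the Table 1 data; it is not the thesis's implementation, and the function name `apriori` and the use of an absolute count threshold (6 of 15 transactions) are our own choices.

```python
from itertools import combinations

# Transactions from Table 1, one string of item codes per TID (1..15).
TRANSACTIONS = [frozenset(t) for t in (
    "ABC", "ACD", "BC", "ACD", "ABC", "CDE", "BE", "AD",
    "ABC", "BCD", "ABCD", "ABD", "ACD", "BCD", "ABD",
)]


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: every single item is a candidate.
    singles = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(singles).items() if n >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # Phase 1a (join): merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(frequent), 2)
                      if len(a | b) == k}
        # Phase 1b (prune): by the anti-monotone property, every
        # (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Phase 2: count supports and keep the frequent candidates.
        frequent = {c: n for c, n in count(candidates).items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result


frequent_itemsets = apriori(TRANSACTIONS, min_count=6)
for itemset, n in sorted(frequent_itemsets.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), n)
```

With a threshold of 6, only ABC and ACD survive pruning as 3-item candidates (BD has count 5, so ABD and BCD are never counted), and both reach only 4, which agrees with the counts annotated in Figure 3.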

2.2.2 The FP-Growth Algorithm

The primary disadvantage of the Apriori algorithm is that it needs to generate candidate itemsets. To overcome this drawback, Han et al. [32] proposed an algorithm called Frequent-Pattern Growth (FP-Growth), in which candidate itemsets are not generated. The algorithm adopts a divide-and-conquer strategy with two phases: (1) compress the data set, restricted to its frequent items, into a frequent-pattern tree (FP-tree), and (2) divide the compressed FP-tree into a set of conditional FP-trees, each associated with one frequent item. Figure 4 depicts the main process of executing FP-Growth over the example in Table 1.

[Figure 4, Minimum Support = 0.4 = 6 / 15.

Transactions with the infrequent item E removed and items ordered by descending support:
1: C A B; 2: C A D; 3: C B; 4: C A D; 5: C A B; 6: C D; 7: B; 8: A D; 9: C A B; 10: C B D; 11: C A B D; 12: A B D; 13: C A D; 14: C B D; 15: A B D

Header table (item: support): C: 11, A: 10, B: 10, D: 10, E: 2

FP-tree (node: count, indentation shows parent and child):
FP-Root
  C: 11
    A: 7
      B: 4
        D: 1
      D: 3
    B: 3
      D: 2
    D: 1
  A: 3
    B: 2
      D: 2
    D: 1
  B: 1

Frequent pattern generation:
D: (C,A,B):1, (C,A):3, (C,B):2, (C):1, (A):1, (A,B):2 ⇒ C:7, A:7, B:5 ⇒ {DC, DA}
B: (C,A):4, (C):3, ( ):1, (A):2 ⇒ C:7, A:6 ⇒ {BC, BA}
A: (C):7, ( ):3 ⇒ C:7 ⇒ {AC}
C: ( ):11]

Figure 4 The concept of the Frequent-Pattern Growth algorithm
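The first phase of FP-Growth, together with the conditional pattern bases listed in Figure 4, can be reproduced with a short sketch. This is an illustration only, not the thesis's code: it builds the tree and extracts prefix paths but does not recurse into conditional FP-trees, and the names `Node`, `build_fp_tree`, and `conditional_pattern_base` are our own. Ties in support are broken alphabetically, an assumption that happens to give the C, A, B, D order used in the figure.

```python
from collections import Counter

# Transactions from Table 1, one string of item codes per TID (1..15).
TRANSACTIONS = [frozenset(t) for t in (
    "ABC", "ACD", "BC", "ACD", "ABC", "CDE", "BE", "AD",
    "ABC", "BCD", "ABCD", "ABD", "ACD", "BCD", "ABD",
)]


class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Return (root, header); header maps each frequent item to its tree nodes."""
    freq = Counter(i for t in transactions for i in t)
    # Keep frequent items only, ordered by descending support, then by name.
    order = sorted((i for i in freq if freq[i] >= min_count),
                   key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root, header = Node(None, None), {i: [] for i in order}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header


def conditional_pattern_base(header, item):
    """Map each prefix path leading to `item` to the count of that `item` node."""
    base = {}
    for node in header[item]:
        path, parent = [], node.parent
        while parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        base[tuple(reversed(path))] = node.count
    return base


root, header = build_fp_tree(TRANSACTIONS, min_count=6)
print(conditional_pattern_base(header, "D"))
```

The pattern base printed for D contains the six prefix paths listed in Figure 4, with the same counts; summing them per item gives C:7, A:7, B:5.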

2.3 Ontology

The term ontology originated in philosophy, where it concerns the concept of reality. In recent years, ontologies have been adopted in the field of artificial intelligence as a tool for expressing, sharing, and reusing knowledge. Different studies define the term "ontology" differently. According to the W3C Web Ontology Working Group [33]: "An ontology formally defines a common set of terms that are used to describe and represent a domain knowledge." T. Gruber [8] defines an ontology as: "A set of definitions of content-specific knowledge representation primitives: classes, relations, functions, and object constants." Thus, an ontology can be used to build up the knowledge of a domain, and through it the relationships among the concepts in the domain can be further analyzed. Ontologies can be used in many domains, such as intelligent network agents, electronic commerce, knowledge management, and data mining.

2.4 Multidimensional Association Mining

An association rule has the form A → B, where A and B are sets of items; A is the body and B is the head. The rule indicates that customers who buy the items in A are likely to buy the items in B at the same time. For A → B to be an interesting rule, two conditions must be satisfied:

1. More than a user-specified percentage of the transactions must contain both A and B. This percentage is called the minimum support, and the itemsets that meet the minimum support are called frequent itemsets.
2. Among all the transactions that contain A, more than a user-specified percentage must also contain B. This percentage is called the minimum confidence.

Multidimensional association mining is association rule mining that involves one or more attributes (dimensions). The mining meta-pattern of a multidimensional association mining query is defined as follows:

MP: <t_G, t_M, [wc], [hc], ms, mc>

where t_G, t_M, wc, hc, ms, and mc are the components of a query:

t_G: the set of transaction ID attributes (data granularity),
t_M: the set of interested mining attributes,
wc: the optional "where" condition(s),
hc: the optional "having" condition(s),
ms: the minimum support, and
mc: the minimum confidence.

The following is an example query for multidimensional association mining:

Mining: ProdName, CustAge
Grouping By: CustID, TimeID
Where: Country="Taiwan"
Support threshold: 65%
Confidence threshold: 70%

It indicates that the user is interested in mining the associations between products and customer ages from customers' daily transactions in Taiwan, with a minimum support of 65% and a minimum confidence of 70%.
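One way to make the meta-pattern concrete is to hold its six components in a small record type. The sketch below is an assumption-laden illustration, not the OntoWM data structure: the class name `MiningQuery`, the field names, the validation rules, and the use of "AND" to join several conditions and of a "Having:" line are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MiningQuery:
    """One instance of the meta-pattern MP: <t_G, t_M, [wc], [hc], ms, mc>."""
    t_g: List[str]   # transaction ID attributes (data granularity)
    t_m: List[str]   # interested mining attributes
    ms: float        # minimum support, as a fraction in (0, 1]
    mc: float        # minimum confidence, as a fraction in (0, 1]
    wc: List[str] = field(default_factory=list)  # optional "where" conditions
    hc: List[str] = field(default_factory=list)  # optional "having" conditions

    def __post_init__(self):
        if not self.t_g or not self.t_m:
            raise ValueError("t_G and t_M must each name at least one attribute")
        for name, value in (("ms", self.ms), ("mc", self.mc)):
            if not 0.0 < value <= 1.0:
                raise ValueError(f"{name} must be in (0, 1], got {value}")

    def describe(self):
        """Render the query in the textual layout of the example query."""
        lines = ["Mining: " + ", ".join(self.t_m),
                 "Grouping By: " + ", ".join(self.t_g)]
        if self.wc:
            lines.append("Where: " + " AND ".join(self.wc))
        if self.hc:
            lines.append("Having: " + " AND ".join(self.hc))
        lines.append(f"Support threshold: {self.ms:.0%}")
        lines.append(f"Confidence threshold: {self.mc:.0%}")
        return "\n".join(lines)


query = MiningQuery(
    t_g=["CustID", "TimeID"],
    t_m=["ProdName", "CustAge"],
    ms=0.65,
    mc=0.70,
    wc=['Country="Taiwan"'],
)
print(query.describe())
```

Keeping wc and hc as optional lists mirrors the square brackets in the meta-pattern, and rejecting thresholds outside (0, 1] catches the kind of malformed query an assistant would want to flag before any mining starts.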

Following the work in [32], multidimensional association rules can be divided into three types: intra-dimensional, inter-dimensional, and hybrid association rules.

(1) Intra-dimensional association rule: This explores associations among items from only one dimension. For example,

ProdName = "RAM512MB" → ProdName = "HP printer"

involves only the attribute ProdName, and means that people who purchase "RAM512MB" are likely to purchase "HP printer".

(2) Inter-dimensional association rule: This explores rules that involve more than one dimension, where each dimension appears only once in the rule. For example,

Customer.Education = "High School", Product.ProdName = "RAM512MB" → Supplier.City = "Taipei"

contains three dimensions, Customer.Education, Product.ProdName, and Supplier.City, and a value of each dimension occurs only once. This rule means that people whose highest education is "High School" and who purchase "RAM512MB" are likely to get it from a supplier located in "Taipei".

(3) Hybrid association rule: This kind of rule can be regarded as a combination of inter-dimensional and intra-dimensional associations. Values of a dimension can appear more than once in a rule. For example, the following is a hybrid association rule:

Customer.Education = "College", Product.Sub_category = "Acer PC" → Product.Sub_category = "HP printer",

which shows that people whose education level is "College" and who purchase "Acer PC" are also likely to purchase "HP printer".

Consider the example data set given in Table 2 for mining multidimensional association rules. Table 3 shows the characteristics of the three types of multidimensional association rules.

Table 2 Part of multi-dimensional transactions

City       Supplier   Product    Customer
Chicago    ACER       PC         John
Seattle    ASUS       MB         Peter
New York   ASUS       Notebook   Bill
Dallas     HP         Notebook   Sue
Seattle    DELL       LCD        Bill
Seattle    HP         Printer    Sue
…          …          …          …

Table 3 Multi-dimensional association rules discovered from Table 2

Rule Type   Dim.   Overlap Dim.   Rule pattern                   MS    MC
Intra       1      0              Notebook, PC → Mother Board    12%   60%
Inter       ≥2     0              ASUS → Notebook                20%   70%
Hybrid      ≥2     ≥1             Notebook → ASUS, PC            40%   50%
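The three rule types differ only in how many dimensions a rule touches and whether any of them repeats, which is the same distinction Table 3 draws with its "Dim." and "Overlap Dim." columns. A minimal sketch of that test follows; it assumes each rule item is given as an (attribute, value) pair and treats the attribute name as the dimension, and the function name `rule_type` is our own.

```python
from collections import Counter


def rule_type(body, head):
    """Classify a rule whose body and head are lists of (attribute, value) pairs."""
    occurrences = Counter(attr for attr, _ in list(body) + list(head))
    if len(occurrences) == 1:
        return "intra-dimensional"      # one dimension only
    if all(n == 1 for n in occurrences.values()):
        return "inter-dimensional"      # several dimensions, none repeated
    return "hybrid"                     # several dimensions, at least one repeated


# The three example rules from the text.
print(rule_type([("ProdName", "RAM512MB")],
                [("ProdName", "HP printer")]))            # intra-dimensional
print(rule_type([("Customer.Education", "High School"),
                 ("Product.ProdName", "RAM512MB")],
                [("Supplier.City", "Taipei")]))           # inter-dimensional
print(rule_type([("Customer.Education", "College"),
                 ("Product.Sub_category", "Acer PC")],
                [("Product.Sub_category", "HP printer")]))  # hybrid
```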

2.5 Related Work

In this section, we introduce previous work related to our study from three perspectives: work on multidimensional association rule mining, work on specifying an appropriate support, and work on intelligent assistance for the data mining process.

The concept of mining multidimensional association rules from data warehouses or relational databases was introduced by J. Han's research group [9, 10, 11, 14, 32]. They first defined the term multidimensional association rule [32], and combined OLAP and data mining to develop DBMiner [10], a system that supports association rule mining, classification, prediction, and clustering from data cubes. The work of Psaila and Lanzi [27] focuses on exploiting the hierarchy information in data warehouses to discover multidimensional association rules. Their work was further extended by Perng et al. [22, 23], who developed a framework to exploit all possible multidimensional patterns and studied ways to efficiently prune unrelated sets of these patterns. Another work, by Lin et al. [17], built an online mining system called OMARS that is dedicated to mining multidimensional association rules from data warehouses. The main difference from Han's DBMiner is that, instead of combining with traditional OLAP cube technology, OMARS employs a different cube structure, called the OLAM cube, to accelerate pattern discovery.

Since Agrawal et al. established the support-confidence framework of association rule mining, many studies have been devoted to alleviating the tedious work of specifying an appropriate support threshold for recognizing interesting patterns. Some efforts tried to eliminate the need for a support threshold, with or without other measures [5, 16]. Lin and Tseng [18] instead proposed an automated way to specify the support threshold, derived from the user-specified minimum confidence. Another approach, proposed by Zhang et al. [31], uses a polynomial strategy to map a user-specified fuzzy support threshold to an actual minimum support. Hidber [13] first introduced the concept of allowing users to adjust the support threshold interactively during the mining process. Liu et al. [21] proposed a fuzzy matching technique that uses a set of user-specified patterns (the user's feedback) to help the user identify the interesting ones. A similar work, which uses a user's interactive feedback to help that user discover interesting patterns effectively, was conducted by Xin et al. [29]. Our approach provides a user with a favorable support threshold by learning from other users' experience, which can be extracted from the information collected in the mining log.

Although there has been much research on intelligent assistance for the query process or information retrieval [24, 30], very little has addressed the data mining process. Bernstein et al. [4] proposed an intelligent assistant that guides users in choosing appropriate classification methods. The CLIDE query formulation interface, proposed by Petropoulos et al. [24], provides a QBE-like query builder with a coloring scheme that guides the user toward formulating feasible queries. The intention behind our proposed intelligent assistance is similar to that of CLIDE, but ours targets multidimensional association rule mining. In addition, we use ontology and learning technologies to realize this interface.

Chapter 3 The OntoWM System Framework

Data mining is a knowledge-intensive process in that it requires domain-specific knowledge (ontology) [3, 6, 12]. Contemporary data warehousing systems, however, are mostly based on relational databases, which cannot thoroughly describe the organizational relationships between data. If the data warehouse system can provide the relationships between the organizational structures and the semantics of the data, the user can avoid improper data selection and thus avoid mining meaningless or useless rules and patterns. Avoiding such errors further reduces repetition in the mining process and improves its effectiveness.

In [19, 20], our research group proposed a system framework called OntoWM, in which intelligent assistance for user query formulation in multidimensional association mining is incorporated with ontologies. The ontologies provide background knowledge that the current relational star schema does not provide. Additionally, ontologies of domain knowledge and personal user preference can be integrated into the system. The system framework is shown in Figure 5.

Figure 5 The framework of the multidimensional association mining system with intelligent assistance in query formulation.

In this framework, the mining process starts with the formulation of a query, which the user performs interactively with the system. The system checks the correctness and effectiveness of the query with the assistance of the ontologies. A correct query is then fed into the data retrieval engine to fetch the mining data from the data warehouse, and the data are processed by the mining engine to generate the resulting rules. The rules are presented to the user through the user interface. On receiving the mining results, the user decides subjectively whether they are acceptable. If they are not, the user adjusts the mining query and the mining process is performed again until acceptable results are obtained. If they are, the query is stored in the query log.

The ontologies in the system framework include the schema ontology, the domain ontology and the user preference ontology. They are used to help check the semantics of a user's query formulation.

The schema ontology consists of two parts, namely the schema ontology proper and the schema constraint ontology. The schema ontology describes the metadata of the data warehouse, including schema structures, dimension hierarchies and the types of measures along dimensions. Figure 6 is an example of a schema ontology. The schema constraint ontology records the constraints that exist within the schema of a data warehouse for mining association rules and that cannot be expressed by any metadata of the data structure. These constraints among attributes are semantic rather than structural. Figure 7 is an example of a schema constraint ontology. The attributes, classified by dimension, are mirrored on the left-hand and right-hand sides as the entities of the constraint relationships. The relational links connected to the predicates are either unary or binary, depending on the nature of the constraint.

The domain ontology captures the domain knowledge of the mining subject; its contents are specific and adaptable to particular domains. Figure 8 is an example of a domain ontology for 3C products. It captures both classification and composition relationships, which contain information such as what components a PC has, which product belongs to which brand, the compatibility of software with an OS, and the compatibility among hardware.

The user preference ontology collects the patterns of a user's past queries, such as the combinations of grouping attributes and mining items, the parameter settings (i.e., minimum support and minimum confidence), the conditional constraints and the mining results (i.e., the frequent itemsets). Figure 9 [28] is an example of a user preference ontology.

The query log maintains the queries accumulated in the past. Figure 10 shows the structure of the query log. The log is valuable for suggesting the minimum support: it records the mining experience of assorted users and hence benefits users, especially inexperienced ones, by providing them with useful information. By computing the similarity weight between the user-specified query and the queries in the query log, the system finds the queries most similar to the user's mining intention. The support thresholds of those queries are then aggregated, and a favorable minimum support is derived and offered to the user. Originally, the setting of the support threshold in association rule mining depends entirely on the individual user's subjective judgment. With this mechanism, that judgment is supplemented with the experience of many other users. The trial-and-error process of formulating a mining query can thus be shortened, and the mining results become more effective and closer to the user's needs.

Figure 6 Schema ontology. [Figure: a star-schema graph. The legend distinguishes dominate and optional dimensions, dimension roots, attribute nodes, key attributes, hierarchy relations, and additive, semi-additive and non-additive facts. The dimensions shown are Salesman, Product, Customer and Time, with the facts Sale Amount, Sale Quantity, Profit and Cost.]

Figure 7 A schema constraint ontology. [Figure: the attributes of the Salesman (SalesmanID, Name), Product (ProductID, Size, Type, Brand, Category), Customer (CustomerID, Sex, Education, Nationality, Region, City) and Time (Day, Month, Year) dimensions are listed on both the left and the right side and are linked through the predicates Exclusion, Item-only, Decide, Follow and GroupOnly.]

Figure 8 The domain ontology example of 3C production. [Figure: classification and composition relationships among products such as PC (Desktop PC, Notebook), Printer, Hard Disk and Memory, with instances such as IBM 60GB, RAM 512MB, RAM 256MB, HP DeskJet, Epson EPL, Sony VAIO, Gateway GE and IBM TP, and printer consumables such as Ink Cartridge, Toner Cartridge and Photo Conductor.]

Figure 9 An example of user preference ontology. [Figure: an attribute index (Customer ID, Date, Product Name, Gender, Education) linked to recorded grouping-ID sets ($t_G$), mining-attribute sets ($t_M$), where conditions (wc) such as City = 'Taipei', and parameter pairs (ms, mc) such as (60%, 90%), (60%, 85%) and (45%, 80%).]

Figure 10 The structure of query log. [Figure: a relational schema with the tables Cardinality_TAB (Attribute_ID, Name, Cardinality, Upper_Bound, Low_Bound), Attri_Relation (Attribute A, Attribute B), Where_TAB (Where_ID, Attribute, Operator, Variable), Where_Set_TAB (Where_Set_ID, Where_ID), Group_Set_TAB (Group_Set_ID, Group_ID), Mining_Set_TAB (Mining_Set_ID, Mining_ID) and Q_LOG_1 (QID, Group_Set_ID, Mining_Set_ID, Where_Set_ID, MS, MC, Trans_Count, TL_MIN, TL_MAX, TL_AVG, FP_Count, FPL_MIN, FPL_MAX, FPL_AVG, AR_Count, ARL_MIN, ARL_MAX, ARL_AVG).]
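As an illustration of how one row of the Q_LOG_1 table in Figure 10 might be held in memory, the following minimal Python sketch mirrors a subset of its columns. The class name `QueryLogEntry`, the sample values, and the interpretations given in the comments (for example, that FP_Count is the number of frequent patterns found) are assumptions made for illustration; the thesis only lists the column names. The MIN/MAX/AVG length columns are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class QueryLogEntry:
    """One row of Q_LOG_1 (subset of columns; comments are assumptions)."""
    qid: int
    group_set_id: int    # key into Group_Set_TAB (grouping IDs of the query)
    mining_set_id: int   # key into Mining_Set_TAB (interested mining attributes)
    where_set_id: int    # key into Where_Set_TAB (where conditions)
    ms: float            # minimum support used by the query
    mc: float            # minimum confidence used by the query
    trans_count: int     # assumed: number of transactions after grouping
    fp_count: int        # assumed: number of frequent patterns found
    ar_count: int        # assumed: number of association rules found


# Hypothetical entry; all values are made up for illustration.
entry = QueryLogEntry(qid=1, group_set_id=10, mining_set_id=20, where_set_id=30,
                      ms=0.6, mc=0.9, trans_count=5000, fp_count=42, ar_count=17)
```

Keeping the set IDs as plain keys, as the figure does, lets several logged queries share the same grouping, mining or where set.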

Chapter 4 The Method of Minimum Support Suggestion

Setting the parameters of a query for multidimensional association mining is not easy, because there is no clear-cut standard. The setting of the minimum support is especially essential for generating useful information. Different minimum support settings yield different numbers of frequent itemsets and hence influence the rules mined. If the minimum support is set too low, the frequent itemsets and rules become numerous, and recognizing the truly useful knowledge becomes a challenge for the user. If it is set too high, very few frequent itemsets are found, and some useful knowledge might be left out. A reasonable minimum support is therefore important for finding useful knowledge. In practice, users adjust the minimum support back and forth until they converge on a setting they are satisfied with. Fayyad et al. [6] described data mining as a repetitive sequence of search processes, which implies that the efficiency of the mining process depends heavily on the proper setting of the mining query, including the minimum support and the minimum confidence.

4.1 The Effect of Hierarchical Relationships

The hierarchical relationship is important in the analysis of transaction counts, which is based on the information in the schema ontology. Figure 11 shows the hierarchies of the dimension table "Time". The value attached to each node is the cardinality (the number of distinct values) of that attribute in the "Time" dimension table. Attributes at higher levels of a hierarchy contain those at lower levels, so a transaction formed at a higher level contains the transactions formed at the lower levels. The following examples show how the transaction counts change when the concept hierarchy level changes.

Figure 12 is the original transaction distribution, whose grouping ID ($t_G$) is {Customer, Date}; if $t_G$ is changed, the transaction distribution changes as well. Figure 13 shows the change if $t_G$ = {Date}, while Figure 14 shows the corresponding change with respect to $t_G$ = {Month}, a higher hierarchy level than $t_G$ = {Date}. We can therefore see that the transactions at the lower hierarchy levels are subsets of those at the higher levels.

Figure 11 Concept hierarchies of Time dimension. [Figure: hierarchy tree of the Time dimension with node cardinalities: Time Dimension 1, Year 2, Quarter of Year 4, Month of Year 12, Week of Year 53, Day of Month 31, Day of Week 7, Date = Time_ID 730.]
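To make the effect of the grouping ID concrete, the following minimal Python sketch groups the same purchase records into transactions at three granularities. The records, the field names in `FIELDS` and the helper `group_transactions` are hypothetical and are not taken from the OntoWM implementation; the sketch only illustrates that a coarser grouping ID yields fewer, larger transactions.

```python
from collections import defaultdict

# Hypothetical purchase records: (customer, date, month, product).
FIELDS = ("customer", "date", "month", "product")
ROWS = [
    ("C1", "T1", "June", "P1"),
    ("C2", "T1", "June", "P2"),
    ("C1", "T2", "June", "P3"),
    ("C3", "T6", "July", "P3"),
    ("C3", "T6", "July", "P5"),
    ("C2", "T7", "July", "P2"),
]


def group_transactions(rows, grouping):
    """Collect the products of all rows that share the same grouping-ID values."""
    key_idx = [FIELDS.index(g) for g in grouping]
    item_idx = FIELDS.index("product")
    transactions = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in key_idx)
        transactions[key].add(row[item_idx])
    return dict(transactions)


by_customer_date = group_transactions(ROWS, ("customer", "date"))  # finest
by_date = group_transactions(ROWS, ("date",))
by_month = group_transactions(ROWS, ("month",))                    # coarsest

for name, txs in [("customer+date", by_customer_date),
                  ("date", by_date), ("month", by_month)]:
    print(name, len(txs))
```

Because the support of an itemset is measured against the number of transactions, the same minimum support value can behave very differently at each of these granularities.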

Figure 12 Data distribution with original data granularity. [Figure: purchases of products P1 to P5 by customers C1 to C8 (students and workers) on dates T1 to T10 in June and July, grouped by customer and date.]

Figure 13 Data distribution with granularity by Date. [Figure: the same purchases grouped by date (T1 to T10) only.]

Figure 14 Data distribution with granularity by Month. [Figure: the same purchases grouped by month (June, July).]

4.2 The Method of Minimum Support Calculation

Our strategy for suggesting a reasonable range of minimum support is to draw on the experience recorded in the query log, which reflects the knowledge of experienced users. To obtain a favorable minimum support range from the query log, we must be able to measure the similarity of two queries. The $t_G$, $t_M$ and/or wc formulated by a user are checked against the queries in the query log for similarity. The range of minimum support suggested to the user is derived from the top N queries that are most similar to the $t_G$, $t_M$ and/or wc of the user-formulated query. The matching degrees of $t_G$, $t_M$ and wc are each computed in a different way and then aggregated to obtain the degree of query similarity. The computational details are described in the following sections.

4.2.1 Matching Transaction ID ($t_G$)

Users can select multiple grouping IDs ($t_G$), either from the same dimension or from different dimensions, so the Cartesian product is used to calculate the grouping ID factor. Let $G_a$ be the $t_G$ formulated by the user and $G_b$ be the $t_G$ from the user preference ontology. Suppose $G_a = \{g_{a1}, g_{a2}, \ldots, g_{ai}\}$ and $G_b = \{g_{b1}, g_{b2}, \ldots, g_{bj}\}$, and let $dist(g_{ax})$ and $dist(g_{bw})$ denote the numbers of distinct values of the attributes $g_{ax}$ and $g_{bw}$ respectively, where $1 \le x \le i$ and $1 \le w \le j$. The matching weight $T_G$ of each pair $g_{ax}$, $g_{bw}$ is defined as follows:

$$T_G(g_{ax}, g_{bw}) = \begin{cases} 0 & \text{if } g_{ax} \text{ and } g_{bw} \text{ have no hierarchical relationship} \\ \dfrac{\min(dist(g_{ax}), dist(g_{bw}))}{\max(dist(g_{ax}), dist(g_{bw}))} & \text{otherwise} \end{cases}$$

The matching degree of $G_a$ and $G_b$, $match_G(G_a, G_b)$, is computed as follows:

$$match_G(G_a, G_b) = \frac{\sum_{x=1}^{i} \sum_{w=1}^{j} T_G(g_{ax}, g_{bw})}{|G_a| \times |G_b|}$$

For example, suppose we have $G_a$ = {Cust_ID, Date} and $G_b$ = {Cust_ID, Month}, where Date and Month have a hierarchical relationship, dist(Cust_ID) = 450, dist(Date) = 365 and dist(Month) = 12. As the example shows, an attribute paired with itself falls under the second case and receives weight 1. Then $match_G$({Cust_ID, Date}, {Cust_ID, Month}) is computed as follows:

$$match_G = \frac{\dfrac{\min(450, 450)}{\max(450, 450)} + 0 + 0 + \dfrac{\min(365, 12)}{\max(365, 12)}}{2 \times 2} = \frac{1 + 12/365}{4} \approx 0.258$$

4.2.2 Matching Interested Mining Attributes ($t_M$)

The matching function is designed using the concept of set intersection. The mining items, which decide how the data are projected onto transactions, may have hierarchical relationships. For example, Figure 15 shows the data distribution for $t_G$ = {Date, Customer} and $t_M$ = {Product, Subcategory}. The transactions are the same, but the supports of Product and Subcategory differ because the two attributes have a hierarchical relationship.
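Returning to the grouping-ID match of Section 4.2.1, the computation can be sketched in a few lines of Python. This is an illustrative sketch, not the OntoWM code: the function names `t_g` and `match_g`, the `dist` dictionary and the `related` set of hierarchically related pairs are all hypothetical, and treating an attribute as related to itself is an assumption read off the worked example above.

```python
from itertools import product


def t_g(a, b, dist, related):
    """Matching weight T_G of two grouping attributes.

    `dist` maps an attribute to its number of distinct values; `related`
    holds frozensets of attribute pairs that share a concept hierarchy.
    An attribute is assumed to be related to itself.
    """
    if a != b and frozenset((a, b)) not in related:
        return 0.0
    return min(dist[a], dist[b]) / max(dist[a], dist[b])


def match_g(ga, gb, dist, related):
    """Average T_G over the Cartesian product of the two grouping-ID sets."""
    total = sum(t_g(a, b, dist, related) for a, b in product(ga, gb))
    return total / (len(ga) * len(gb))


# Values from the worked example in Section 4.2.1.
dist = {"Cust_ID": 450, "Date": 365, "Month": 12}
related = {frozenset(("Date", "Month"))}
score = match_g(["Cust_ID", "Date"], ["Cust_ID", "Month"], dist, related)
print(round(score, 3))
```

The min/max ratio penalizes pairs whose granularities differ greatly, so Date versus Month contributes only 12/365 even though the two attributes are related.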

Figure 15 Data distribution with $t_M$ = {Product, Product.Subcategory}. [Figure: the purchases of Figure 12 (customers C1 to C8, dates T1 to T10 in June and July), with each product P1 to P5 annotated with its subcategory S1 to S3.]

Let $M_a$ and $M_b$ be the $t_M$ formulated by the user and the $t_M$ from the query log respectively. Suppose $M_a = \{m_{a1}, m_{a2}, \ldots, m_{ai}\}$ and $M_b = \{m_{b1}, m_{b2}, \ldots, m_{bj}\}$. The matching weight of $M_a$ and $M_b$ is defined as follows:

$$match_M(M_a, M_b) = \frac{|M_a \cap M_b|}{|M_a \cup M_b|}$$

For example, suppose we have $M_a$ = {ProdName, Education} and $M_b$ = {ProdName, City, Age}. Then $match_M$({ProdName, Education}, {ProdName, City, Age}) is as follows:

$$\frac{|\{ProdName, Education\} \cap \{ProdName, City, Age\}|}{|\{ProdName, Education\} \cup \{ProdName, City, Age\}|} = \frac{1}{4}$$
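The ratio above is the Jaccard coefficient of the two attribute sets. A minimal Python sketch follows; the function name `match_m` is hypothetical, and returning 0 when both sets are empty is an assumption, since the thesis does not discuss that case.

```python
def match_m(ma, mb):
    """Jaccard coefficient |Ma ∩ Mb| / |Ma ∪ Mb| of two mining-attribute sets."""
    ma, mb = set(ma), set(mb)
    union = ma | mb
    if not union:  # assumption: two empty sets are given a match of 0
        return 0.0
    return len(ma & mb) / len(union)


# Worked example from the text: one shared attribute out of four in total.
print(match_m({"ProdName", "Education"}, {"ProdName", "City", "Age"}))
```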

4.2.3 Matching "where" Condition (wc)

The format of a "where" condition is

Where: (conditional expression) [, (conditional expression)] ... [, (conditional expression)]

The format of a conditional expression is

(Attribute Name) Op (Value),

where Op ∈ {'<=', '>=', '=', '≠'} is the relational operator. For example,

Where: Country="Italy", Sex="Female"

confines the data to Italian females. Given the content of a "where" condition, two similarity matches are necessary: the first is the attribute match and the second is the value match.

(1) attribute match

The attribute match of a "where" condition is computed in the same way as the matching of the interested mining attributes. Let $WA_a$ be the set of attributes in the wc formulated by the user and $WA_b$ be the set of attributes in the wc from the query log. Suppose $WA_a = \{wa_{a1}, wa_{a2}, \ldots, wa_{ai}\}$ and $WA_b = \{wa_{b1}, wa_{b2}, \ldots, wa_{bj}\}$. The matching weight of $WA_a$ and $WA_b$ is defined as follows:

$$match_{WA}(WA_a, WA_b) = \frac{|WA_a \cap WA_b|}{|WA_a \cup WA_b|}$$
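As a rough illustration of the attribute match, the sketch below first splits a "where" condition into (attribute, operator, value) triples and then compares the attribute sets. The regular expression, the function names `parse_where` and `match_wa`, and the naive splitting on commas (which would break on values that contain commas) are assumptions for illustration; the thesis does not specify how OntoWM parses conditions.

```python
import re

# Attribute name, one of the four relational operators, then the value.
COND = re.compile(r'^\s*([\w.]+)\s*(<=|>=|=|≠)\s*(.+?)\s*$')


def parse_where(clause):
    """Split 'A=1, B>=2' into a list of (attribute, operator, value) triples."""
    conditions = []
    for part in clause.split(","):  # naive: assumes no commas inside values
        m = COND.match(part)
        if m is None:
            raise ValueError(f"cannot parse condition: {part!r}")
        attr, op, value = m.groups()
        conditions.append((attr, op, value.strip('"')))
    return conditions


def match_wa(conds_a, conds_b):
    """Jaccard coefficient of the attribute sets of two parsed conditions."""
    wa_a = {attr for attr, _, _ in conds_a}
    wa_b = {attr for attr, _, _ in conds_b}
    union = wa_a | wa_b
    return len(wa_a & wa_b) / len(union) if union else 0.0


user_wc = parse_where('Country="Italy", Sex="Female"')
logged_wc = parse_where('Country="Italy", Age>=30')
print(user_wc, match_wa(user_wc, logged_wc))
```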

(2) value match

Let $W_a$ be the wc formulated by the user and $W_b$ be the wc from the query log. Suppose $W_a = \{wc_{a1}, wc_{a2}, \ldots, wc_{ai}\}$ and $W_b = \{wc_{b1}, wc_{b2}, \ldots, wc_{bj}\}$, where $wc_{ax} = wa_{ax}\ Op_{ax}\ v_{ax}$ and $wc_{bz} = wa_{bz}\ Op_{bz}\ v_{bz}$, $1 \le x \le i$, $1 \le z \le j$; $wa_{ax}$ is the attribute, $Op_{ax}$ is the relational operator, and $v_{ax}$ and $v_{bz}$ denote the "values" of the conditional expressions. To compute V, the matching degree of the "values" in two conditional expressions, the following cases are considered:

case 1: $wc_{ax} = wc_{bz}$, V = 1;
case 2: $wa_{ax} \ne wa_{bz}$, V = 0;
case 3: $wa_{ax} = wa_{bz}$ and ($Op_{ax} \ne Op_{bz}$ or $v_{ax} \ne v_{bz}$), V = CV.

CV, the complex conditional "value", is defined as follows:

$$CV(wc_{ax}, wc_{bz}) = \begin{cases} \dfrac{\mu - \ell}{dist(wa_{ax})} & \mu \ge \ell \\ 0 & \mu \le \ell \end{cases}$$

where $dist(wa_{ax})$ denotes the number of distinct values of $wa_{ax}$. The numerator is the degree of overlap between the "values" of the conditional expressions $wc_{ax}$ and $wc_{bz}$:

$$\mu = \max(Op_{ax}\, v_{ax},\ Op_{bz}\, v_{bz}), \qquad \ell = \min(Op_{ax}\, v_{ax},\ Op_{bz}\, v_{bz})$$

The values of μ and ℓ depend on the relational operators; the order of the two operators is not important. Table 4 shows all cases of μ and ℓ when $wa_{ax} = wa_{bz}$, where v with a subscript denotes the "value" of the conditional expression that has the corresponding operator (le for <=, ge for >=, eq for =, ne for ≠). A prime marks the value from the second expression when both operators are the same.

Table 4 All cases of μ and ℓ

| Op1 | Op2 | μ | ℓ |
|---|---|---|---|
| <= | <= | $\min(v_{le}, v'_{le})$ | 0 |
| >= | >= | $dist(wa_{ax})$ | $\max(v_{ge}, v'_{ge})$ |
| = | = | 0 | 0 |
| ≠ | ≠ | $dist(wa_{ax}) - 2$ | 0 |
| = | ≠ | 0 | 0 |
| >= | = | 1 if $v_{eq} > v_{ge}$; 0 if $v_{eq} < v_{ge}$ | 0 |
| >= | ≠ | $v_{ge} - 1$ if $v_{ne} > v_{ge}$; $v_{ge}$ if $v_{ne} < v_{ge}$ | $v_{ge}$ |
| <= | = | 1 if $v_{eq} < v_{le}$; 0 if $v_{eq} > v_{le}$ | 0 |
| <= | ≠ | $v_{le}$ if $v_{ne} > v_{le}$; $v_{le} - 1$ if $v_{ne} < v_{le}$ | 0 |
| <= | >= | $v_{le} - v_{ge}$ if $v_{le} > v_{ge}$; 0 otherwise | 0 |

The matching degree of the conditional "values" is defined as follows:

$$match_W(W_a, W_b) = \frac{\sum_{x=1}^{i} \sum_{z=1}^{j} V(wc_{ax}, wc_{bz})}{match_{WA}(WA_a, WA_b)}$$

For example, if $W_a$ = {Age < 60-70} and $W_b$ = {Age > 20-30, Salary > 50000}, and Age has 10 distinct range values, then we have

$$match_W(W_a, W_b) = \frac{0.4 + 0}{0.5} = 0.8$$

4.2.4 Aggregating Query Similarity and Suggesting the Minimum Support Range

Let $Q_a$ be the user-defined query and $Q_b$ be a query in the user preference ontology. Suppose the $t_G$, $t_M$ and wc of $Q_a$ and $Q_b$ are as follows:

$Q_a$ = <$t_G$: $G_a$, $t_M$: $M_a$, wc: $W_a$>
$Q_b$ = <$t_G$: $G_b$, $t_M$: $M_b$, wc: $W_b$>

The overall query similarity between $Q_a$ and $Q_b$ is defined as follows:

$$QSimilarity(Q_a, Q_b) = match_G(G_a, G_b) \times match_M(M_a, M_b) \times match_W(W_a, W_b)$$

Through this similarity comparison, the top N similar queries are obtained from the query log. The final range of minimum support offered to the user is computed by the following steps:

1. sort the supports of the top N queries;

2. average the smaller half of the N supports to obtain the lower bound;

3. average the larger half of the N supports to obtain the upper bound;

4. validate the lower bound against the smallest possible value, which is 1/n, where n is the number of transactions after grouping. If the lower bound falls below the smallest possible value, the smallest possible value becomes the lower bound;

5. validate the upper bound against the largest possible value, which is the maximum frequency among all the one-itemsets. If the upper bound exceeds the maximum possible support, the maximum possible support becomes the upper bound.

The result is a favorable range of minimum support that is close to the user's mining intention. The system provides this information to the user interactively while he or she is formulating a query, which helps the user adjust the support threshold toward better mining results. The support threshold setting is then no longer purely subjective, since it also draws on the mining experience of assorted users recorded in the query log. The trial-and-error process of formulating a mining query becomes more efficient, and the mining results are more effective and more likely to meet the user's needs.
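The aggregation and the five steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the OntoWM implementation: the log entries are hypothetical dictionaries that already carry their three matching degrees (`g`, `m`, `w`) and their recorded minimum support (`ms`), and for an odd N the median is assumed to count toward both halves, a detail the thesis does not specify.

```python
def q_similarity(match_g, match_m, match_w):
    """Overall query similarity: the product of the three matching degrees."""
    return match_g * match_m * match_w


def top_n_supports(log, n):
    """Minimum supports of the n logged queries most similar to the user's."""
    ranked = sorted(log,
                    key=lambda q: q_similarity(q["g"], q["m"], q["w"]),
                    reverse=True)
    return [q["ms"] for q in ranked[:n]]


def suggest_range(supports, n_transactions, max_item_support):
    """Steps 1-5: derive a (lower, upper) minimum-support range."""
    if not supports:
        raise ValueError("no similar queries to learn from")
    ordered = sorted(supports)                    # step 1
    k = len(ordered)
    lower_half = ordered[:(k + 1) // 2]           # assumption: median shared
    upper_half = ordered[k // 2:]                 # when k is odd
    lower = sum(lower_half) / len(lower_half)     # step 2
    upper = sum(upper_half) / len(upper_half)     # step 3
    lower = max(lower, 1.0 / n_transactions)      # step 4: not below 1/n
    upper = min(upper, max_item_support)          # step 5: not above max
    return lower, upper


# Hypothetical query log with precomputed matching degrees.
log = [
    {"g": 1.0, "m": 1.0, "w": 1.0, "ms": 0.05},
    {"g": 0.5, "m": 1.0, "w": 1.0, "ms": 0.10},
    {"g": 0.25, "m": 0.5, "w": 1.0, "ms": 0.02},
    {"g": 0.9, "m": 0.9, "w": 1.0, "ms": 0.08},
    {"g": 0.0, "m": 1.0, "w": 1.0, "ms": 0.50},
]
supports = top_n_supports(log, 4)
lower, upper = suggest_range(supports, n_transactions=1000,
                             max_item_support=0.085)
print(supports, lower, upper)
```

In this made-up log the dissimilar query with support 0.50 is excluded by the top-N cut, and the upper bound is clamped to the largest single-item support.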

Chapter 5 Experimental Evaluation for Minimum Support Suggestion

To evaluate the effectiveness of the proposed support suggestion mechanism, we conducted experiments from two perspectives: how far the mechanism reduces the number of times a user has to re-mine, and how far it reduces the total mining time. We first describe the environment of our experiments, including the dataset, the mining log and the multidimensional test queries used. We then present and discuss the experimental results.

5.1 Experimental Environment

In these experiments we adopted Foodmart2000, an example data warehouse provided by Microsoft, which stores consolidated operational data reflecting the daily selling and inventory behavior of a supermarket company. The schema of Foodmart2000 consists of two main fact tables, Sales_fact and Inventory_fact, and six dimension tables: Product, Time, Store, Warehouse, Customer and Promotion. A detailed description of Foodmart2000 is given in Table 5.

Based on the Foodmart2000 data schema, we designed three test queries of different degrees of complexity: easy, medium and hard. The first query, shown in Table 6, is the simplest one; it tries to discover interesting intra-associations between products purchased daily by different customers.

Table 5 Data characteristics of Foodmart2000

| Table Name | Table Type | Table Size |
|---|---|---|
| Sales_fact | Fact table | 269720 |
| Inventory_fact | Fact table | 11352 |
| Customer | Dimension table | 10281 |
| Promotion | Dimension table | 1864 |
| Product | Dimension table | 1560 |
| Time_by_day | Dimension table | 730 |
| Store | Dimension table | 25 |
| Warehouse | Dimension table | 24 |

Table 6 Test query 1 (easy)

| Mining multidimensional associations | |
|---|---|
| [Where] | |
| Grouping_ID | <Customer.ID, Time.ID> |
| Mining_item | <Product.Name> |
| From | <Foodmart2000> |
| Confidence Threshold | 0.5 |

The second query, shown in Table 7, is of medium complexity. It expresses the user's intention to discover whether there are interesting inter-associations in which customers with a "High School Degree" education are attracted to buy a certain product subcategory by a certain type of promotion, with a data granularity of every week and every customer. The last query, shown in Table 8, is the most complicated one. It aims to discover whether there is any hybrid association showing that products of a certain subcategory, with cost ≤ 15 and sales ≥ 30, sell easily in a certain month, with a data granularity of every warehouse and every day.

Table 7 Test query 2 (medium)

| Mining multidimensional associations | |
|---|---|
| [Where] | <Customer.Education = High School Degree> |
| Grouping_ID | <Customer.ID, Time.Week> |
| Mining_item | <Product.Subcategory, Promotion.type> |
| From | <Foodmart2000> |
| Confidence Threshold | 0.5 |

Table 8 Test query 3 (hard)

| Mining multidimensional associations | |
|---|---|
| [Where] | <Inventory_fact.cost ≤ 15, Inventory_fact.sales ≥ 30> |
| Grouping_ID | <Warehouse.ID, Time_by_Day.ID> |
| Mining_item | <Product.Subcategory, Time_by_Day.Month> |
| From | <Foodmart2000> |
| Confidence Threshold | 0.5 |

5.2 Experimental Results

We asked 23 users to formulate and execute each of the three queries. The users had to change the support threshold repeatedly until they were satisfied with the generated rules. During the execution of each user query, a background program recorded the number of re-mining runs and the time the user spent on obtaining satisfactory association rules.

Figure 16 summarizes the average number of trial-and-error runs each user needed to reach satisfactory rules for each type of query, with and without the proposed suggestion mechanism. The results show that the proposed mechanism significantly reduces the number of re-mining runs.

Figure 16 The average number of re-mining runs with and without using the proposed suggestion mechanism. [Figure: bar chart titled "Reduce Try-and-Error Frequency"; y-axis: Avg Re-mining Frequency (0 to 40); x-axis: Test Query Type (Easy, Medium, Hard); series: without suggestion, with suggestion.]

Figure 17 depicts the average elapsed time each user spent on each type of query to obtain satisfactory association rules, with and without the proposed suggestion mechanism. Again, it is clear that our mechanism effectively reduces the execution time wasted on setting the support threshold.

Figure 17 The average elapsed time spent on obtaining satisfied association rules with and without using the proposed suggestion mechanism. [Figure: bar chart titled "Reduce Total Mining Time"; y-axis: Avg Mining Time (sec) (0 to 160); x-axis: Test Query Type (Easy, Medium, Hard); series: without suggestion, with suggestion.]

Chapter 6 Implementation of Minimum Support Suggestion in OntoWM

In this chapter we describe the functions of the intelligent query assistance, which help users formulate appropriate queries, especially with an appropriate minimum support range. Users can therefore obtain the mining results they want without going through a long trial-and-error process. The process of intelligent query assistance is shown in Figure 18. Section 6.1 introduces the intelligent query assistance that helps users formulate appropriate queries. Section 6.2 presents the implementation of the support suggestion mechanism, and the whole mining process with OntoWM is presented in Section 6.3.

Figure 18 The flowchart of a query formulation and assistance process. [Figure: flowchart. User query formulation; syntax check (if it fails, return to query formulation); semantic check using the schema and constraint ontologies (if it fails, return to query formulation); parameter suggestion using the query log and user preference ontology; the outcome is a correct query for the mining process.]

6.1 Intelligent Query Assistance in OntoWM

To alleviate the difficulties caused by the complexity of multidimensional association query formulation, we use a scheme similar to graphical query by example, referred to as GQBE [26]. The query clauses are shown in the GUI with syntax and semantic checking, which assists users in expressing their mining intention more easily even if they are not familiar with the query syntax. We build the syntax into the user interface design to avoid errors. Figure 19 shows how the query syntax is incorporated into the user interface.

| Mining multidimensional associations | |
|---|---|
| [Where] | <Customer.Gender ≠ female> |
| Group_ID | <Time.ID, Customer.ID> |
| Mining_item | <Product.Name> |

Figure 19 The incorporation of syntax with query formulation interface

Figure 20 shows that the dimensions of the user-specified database server are displayed automatically in a tree view, from which the user can select the grouping ID ($t_G$) and the interested mining attributes ($t_M$) by drag-and-drop. As soon as the system obtains the user-selected $t_G$ and $t_M$, it checks their syntactic and semantic correctness internally. If the selected combination of grouping ID and interested mining attributes is incorrect or improper, the system prompts a warning or error message so that the user can take proper action. If a where condition is needed, the user clicks the "Where conditions set" button; the screen in Figure 21 then appears, and the where condition can be set in detail.

The semantic information is extracted from the schema ontology and the schema constraint ontology. The intelligent query assistance executes semantic checking to verify that the mining request is reasonable, and responds with error or warning messages whenever a setting violates the constraints defined in the ontology. Figure 22 to Figure 25 show some example situations.
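The kind of semantic checking illustrated by Figures 22 to 25 might be sketched as below. Everything in this block is hypothetical: the constraint tables `ITEM_ONLY`, `GROUP_ONLY`, `EXCLUSIVE` and `PARENT`, the attributes placed in them, and the function `check_query` are invented for illustration and do not reproduce the constraints actually stored in the OntoWM ontologies. In particular, checking the concept-hierarchy rule only among grouping IDs is an assumption; the thesis does not state where that rule applies.

```python
from itertools import combinations

# Hypothetical constraint tables; the attribute names are illustrative only.
ITEM_ONLY = {"Product.Name"}             # usable only as a mining item
GROUP_ONLY = {"Customer.CustomerID"}     # usable only as a grouping ID
EXCLUSIVE = {frozenset(("Customer.Region", "Customer.City"))}
PARENT = {"Time.Day": "Time.Month", "Time.Month": "Time.Year"}  # child -> parent


def ancestors(attr):
    """All attributes above `attr` in its concept hierarchy."""
    found = set()
    while attr in PARENT:
        attr = PARENT[attr]
        found.add(attr)
    return found


def check_query(t_g, t_m):
    """Return warning messages for a grouping-ID set and a mining-item set."""
    warnings = []
    for a in t_g:
        if a in ITEM_ONLY:
            warnings.append(f"{a} is a mining-only attribute")
    for a in t_m:
        if a in GROUP_ONLY:
            warnings.append(f"{a} is a grouping-only attribute")
    for a, b in combinations(sorted(set(t_g) | set(t_m)), 2):
        if frozenset((a, b)) in EXCLUSIVE:
            warnings.append(f"{a} and {b} are mutually exclusive")
    # Assumption: the hierarchy rule is checked among grouping IDs only.
    for a, b in combinations(sorted(t_g), 2):
        if b in ancestors(a) or a in ancestors(b):
            warnings.append(f"{a} and {b} belong to the same concept hierarchy")
    return warnings


print(check_query(["Time.Day", "Time.Year"], ["Product.Name"]))
```

Returning a list of messages instead of raising on the first violation matches the interactive style described above, where the interface can show every warning at once.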

Figure 20 User interface for query formulation.

Figure 21 Operation interface of “Where” condition.

Figure 22 Example on warning the choice of mining only attribute.

Figure 23 Example on warning the choice of grouping only attribute.

Figure 24 Example on warning the inappropriate choice of two exclusive attributes.

Figure 25 Example on warning the inappropriate choice of two attributes with concept hierarchy.

6.2 Intelligent Support Suggestion in OntoWM

When a user's query passes the above checking process, it is ready to be given a minimum support and a minimum confidence. If either is set improperly, the mining results are affected and the user needs more effort or time to obtain satisfactory results. The support suggestion mechanism proposed in Chapter 4 is implemented here to recommend an initial minimum support to the user. The query log keeps the history of user queries, from which the support suggestion mechanism finds the Top-K similar queries, generates the suggested minimum support range and feeds it back to the user. Figure 26 shows the interface of the minimum support range suggestion.

Figure 26 The support suggestion method feeds back the minimum support range.

Chapter 7 Conclusions and Future Work

7.1 Conclusions

To provide the user with an automated support threshold suggestion for multidimensional association mining, we have proposed in this thesis an intelligent minimum support suggestion framework that is integrated with the query log. We described the construction and use of the query log, which maintains the queries used in the past, and presented the methodology for comparing query similarity in detail. Using this methodology, the system finds the queries in the query log that are most similar to the user-specified query, aggregates them and obtains a favorable support range for the user's reference. With the proposed method, the setting of the support threshold is no longer purely subjective but also incorporates knowledge from other users' experience. This improves the efficiency of the user's query formulation process, and the resulting rules tend to be closer to the user's requirements.

The support suggestion mechanism is part of the intelligent query assistant framework, and the intelligent query assistant is included in the user interface to reduce the user's burden in query formulation. In this way, the user can focus on analyzing the mining results without paying too much attention to the query. In this thesis we have implemented the intelligent query assistant on top of our OntoWM system. With its help, the user only needs to select the attributes he or she wants to mine, without worrying about syntax checking. With the schema ontology and the attribute constraint ontology supporting the intelligent query assistant, the user can formulate a reasonable mining query without a deep understanding of the schema of the data warehouse.

7.2 Future Work

The results accomplished in this thesis are preliminary steps in our research project toward an ontology-integrated, interactive mining system. Our implementation is based on a local database only, and the proposed ontologies are also local to our environment. Building a more sophisticated system that can incorporate globally sharable ontologies remains a challenge.

(59) References [1]. R. Agrawal and R. Srikant, “Fast algorithm for mining association rules in large databases”, in Research Report RJ 893, IBM Almaden Research Center, San Jose, CA, 1994.. [2]. R. Agrawal and R. Srikant, “Fast algorithm for mining association rules”, in Proceedings of International Conference of Very Large Databases, pp. 487-499, 1994.. [3]. J.M. Aronis, F.J. Provost, and B.G. Buchanan, “Exploiting background knowledge in automated discovery,” in Proceeding of 2nd Intl. Conf. Knowledge Discovery and Data Mining, 1996.. [4]. A. Bernstein, F. Provost, and S. Hill, “Toward intelligent assistance for a data mining process: An ontology-based approach for cost-effective classification,” IEEE Trans. Knowledge and Data Engineering, Vol. 17, No. 4, pp. 503-518, 2005.. [5]. E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang, “Finding interesting associations without support pruning,” in Proceedings of IEEE International Conference on Data Engineering, pp. 489–499, 2000.. [6]. U. Fayyad, P.S. Gregory, and S. Padhraic, “The KDD process for extracting useful knowledge from volumes of data,” Communications of the ACM, Vol. 39, No. 11, 1996, pp. 27–34.. [7]. K. Fukunaga and P.M. Narendra, “A branch and bound algorithm for computing k-nearest neighbors,” IEEE Transactions on Computers, Vol. C-24, No. 7, pp. 750-753, 1975. 49.

[8] T.R. Gruber, “A translation approach to portable ontology specifications,” Knowledge Acquisition, Vol. 5, pp. 199-220, 1993.
[9] J. Han, “OLAP mining: An integration of OLAP with data mining,” in Proceedings of IFIP Conference on Data Semantics, pp. 1-11, 1997.
[10] J. Han et al., “DBMiner: A system for data mining in relational databases and data warehouses,” in Proceedings of 1997 Conference of the Centre for Advanced Studies on Collaborative Research, pp. 250-255, 1997.
[11] J. Han, L.V.S. Lakshmanan, and R.T. Ng, “Constraint-based multidimensional data mining,” IEEE Computer, Vol. 32, No. 8, pp. 46-50, 1999.
[12] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[13] C. Hidber, “Online association rule mining,” in Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 145-156, 1999.
[14] M. Kamber, J. Han, and J.Y. Chiang, “Metarule-guided mining of multi-dimensional association rules using data cubes,” in Proceedings of 1997 International Conference on Knowledge Discovery and Data Mining, pp. 207-210, 1997.
[15] R. Kimball, L. Reeves, M. Ross, and W. Thornthwaite, The Data Warehouse Lifecycle Toolkit, John Wiley & Sons, New York, 1998.
[16] J. Li and X. Zhang, “Efficient mining of high confidence association rules without support thresholds,” in Proceedings of 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases, 1999.
[17] W.Y. Lin, J.H. Su, and M.C. Tseng, “OMARS: The framework of an online multi-dimensional association rules mining system,” in Proceedings of 2nd International Conference on Electronic Business, 2002.
[18] W.Y. Lin and M.C. Tseng, “Automated support specification for efficient mining of interesting association rules,” Journal of Information Science, Vol. 32, No. 3, pp. 238-250, 2006.
[19] W.Y. Lin, C.A. Wu, and M.C. Tseng, “Ontology-incorporated mining of association rules in data warehouse,” in Proceedings of 11th Conference on Artificial Intelligence and Applications, 2006.
[20] W.Y. Lin, C.A. Wu, and C.C. Wu, “Ontology-based query formulation for mining associations from data warehouses,” in Proceedings of Multimedia and Networking Systems Conference, 2006.
[21] B. Liu, W. Hsu, L.F. Mun, and H.Y. Lee, “Finding interesting patterns using user expectations,” IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 6, pp. 817-832, 1999.
[22] C.S. Perng, H. Wang, S. Ma, and J.L. Hellerstein, “Farm: A framework for exploring mining spaces with multiple attributes,” in Proceedings of 1st IEEE International Conference on Data Mining, pp. 449-456, 2001.
[23] C.S. Perng, H. Wang, S. Ma, and J.L. Hellerstein, “User-directed exploration of mining space with multiple attributes,” in Proceedings of 2nd IEEE International Conference on Data Mining, pp. 394-403, 2002.
[24] M. Petropoulos, A. Deutsch, and Y. Papakonstantinou, “Interactive query formulation over web service-accessed sources,” in Proceedings of 2006 ACM SIGMOD International Conference on Management of Data, pp. 253-264, 2006.
[25] G. Piatetsky-Shapiro, “Discovery, analysis, and presentation of strong rules,” in Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W.J. Frawley, editors, pp. 229-238, Cambridge, MA: AAAI/MIT Press, 1991.
[26] F. Polat, A. Cosar, and R. Alhajj, “Semantic information-based alternative plan generation for multiple query optimization,” Information Sciences, Vol. 137, pp. 103-133, 2001.
[27] G. Psaila and P.L. Lanzi, “Hierarchy-based mining of association rules in data warehouses,” in Proceedings of ACM Symposium on Applied Computing, pp. 307-312, 2000.
[28] C.A. Wu, W.Y. Lin, C.L. Jiang, and C.C. Wu, “Suggesting the favorable minimum support for multidimensional association mining using user preference ontology,” in Proceedings of IEEE International Conference on Granular Computing, 2009.
[29] D. Xin, X. Shen, Q. Mei, and J. Han, “Discovering interesting patterns through user's interactive feedback,” in Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 773-778, 2006.
[30] S.Y. Yang and C.S. Ho, “Ontology-supported user models for interface agents,” in Proceedings of 4th Conference on Artificial Intelligence and Applications, pp. 248-253, 1999.
[31] S. Zhang, X. Wu, C. Zhang, and J. Lu, “Computing the minimum-support for mining frequent patterns,” Knowledge and Information Systems, Vol. 15, No. 2, pp. 233-257, 2008.
[32] H. Zhu, “On-line analytical mining of association rules,” Master’s Thesis, Simon Fraser University, Canada, Dec. 1998.
[33] OWL Web Ontology Language Use Cases and Requirements, http://www.w3.org/TR/webont-req/, 2004.
