high.
To deal with this disadvantage, we introduce an iterative missing-value completion method that fully infers the missing attribute values by combining an iterative mechanism with data mining techniques. The method uses the RAR support criterion [11] to extract useful association rules for inferring the missing values iteratively. It consists of three phases. The first phase uses the association rules mined from the original incomplete dataset to roughly complete the missing values. The second phase reduces the minimum support to gather more association rules from the original incomplete dataset and iteratively completes the values left missing after phase 1 until no missing values remain. The third phase uses the association rules from the completed dataset to correct the values that have been filled in, repeating until the filled-in values converge. Experiments on two datasets are also reported to show the performance of the proposed approach.
However, the fuzzy association rules derived in that way are not complete, as some possible fuzzy association rules might be missing. This paper proposes a new fuzzy data-mining algorithm for extracting all possible fuzzy association rules from transactions stored as quantitative values. The proposed algorithm can derive a more complete set of rules but needs more computation time than the previous method. A trade-off thus exists between the computation time and the completeness of the rules. Choosing an appropriate mining method therefore depends on the requirements of the application domain.
2.3.1 SOM
Kohonen proposed SOM in 1980. It is an unsupervised two-layer network that can recognize a topological map from a random starting point. With SOM we can cluster an enterprise's customers, products, suppliers, etc. According to the characteristics of the different clusters, different marketing strategies may be adopted by making use of the corresponding discovered association rules. In an SOM network, input nodes and output nodes are fully connected with each other. Each input node contributes to each output node with a weight. Figures 3 and 4 show the network structure and the flow chart of the SOM training procedure, respectively. In our system the user can assign the number of output nodes (cluster number), the learning rate, the radius rate, the convergence error rate, etc.
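The training loop described above can be sketched as follows. This is a minimal illustration and not the authors' system: the one-dimensional output layer, the linear decay of the learning rate and radius, the Gaussian neighbourhood, and the fixed number of epochs (used here in place of a convergence error rate) are all assumptions, as are the names `train_som` and `best_matching_unit`.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the output node whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda j: sum((wk - xk) ** 2 for wk, xk in zip(weights[j], x)),
    )


def train_som(data, n_outputs, epochs=50, learning_rate=0.5, radius=1.0, seed=0):
    """Train a 1-D self-organizing map; returns one weight vector per output node."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Every input node is connected to every output node through a weight.
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_outputs)]
    for epoch in range(epochs):
        # Assumed schedule: shrink learning rate and radius linearly over time.
        decay = 1.0 - epoch / epochs
        lr = learning_rate * decay
        rad = max(radius * decay, 1e-9)
        for x in data:
            winner = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Gaussian neighbourhood measured on the output-node index.
                influence = math.exp(-((j - winner) ** 2) / (2 * rad ** 2))
                for k in range(dim):
                    w[k] += lr * influence * (x[k] - w[k])
    return weights
```

After training, `best_matching_unit(weights, record)` gives the cluster index of a record, which is the grouping that later rule discovery would be run on.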
$$f(V^{(j)}) = W_{CAR} \cdot CAR(V^{(j)}) - W_V \cdot |V^{(j)}| \quad (9)$$
where $V^{(j)}$ denotes a set consisting of the effective fuzzy classification rules obtained by $s^{(j)} c^{(j)}$, and $W_{CAR}$ and $W_V$ are relative weights of the classification accuracy rate by $V^{(j)}$ (i.e., $CAR(V^{(j)})$) and the number of fuzzy rules in $V^{(j)}$ (i.e., $|V^{(j)}|$), respectively. The chromosome that has the maximum fitness value in the final generation is further used to examine the classification performance of the proposed method. That is, the acquisition of a compact fuzzy rule set with high classification accuracy rate is taken into account in the overall
The traditional Apriori algorithm cannot classify infrequent items into interesting itemsets since subjective domain knowledge is ignored. A huge amount of subjective domain knowledge may exist, which can be considered as potential subjective constraints and measures for evaluating association rules. Following the discovery and reporting of some rules, a data miner can select the subjective interestingness measures in Step 3. In market basket analysis, understanding which products are usually bought together by customers and which products are beneficial to sellers are both interesting subjects for marketing analysts. The former can be measured in terms of support and confidence in association rules. In this paper, the subjective measures of sellers' profits are evaluated in terms of itemset value and cross-selling profit corresponding to the association rules. For association rules like X ⇒ Y, four criteria are jointly used for rule evaluation as follows:
2. BASIC CONCEPTS
2.1 Association Rules
Data mining can be applied to discover useful patterns and rules by exploring and analyzing a large quantity of data. That is, a collection of data from customer surveys, health studies, market examinations, item banks and other raw sources needs further analysis to transform it into useful information. In general, data mining involves the recognition of implicit patterns that are hard to analyze, even with traditional statistical techniques. An important mining task in this area is to discover association rules [10]. Suppose that basket data consists of items bought by a customer over a period of time; mining a large collection of basket data by association rules then refines the relations between sets of items with some specified confidence and support. The definitions of the association rules are reviewed as follows:
3.3.2. Incremental procedure of THUI-Mine
As shown in Table 3, $D^-$ indicates the unchanged portion of an ongoing transaction database. The deleted and added portions of an ongoing transaction database are denoted by $\Delta^-$ and $\Delta^+$, respectively. It is worth mentioning that the sizes of $\Delta^+$ and $\Delta^-$, i.e., $|\Delta^+|$ and $|\Delta^-|$ respectively, are not required to be the same. The incremental procedure of THUI-Mine is devised to maintain temporal high utility itemsets efficiently and effectively. This procedure is shown in Fig. 4. As mentioned before, this incremental step can also be divided into three sub-steps: (1) generating temporal high TWU2I in $D^- = db^{1,3} - \Delta^-$, (2) generating temporal high TWU2I in $db^{2,4} = D^- + \Delta^+$ and (3) scanning the database $db^{2,4}$ only once for the generation of all temporal high utility itemsets. Initially, after some update activities, old transactions $\Delta^-$ are removed from the database $db^{m,n}$ and new transactions $\Delta^+$ are added (in Step 6). Note that $\Delta^- \subset db^{m,n}$. Denoting the updated database as $db^{i,j}$, note that $db^{i,j} = db^{m,n} - \Delta^- + \Delta^+$. We denote the unchanged transactions by $D^- = db^{m,n} - \Delta^- = db^{i,j} - \Delta^+$. After loading $Thtw^{m,n}$ of $db^{m,n}$ into CF where $I \in Thtw^{m,n}$, we start the first sub-step, i.e., generating temporal high TWU2I in $D^- = db^{m,n} - \Delta^-$. This sub-step reverses the cumulative processing which is described in the pre-processing procedure. From Step 8 to Step 16, we prune the occurrences of an itemset $I$, which appeared before partition $P_i$, by deleting the value I.twu where $I \in CF$ and I.start < i. Next, from Step 17 to Step 39, similarly to the cumulative processing in Section 3.3.1, the second sub-step generates temporal high TWU2I in $db^{i,j} = D^- + \Delta^+$ and employs the scan reduction technique to generate $C^{i,j}_{h+1}$. Finally, to generate temporal high utility itemsets, i.e., $Thu^{i,j}$, in the updated database, we scan $db^{i,j}$ only once in the incremental procedure to find temporal high utility itemsets.
Note that $Thtw^{i,j}$ is kept in main memory for the next generation of incremental mining.
In order to conduct effective data mining, one needs to first examine what kind of features an applied knowledge discovery system is expected to have and what kind of chal-
Machine-learning and data-mining techniques have been developed to turn data into useful task-oriented knowledge. Most algorithms for mining association rules identify relationships among transactions using binary values and find rules at a single-concept level.
Transactions with quantitative values and items with hierarchical relationships are, however, commonly seen in real-world applications. This paper proposes a fuzzy multiple-level mining algorithm for extracting knowledge implicit in transactions stored as quantitative values. The proposed algorithm adopts a top-down progressively deepening approach to finding large itemsets. It integrates fuzzy-set concepts, data-mining technologies and multiple-level taxonomy to find fuzzy association rules from transaction data sets. Each item uses only the linguistic term with the maximum cardinality in later mining processes, thus making the number of fuzzy regions to be processed the same as the number of original items. The algorithm therefore focuses on the most important linguistic terms for reduced time complexity.
discovery of more general and important knowledge from data. Relevant taxonomies
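The step in which each item keeps only its linguistic term with the maximum cardinality can be illustrated with a short sketch. The triangular membership functions, the region boundaries in `REGIONS`, and the function names are hypothetical choices made for this illustration; the abstract does not specify them.

```python
# Hypothetical triangular regions (left foot, peak, right foot) shared by all items.
REGIONS = {"Low": (0, 1, 6), "Middle": (1, 6, 11), "High": (6, 11, 16)}


def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


def representative_regions(transactions):
    """Map each item to (region, cardinality) for its largest scalar cardinality."""
    totals = {}
    for transaction in transactions:
        for item, quantity in transaction.items():
            for region, (a, b, c) in REGIONS.items():
                key = (item, region)
                # Scalar cardinality: sum of membership values over all transactions.
                totals[key] = totals.get(key, 0.0) + triangular(quantity, a, b, c)
    best = {}
    for (item, region), count in totals.items():
        if item not in best or count > best[item][1]:
            best[item] = (region, count)
    return best
```

Because only one region per item survives, the later itemset search handles as many fuzzy regions as there are original items, which is the source of the reduced time complexity claimed above.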
of data items are thus usually predefined in real-world applications. An item may,
however, belong to different classes in different views. When taxonomic structures are
not crisp, hierarchical graphs can be used to represent them. Terminal nodes on the
lexicographic order. Frequent itemsets are computed iteratively in ascending order of size. Assuming the largest frequent itemsets contain k items, it takes k iterations to mine all frequent itemsets. The initial iteration computes the frequent 1-itemsets $L_1$. Then, for each iteration i ≤ k, all frequent i-itemsets are computed by scanning the database once. Iteration i consists of two phases. First, the set $C_i$ of candidate i-itemsets is created by joining the frequent (i-1)-itemsets in $L_{i-1}$ found in the previous iteration. Next, the database is scanned to determine the support of all candidates in $C_i$, and the frequent i-itemsets $L_i$ are extracted from these candidates. This iteration is repeated until no more candidates can be generated. The Apriori algorithm thus needs k database passes to generate all frequent itemsets. For disk-resident databases, each pass requires reading the database completely, resulting in a large number of disk reads. The Apriori algorithm therefore incurs a large number of I/O operations.
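The level-wise loop described above can be sketched in a few lines. This is an in-memory illustration, so the per-iteration "database scan" is simply a loop over a list; the function name `apriori`, the use of an absolute support count, and the subset-based pruning of candidates (standard in Apriori but not spelled out in the passage) are assumptions of the sketch.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Iteration 1: count single items to obtain the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    size = 2
    while current:
        # Phase 1: join the frequent (size-1)-itemsets to build candidates,
        # keeping only those whose (size-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in current for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
        # Phase 2: one pass over the data to count the support of each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        size += 1
    return frequent
```

Each pass of the `while` loop corresponds to one full read of the database, which is why the number of passes drives the I/O cost for disk-resident data.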
The discovery of fuzzy association rules is an important data-mining task for which many algorithms have been proposed.
However, the efficiency of these algorithms needs to be improved to handle real-world large datasets. In this paper, we present an efficient method named cluster-based fuzzy association rule (CBFAR) to discover generalized fuzzy association rules from web structures. The CBFAR method creates fuzzy cluster tables by scanning the browse information database (BIDB) once, and then clusters the browse records into the k-th cluster table, where k is the length of a record. The counts of the fuzzy regions are stored in the Fuzzy_Cluster Tables. This method requires less contrast to generate large itemsets. The CBFAR method is also discussed.
3 Dept. of Computer Science and Information Engineering, Tamkang University, Taiwan
1 wylin@nuk.edu.tw; 2 waiewing@gmail.com; 3 chchen@mail.tku.edu.tw
Abstract. An indirect association refers to an infrequent item pair, each item of which is highly co-occurring with a frequent itemset called a "mediator". Although indirect associations have been recognized as powerful patterns for revealing interesting information hidden in many applications, such as recommendation ranking, substitute or competitive items, and common web navigation paths, almost no work, to our knowledge, has investigated how to discover this type of pattern from streaming data. In this paper, the problem of mining indirect associations from data streams is considered. Unlike contemporary research on stream data mining, which investigates the problem individually for different types of streaming models, we treat the problem in a generic way. We propose a generic window model that can represent all classical streaming models and retain user flexibility in defining new models. In this context, a generic algorithm is developed, which guarantees no false positive rules and bounded support error as long as the window model is specifiable by the proposed generic model. Comprehensive experiments on both synthetic and real datasets have shown the effectiveness of the proposed approach as a generic way of finding indirect association rules over streaming data.
Upload time: 2009-12-17T06:58:05Z Publisher: Asia University
Abstract: With the rapid increase in the use of databases, the problem of missing values inevitably arises. The techniques developed to recover these missing values effectively should be highly precise in order to estimate the missing values completely. The mining of association rules can effectively establish the relationship among items in databases.
compares these itemsets with the previously retained large and pre-large 1-itemsets. It partitions candidate 1-itemsets into three parts according to whether they are large or pre-large for the original database. If a candidate 1-itemset from the newly inserted transactions is also among the large or pre-large 1-itemsets from the original database, its new total count for the entire updated database can easily be calculated from its current count and previous count, since all previous large and pre-large itemsets have been retained with their counts. Whether an originally large or pre-large itemset is still large or pre-large after new transactions have been inserted is determined from its new support ratio, i.e., its total count divided by the total number of transactions. By contrast, if a candidate 1-itemset from the newly inserted transactions does not exist among the large or pre-large 1-itemsets in the original database, then it cannot be large for the entire updated database as long as the number of newly inserted transactions is within the predefined number of new transactions. In this situation, no action is needed.
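The count update for 1-itemsets described above can be sketched as follows. The function name, the argument names, and the use of two ratio thresholds `upper` and `lower` for large and pre-large itemsets are assumptions made for illustration; the sketch also assumes the number of inserted transactions stays within the predefined bound, so items that were not retained are simply ignored.

```python
def update_one_itemsets(retained, old_size, new_transactions, upper, lower):
    """Update retained large/pre-large 1-itemset counts after insertions.

    retained: {item: count in the original database} for large and pre-large items.
    upper, lower: support ratios that define large and pre-large itemsets.
    Returns (large, pre_large) dictionaries mapping item -> total count.
    """
    # Current counts: occurrences within the newly inserted transactions only.
    new_counts = {}
    for transaction in new_transactions:
        for item in set(transaction):
            new_counts[item] = new_counts.get(item, 0) + 1
    total = old_size + len(new_transactions)
    large, pre_large = {}, {}
    for item, old_count in retained.items():
        # Retained itemset: total count = previous count + current count.
        count = old_count + new_counts.get(item, 0)
        ratio = count / total
        if ratio >= upper:
            large[item] = count
        elif ratio >= lower:
            pre_large[item] = count
    # Items absent from `retained` are skipped (the "no action" case in the text).
    return large, pre_large
```

No rescan of the original database is needed in this path, because every quantity used comes from the retained counts or from the new transactions.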
Data mining can uncover hidden information in data for decision-makers. During rush periods in the emergency room, how to help medical personnel provide effective services and thereby enhance patient safety is a very important issue. In this study, six methods (Affinity Set, Back-propagation Neural Network, Rough Set theory, Support Vector Machine, Decision Tree and Association Rules) are compared by their Receiver Operating Characteristic (ROC) curve performance to find the model best able to derive revisit rules for emergency patients. Study results show that Support Vector Machine has the best classification power, the affinity set model is second best, and both have a prediction accuracy of 80%. However
National Chengchi University Taipei, Taiwan, R.O.C.
Abstract
Mining frequent itemsets has been widely studied over the last decade. Past research focuses on mining frequent itemsets from static databases. In many of the new applications, data flow through the Internet or sensor networks. It is challenging to extend the mining techniques to such a dynamic environment. The main challenges include a quick response to the continuous request, a compact summary of the data stream, and a mechanism that adapts to the limited resources. In this paper, we develop a novel approach for mining frequent itemsets from data streams based on a time-sensitive sliding window model. Our approach consists of a storage structure that captures all possible frequent itemsets and a table providing approximate counts of the expired data items, whose size can be adjusted by the available storage space.
Transactions with quantitative values and items with hierarchical relations are, however, commonly seen in real-world applications. In this paper, we introduce the problem of mining generalized association rules for quantitative values. We propose a fuzzy generalized rule mining algorithm for extracting implicit knowledge from transactions stored as quantitative values. Given a set of transactions and a predefined taxonomy, we want to find fuzzy generalized association rules in which the quantitative items may come from any level of the taxonomy. Each item uses only the linguistic term with the maximum cardinality in later mining processes, thus making the number of fuzzy regions to be processed the same as that of the original items. The algorithm can therefore focus on the most important linguistic terms and reduce its time complexity. The proposed algorithm combines a fuzzy transaction data mining algorithm with an algorithm for mining generalized association rules. This paper relates to set concepts, fuzzy data mining algorithms, taxonomy and generalized association rules.
$$\omega_b \cdot FC(R_b) = \max_j \{\, \omega_j \cdot FC(R_j) \mid R_j \in TR \,\}, \quad (10)$$
where TR is the set of fuzzy rules generated by the proposed method. The adaptive rules are further employed to adjust the fuzzy confidence of $R_b$: if $t_p$ is correctly classified then $FC(R_b)$ is increased; otherwise, $FC(R_b)$ is decreased. Nozaki et al. (1996) also suggested that the learning rates should be specified as $0 < \eta_1 \ll \eta_2 < 1$. In the experiment, $\eta_1 = 0.001$, $\eta_2 = 0.1$ and $J_{\max} = 500$ are used. In the subsequent section, experimental results from the iris data demonstrate the effectiveness of the proposed method. However, the aim of the experiment is to show the feasibility and the problem-solving capability of the proposed method for classification problems. That is, methods for acquiring appropriate parameter specifications to obtain higher classification accuracy rates and a smaller number of fuzzy if–then rules are not considered in this paper.
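The reward-and-punish adjustment of the fuzzy confidence can be illustrated with a small sketch. The passage states only the direction of the change; the specific update form below (move towards 1 on a correct classification, shrink proportionally on an error) is an assumption modelled on the adaptive scheme of Nozaki et al., and the names `adapt_confidence`, `eta1` and `eta2` are hypothetical. The default rates are the two learning-rate values quoted above.

```python
def adapt_confidence(fc, correct, eta1=0.001, eta2=0.1):
    """One adaptive update of the fuzzy confidence of the winning rule.

    fc: current fuzzy confidence in [0, 1].
    correct: True if the training pattern was classified correctly.
    eta1, eta2: learning rates for reward and punishment (eta1 much smaller).
    """
    if correct:
        # Assumed reward step: move the confidence slightly towards 1.
        return fc + eta1 * (1.0 - fc)
    # Assumed punishment step: shrink the confidence by a larger fraction.
    return fc - eta2 * fc
```

With these forms the confidence always stays within [0, 1], and the much smaller reward rate means a single misclassification outweighs many correct classifications.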