

3.4 Experimental Evaluation

3.4.4 Evaluation on A Real Dataset, Foodmart

The dataset sales_fact_1997 from the real dataset Foodmart was used to evaluate the performance of the three algorithms. Figures 3.7 to 3.10 show a series of comparisons of the numbers of weighted frequent upper-bound itemsets and of execution efficiency for minimum weighted support thresholds ranging from 0.0052% down to 0.0044%.

Figure 3.7: Comparison of the numbers of weighted frequent upper-bound itemsets required by the three algorithms under different thresholds.

Figure 3.8: Pruning rates of the two proposed algorithms under different thresholds.

Figure 3.9: Efficiency comparison of the three algorithms under different thresholds.

Figure 3.10: Efficiency improvement of the proposed algorithms under different thresholds.

As the figures show, both proposed algorithms outperform the traditional WFIM algorithm on the real dataset under different minimum weighted support thresholds, in terms of both the number of weighted frequent upper-bound itemsets and execution efficiency.

CHAPTER 4 Weighted Sequential Pattern Mining

In this chapter, we extend our research topic from weighted frequent itemset mining to weighted sequential pattern mining. We first propose a projection-based weighted sequential pattern mining algorithm (PWS). Next, we develop the projection-based weighted sequential pattern mining algorithm with improved strategies (PWSI), which adds the tightening and filtering strategies.

4.1 Problem and Definitions

To understand the problem of weighted sequential pattern mining, consider the sequence database given in Table 4.1, in which each sequence consists of two features: the sequence identification (SID) and the items purchased (or the events that occurred). There are eight items in the sequences, denoted A to H. The predefined weight of each item is shown in Table 4.2.

For the formal definition of weighted sequential pattern mining, a set of terms related to the problem [41][44] is defined below.

Table 4.1: Set of five sequences for the given example.

Table 4.2: Weights of items given in Table 4.1.

Item Weight

Definition 1. A sequence Seq is composed of a set of itemsets in order of creation. The size of the sequence Seq, |Seq|, is the number of itemsets in Seq. In addition, if the number of items in a sequence, l_Seq, is l, the sequence Seq with length l is called an l-sequence. For simplicity, the brackets around an itemset are removed if there is only one item in the itemset. For example, the 5-sequence <(A)(AB)(CD)>, whose size is 3, can be simplified as <A(AB)(CD)>.

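Definition 2. If every itemset of a sequence α is contained, in the same order, in a distinct itemset of a sequence β, then sequence α is called the subsequence of sequence β, and sequence β is called the super-sequence of sequence α. For example, the sequence <ABC> is the subsequence of <(A)(AB)(CD)>, and the sequence <(A)(AB)(CD)> is the super-sequence of <ABC>.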
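The notions of size, length, and subsequence can be illustrated with a minimal Python sketch. The list-of-sets representation and the helper names `seq`, `size`, `length`, and `is_subsequence` are assumptions made for illustration and do not come from the thesis; the containment test follows the usual order-preserving definition of a subsequence.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]  # a sequence is an ordered list of itemsets


def seq(*itemsets: str) -> Sequence:
    """Build a sequence from strings, e.g. seq("A", "AB", "CD")."""
    return [frozenset(s) for s in itemsets]


def size(s: Sequence) -> int:
    """Number of itemsets in the sequence (|Seq|)."""
    return len(s)


def length(s: Sequence) -> int:
    """Number of items in the sequence, counting repeats across itemsets."""
    return sum(len(itemset) for itemset in s)


def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """True if each itemset of alpha fits, in order, into a distinct itemset of beta."""
    pos = 0
    for itemset in alpha:
        # Greedily take the earliest remaining itemset of beta that contains this one.
        while pos < len(beta) and not itemset <= beta[pos]:
            pos += 1
        if pos == len(beta):
            return False
        pos += 1
    return True


beta = seq("A", "AB", "CD")                      # <(A)(AB)(CD)>
print(size(beta), length(beta))                  # 3 5
print(is_subsequence(seq("A", "B", "C"), beta))  # True
print(is_subsequence(seq("C", "A"), beta))       # False
```

The greedy scan is sufficient here because matching an itemset as early as possible never rules out a later match.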

Definition 3. A sequence database SDB is composed of a set of sequences. That is, SDB = {Seq_1, Seq_2, …, Seq_y, …, Seq_z}, where Seq_y is the y-th sequence in SDB.

Definition 4. The weight of a subsequence S, w_S, is the sum of the weights of all itemsets in S divided by the number of itemsets in S. That is:

$$w_S = \frac{\sum_{X \in S} w_X}{|S|},$$

where |S| and w_X are the number of itemsets in the subsequence S and the weight of the itemset X in S, respectively. For example, in Table 4.1 and Table 4.2, the third sequence <ACF(DE)F> includes five itemsets, A, C, F, (DE), and F, whose weights are 0.10, 0.20, 0.55, 0.35, and 0.55, respectively. Therefore, w_<ACF(DE)F> = (0.10 + 0.20 + 0.55 + 0.35 + 0.55) / 5 = 0.35.
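The averaging in Definition 4 can be checked with a short Python sketch. The function name `subsequence_weight` is an assumption for illustration, and the itemset weights are passed in directly as quoted in the example, since the rule for deriving the weight of a multi-item itemset such as (DE) is not restated here.

```python
from typing import Sequence


def subsequence_weight(itemset_weights: Sequence[float]) -> float:
    """Average of the itemset weights (Definition 4)."""
    if not itemset_weights:
        raise ValueError("a subsequence needs at least one itemset")
    return sum(itemset_weights) / len(itemset_weights)


# Itemset weights of <ACF(DE)F> as quoted in the text: A, C, F, (DE), F.
w = subsequence_weight([0.10, 0.20, 0.55, 0.35, 0.55])
print(round(w, 2))  # 0.35
```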

Definition 5. The sequence maximum weight of a sequence S, smw_S, is the maximum weight among those of all items in sequence S. For example, in Table 4.1, the first sequence <BCB> includes two items, B and C, whose weights are 0.15 and 0.20, respectively. Therefore, smw_<BCB> = 0.20.
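A minimal Python sketch of Definition 5 follows. The function name `sequence_max_weight` and the list-of-sets representation are assumptions for illustration; only the two item weights quoted in the example are used.

```python
from typing import Dict, List, Set


def sequence_max_weight(sequence: List[Set[str]], weights: Dict[str, float]) -> float:
    """Largest item weight occurring anywhere in the sequence (Definition 5)."""
    return max(weights[item] for itemset in sequence for item in itemset)


# Only the two weights quoted in the text are needed for <BCB>.
weights = {"B": 0.15, "C": 0.20}
seq1 = [{"B"}, {"C"}, {"B"}]  # <BCB>
smw = sequence_max_weight(seq1, weights)
print(smw)  # 0.2
```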

Definition 6. The total sequence maximum weight of a sequence database SDB, tsmw, is the sum of the sequence maximum weights of all sequences in SDB. That is:

$$tsmw = \sum_{Seq_y \in SDB} smw_{Seq_y}.$$

Definition 7. The weighted support value of a subsequence S, wsup_S, is the sum of the weights of S over the sequences in SDB that include S, divided by the total sequence maximum weight tsmw. That is:

$$wsup_S = \frac{\sum_{S \subseteq Seq_y \wedge Seq_y \in SDB} w_S}{tsmw}.$$

For example, in Table 4.1, the weight of the subsequence <CF> is 0.375, and it appears only in the sequence Seq_3. The total sequence maximum weight is 3.20. The weighted support of <CF> can then be calculated as 0.375 / 3.20, which is 11.72%.

Definition 8. Let λ be a pre-defined minimum weighted support threshold. A subsequence S is called a weighted sequential pattern, WS, if wsup_S ≥ λ. For example, if λ = 30%, then the subsequence <CF> is not a weighted sequential pattern since wsup_<CF> = 11.72% < λ.
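The <CF> example can be reproduced with a short Python sketch. The function names and the idea of passing the number of containing sequences as `containing_count` are assumptions for illustration; the weights of C and F, the value tsmw = 3.20, and the single containing sequence are taken from the examples in the text.

```python
def weighted_support(pattern_weight: float, containing_count: int, tsmw: float) -> float:
    """wsup_S: the weight of S summed over the sequences containing S, divided by tsmw."""
    return pattern_weight * containing_count / tsmw


def is_weighted_sequential_pattern(wsup: float, min_wsup: float) -> bool:
    """Definition 8: S qualifies when wsup_S >= lambda."""
    return wsup >= min_wsup


w_cf = (0.20 + 0.55) / 2  # weight of <CF> from the weights of C and F
wsup_cf = weighted_support(w_cf, containing_count=1, tsmw=3.20)
print(f"{wsup_cf:.2%}")                               # 11.72%
print(is_weighted_sequential_pattern(wsup_cf, 0.30))  # False
```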

The downward-closure property of traditional sequential pattern mining does not hold in weighted sequential pattern mining. Take item A in Table 4.1 as an example. Three sequences in Table 4.1 include item A, and the weight of item A is 0.10. The weighted support value of the subsequence <A> can be calculated as (0.10 + 0.10 + 0.10) / 3.20, which is 9.38%. If λ = 30%, then the subsequence <A> is not a weighted sequential pattern, but its super-sequence <AF> is a weighted sequential pattern. As this example shows, weighted sequential pattern mining is more difficult than traditional sequential pattern mining, because a subsequence cannot be pruned merely because it fails the threshold.

This study proposes an effective sequence maximum weight (SMW) model that reduces the number of unpromising subsequences and thereby speeds up the discovery of weighted sequential patterns. The terms used in the proposed SMW model are defined below.
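The failure of downward closure in this example can be sketched in Python. The helper `wsup` and its parameters are hypothetical names. Because Table 4.1 is not reproduced here, the sketch assumes that <AF> occurs in the same three sequences as <A>, which is the only count under which the statement about <AF> holds for λ = 30%.

```python
TSMW = 3.20      # total sequence maximum weight quoted in the text
MIN_WSUP = 0.30  # lambda = 30%
W = {"A": 0.10, "F": 0.55}


def wsup(item_weights, containing_count, tsmw=TSMW):
    """Weighted support of a pattern made of single-item itemsets."""
    pattern_weight = sum(item_weights) / len(item_weights)
    return pattern_weight * containing_count / tsmw


wsup_a = wsup([W["A"]], containing_count=3)
# Assumption: <AF> occurs in the same three sequences as <A>.
wsup_af = wsup([W["A"], W["F"]], containing_count=3)

print(wsup_a >= MIN_WSUP)   # False: <A> fails the threshold
print(wsup_af >= MIN_WSUP)  # True: its super-sequence <AF> passes
```

Appending the heavier item F raises the average weight from 0.10 to 0.325, which is why the super-sequence can pass even though its subsequence fails.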

Definition 9. The sequence-weighted upper bound of a subsequence S, swub_S, is the sum of the sequence maximum weights of the sequences that include S in a sequence database SDB divided by the total sequence maximum weight tsmw of SDB. That is:

$$swub_S = \frac{\sum_{S \subseteq Seq_y \wedge Seq_y \in SDB} smw_{Seq_y}}{tsmw}.$$

Definition 10. Let λ be a pre-defined minimum weighted support threshold. A subsequence S is called a weighted frequent upper-bound pattern (WFUB) if swub_S ≥ λ. For example, if λ = 30%, then the subsequence <D> is a weighted frequent upper-bound pattern since swub_<D> = 64.06% ≥ λ.
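The upper-bound computation can be sketched in Python on a toy database. The three sequences and the item weights below are hypothetical values chosen for easy arithmetic, not the data of Tables 4.1 and 4.2, and the helper names `contains`, `smw`, and `swub` are assumptions for illustration.

```python
from typing import Dict, FrozenSet, List

Sequence = List[FrozenSet[str]]


def contains(sequence: Sequence, pattern: Sequence) -> bool:
    """Order-preserving containment test of pattern in sequence."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def smw(sequence: Sequence, weights: Dict[str, float]) -> float:
    """Sequence maximum weight: the largest item weight in the sequence."""
    return max(weights[item] for itemset in sequence for item in itemset)


def swub(pattern: Sequence, db: List[Sequence], weights: Dict[str, float]) -> float:
    """Sum of smw over the sequences containing the pattern, divided by tsmw."""
    tsmw = sum(smw(s, weights) for s in db)
    covered = sum(smw(s, weights) for s in db if contains(s, pattern))
    return covered / tsmw


# Hypothetical toy data, not taken from the thesis.
weights = {"A": 0.125, "B": 0.25, "C": 0.5}
db = [
    [frozenset("A"), frozenset("B")],  # <AB>, smw = 0.25
    [frozenset("A"), frozenset("C")],  # <AC>, smw = 0.5
    [frozenset("B")],                  # <B>,  smw = 0.25
]
pattern_a = [frozenset("A")]
bound = swub(pattern_a, db, weights)
print(bound)          # 0.75
print(bound >= 0.30)  # True
```

In this toy data the weighted support of <A> is 0.125 × 2 / 1.0 = 25%, which is below both the 30% threshold and the 75% bound, so <A> is kept as a weighted frequent upper-bound pattern even though it is not itself a weighted sequential pattern.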

Based on the definitions above, weighted sequential pattern mining considers the individual weights of items in a sequence database. The goal is to find, effectively and efficiently, all the weighted sequential patterns whose weighted support values are larger than or equal to a predefined minimum weighted support threshold λ in a given sequence database. The details of the proposed PWSI algorithm are described in the next section.

4.2 Projection-based Weighted Sequential Pattern

In document: A Study of Effective Weighted Data Mining Methods (pages 71-79)