CHAPTER 2 Review of Related Works

4.2 Projection-based Weighted Sequential Pattern Mining with Improved

4.2.3 Tightening Upper bound strategy

For example, assume a sequence <ABDEF> with five items, A, B, D, E, and F, whose weights are 0.20, 0.60, 0.30, 0.40, and 0.50, respectively. In addition, assume the weighted frequent upper-bound 1-patterns are <A>, <D>, <E>, and <F>.

Since item B in <ABDEF> does not appear among the weighted frequent upper-bound 1-patterns, it can be removed from the sequence, giving the new sequence <ADEF>. The upper bound of the weight for the sequence is then updated from 0.60 to 0.50, the largest weight among the remaining items.

As the example shows, unpromising items can be pruned effectively by keeping only the items that appear in the weighted frequent upper-bound patterns. Because removing them can lower the sequence maximum weight, this pruning strategy tightens the upper bound and can improve the performance of finding weighted sequential patterns.
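The example can be replayed with a minimal Python sketch. The function name `tighten_upper_bound` is hypothetical, sequences are assumed to be strings of single-letter items, and the five weights are assumed to belong to A, B, D, E, and F in that order.

```python
def tighten_upper_bound(sequence, weights, promising_items):
    """Drop items that are not promising and recompute the maximum weight."""
    kept = [item for item in sequence if item in promising_items]
    # The upper bound of a sequence is the largest weight among its kept items.
    new_bound = max((weights[item] for item in kept), default=0.0)
    return "".join(kept), new_bound


# Assumed mapping of the example weights to the items of <ABDEF>.
weights = {"A": 0.2, "B": 0.6, "D": 0.3, "E": 0.4, "F": 0.5}
promising = {"A", "D", "E", "F"}  # items of the weighted frequent upper-bound 1-patterns

pruned, bound = tighten_upper_bound("ABDEF", weights, promising)
print(pruned, bound)  # ADEF 0.5
```

Removing B lowers the bound from 0.6 to 0.5, which is the tightening effect described above.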

4.2.4 Filtering Strategy

The concepts of this strategy are similar to those described in Section 3.3.2. However, its application differs somewhat between weighted itemset mining and weighted sequential pattern mining. The main difference is that the items in a sequence may appear in any order, so the relationship between the last item of the prefix pattern and the items after the prefix pattern cannot be judged when the last item of the prefix pattern has not yet been processed. In that case, the items after the prefix pattern are kept in the projected sequence.

For example, assume the weighted frequent upper-bound 1-patterns that have already been processed are <E> and <F>. The 1-pattern currently being processed is <D>, there is a projected sequence <DACEF> with prefix pattern <D>, and the 2-pattern now being processed is <DA>. Because the pattern <A> has not yet been processed, the items after the prefix pattern <DA> are kept in the projected sequence with prefix pattern <DA>, which is therefore still <DACEF>.
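The filtering rule can be sketched as follows. This is only one possible reading of the strategy: the function `filter_projected`, the set `processed`, and the set `wfub_pairs` (pairs of items already known to form weighted frequent upper-bound 2-patterns) are hypothetical names, and the sequence is assumed to contain the prefix.

```python
def filter_projected(sequence, prefix, processed, wfub_pairs):
    """Return the projected sequence for `prefix` after the filtering check."""
    # Locate the position just after the earliest occurrence of the prefix.
    pos = 0
    for item in prefix:
        pos = sequence.index(item, pos) + 1
    head, tail = sequence[:pos], sequence[pos:]

    last = prefix[-1]
    if last not in processed:
        # Nothing is known yet about the last prefix item: keep every item.
        return sequence
    # Otherwise keep only items whose pairing with `last` is known to be promising.
    return head + "".join(i for i in tail if (last, i) in wfub_pairs)


# Example from the text: <E> and <F> are processed, <A> is not.
print(filter_projected("DACEF", "DA", {"E", "F"}, set()))  # DACEF

# Hypothetical case: E is processed and only (E, F) is known to be promising.
print(filter_projected("DECF", "DE", {"E", "F"}, {("E", "F")}))  # DEF
```

The second call uses invented data purely to show the branch in which the last prefix item has already been processed.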

4.2.5 The Proposed Mining Algorithm with Improved Strategies

The procedure of the proposed mining algorithm with the two strategies, tightening and filtering, is stated below.

INPUT: A set of items, each with a weight; a sequence database SDB, in which each sequence includes a subset of items; a minimum weighted support threshold.

OUTPUT: A final set of weighted sequential patterns, WS.

STEP 1: For each sequence Seq_y in SDB, find the sequence maximum weight smw_y of Seq_y as:

$smw_y = \max\{w_{y1}, w_{y2}, \ldots, w_{yj}\}$, where w_yj is the weight value w_i of the j-th item i in Seq_y.

STEP 2: Find the total sequence maximum weight tsmw of the sequence database SDB as:

$tsmw = \sum_{Seq_y \in SDB} smw_y$

STEP 3: For each item i in SDB, do the following substeps.

(a) Calculate the sequence-weighted upper bound swub_i of i as:

$swub_i = \frac{\sum_{i \subseteq Seq_y \wedge Seq_y \in SDB} smw_y}{tsmw}$

An item may appear multiple times in a sequence, but its frequency in the sequence Seq_y is taken as 1.

(b) Calculate the actual weighted support wsup_i of item i as:

$wsup_i = \frac{\sum_{i \subseteq Seq_y \wedge Seq_y \in SDB} w_i}{tsmw}$

STEP 4: For each item i in SDB, do the following substeps.

(a) If the sequence-weighted upper bound swub_i of i is larger than or equal to the minimum weighted support threshold, put it in the set of weighted frequent upper-bound 1-patterns, WFUB_1.

(b) If the actual weighted support wsup_i of item i is larger than or equal to the minimum weighted support threshold, put it in the set of weighted sequential 1-patterns, WS_1.
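STEPs 1 to 4 can be sketched in Python as below. The sketch assumes that tsmw is the sum of all sequence maximum weights and that both swub and wsup are normalised by tsmw; the function name `first_pass`, the small database, and the threshold 0.4 are invented for illustration.

```python
def first_pass(sdb, weights, min_wsup):
    """Compute tsmw, swub and wsup of every item, then WFUB_1 and WS_1."""
    smw = [max(weights[i] for i in seq) for seq in sdb]   # STEP 1
    tsmw = sum(smw)                                       # STEP 2 (assumed sum)
    swub, wsup = {}, {}
    for seq, seq_max in zip(sdb, smw):                    # STEP 3
        for item in set(seq):  # an item counts once per sequence
            swub[item] = swub.get(item, 0.0) + seq_max
            wsup[item] = wsup.get(item, 0.0) + weights[item]
    swub = {i: v / tsmw for i, v in swub.items()}
    wsup = {i: v / tsmw for i, v in wsup.items()}
    wfub1 = sorted(i for i in swub if swub[i] >= min_wsup)  # STEP 4(a)
    ws1 = sorted(i for i in wsup if wsup[i] >= min_wsup)    # STEP 4(b)
    return tsmw, swub, wsup, wfub1, ws1


# Hypothetical data: four sequences of single-letter items.
weights = {"A": 0.2, "B": 0.6, "D": 0.3, "E": 0.4, "F": 0.5}
sdb = ["ABDEF", "ADEF", "ADE", "DEF"]

tsmw, swub, wsup, wfub1, ws1 = first_pass(sdb, weights, 0.4)
print(wfub1)  # ['A', 'D', 'E', 'F']
print(ws1)    # ['D', 'E', 'F']
```

In this toy database, item A passes the upper-bound test but not the actual weighted support test, so it stays a candidate for extension without being reported as a weighted sequential 1-pattern.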

STEP 5: Set r = 1, where r represents the number of items in the processed subsequences.

STEP 6: Gather the items that appear in the set WFUB_1, and put them in the set of possible items, PI_r.

STEP 7: For each y-th sequence Seq_y in SDB, do the following substeps.

(a) Get each item i in Seq_y.

(b) Check whether item i appears in PI_r. If it does, keep item i in Seq_y; otherwise, remove item i from Seq_y.

(c) If the number of items kept in the modified sequence Seq_y is less than r+1, remove the modified sequence Seq_y from SDB; otherwise, keep it in SDB.
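STEPs 6 and 7 amount to a database-level pruning pass, sketched below. The function name `prune_database` and the sample database are hypothetical; sequences are again assumed to be strings of single-letter items.

```python
def prune_database(sdb, possible_items, r):
    """Keep only possible items, then drop sequences shorter than r + 1 items."""
    pruned = []
    for seq in sdb:
        kept = "".join(item for item in seq if item in possible_items)
        # A sequence with fewer than r + 1 items cannot hold an (r+1)-pattern.
        if len(kept) >= r + 1:
            pruned.append(kept)
    return pruned


# Hypothetical database; the possible items are A, D, E and F, and r = 1.
sdb = ["ABDEF", "ADEF", "ADE", "DEF", "BA"]
print(prune_database(sdb, {"A", "D", "E", "F"}, 1))
# ['ADEF', 'ADEF', 'ADE', 'DEF']
```

The last sequence shrinks to a single item after B is removed, so it is dropped from the database.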

STEP 8: Process each pattern S in the set WFUB_1 in reverse alphabetical order, from the last pattern to the first (as mentioned in Sections 3.3.2 and 4.2.4), by the following substeps.

(a) Find the sequences in SDB that include S, and put them in the set of projected sequences sdb_S of pattern S.

(b) Find the sequence maximum weight smw_y of each Seq_y in sdb_S as:

$smw_y = \max\{w_{y1}, w_{y2}, \ldots, w_{yj}\}$, where w_yj is the weight value w_i of the j-th item in Seq_y.

(c) Find all the weighted sequential patterns with S as their prefix pattern by the Finding-WS(S, sdb_S, r) procedure. Let the set of returned weighted sequential patterns be WS_S.

STEP 9: Output the set of weighted sequential patterns in all WS_S.

After STEP 9, all the weighted sequential patterns are found. The Finding-WS(x, sdb_x, r) procedure finds all the weighted sequential patterns with the r-pattern x as their prefix patterns and is stated as follows.

The Finding-WS(x, sdb_x, r) procedure:

Input: A prefix r-pattern x and its corresponding projected sequences sdb_x.

Output: The weighted sequential patterns with the prefix pattern x.

PSTEP 1: Initialize the temporary subsequence table TS_x as an empty table, in which each tuple consists of three fields: the subsequence, the sequence-weighted upper bound (swub) of the subsequence, and the actual weighted support (wsup) of the subsequence.

PSTEP 2: For each y-th sequence Seq_y in sdb_x, do the following substeps.

(a) Get each item i located after x in Seq_y.

(b) Generate the (r+1)-subsequence S’ composed of the prefix r-pattern x and i. If S’ does not yet appear in the temporary set of subsequences, put it in the set; otherwise, discard the duplicate S’.

(c) For each unique (r+1)-subsequence S’ in the temporary set of subsequences, add the sequence maximum weight smw_y of the sequence Seq_y and the weight w_S’ of the subsequence S’ to the corresponding fields of S’ in the TS_x table.

PSTEP 3: For each (r+1)-subsequence in the TS_x table, do the following substeps.

(a) If the sequence-weighted upper bound swub_S’ of the (r+1)-subsequence S’ is larger than or equal to the minimum weighted support threshold, put it in the set of weighted frequent upper-bound (r+1)-patterns with x as their prefix sub-pattern, WFUB_(r+1),x.

(b) If the actual weighted support wsup_S’ of the (r+1)-pattern S’ is larger than or equal to the minimum weighted support threshold, put it in the set of weighted sequential (r+1)-patterns, WS_x.
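PSTEPs 1 to 3 can be sketched as follows. Several points are assumptions rather than statements from the text: the weight w_S’ of a subsequence is taken as the average weight of its items, both accumulated fields are divided by tsmw before being compared with the threshold, and every projected sequence is assumed to contain the prefix. The function name `extend_prefix`, the data, and the threshold 0.65 are hypothetical.

```python
def extend_prefix(prefix, projected, weights, tsmw, min_wsup):
    """Build the TS_x table for `prefix` and split it into WFUB and WS sets."""
    table = {}  # PSTEP 1: subsequence -> [swub numerator, wsup numerator]
    for seq in projected:                                   # PSTEP 2
        seq_max = max(weights[i] for i in seq)              # smw_y of Seq_y
        pos = 0
        for item in prefix:  # position just after the earliest match of the prefix
            pos = seq.index(item, pos) + 1
        # A set keeps each (r+1)-subsequence once per sequence.
        candidates = {prefix + item for item in seq[pos:]}
        for sub in candidates:
            w_sub = sum(weights[i] for i in sub) / len(sub)  # assumed average weight
            row = table.setdefault(sub, [0.0, 0.0])
            row[0] += seq_max
            row[1] += w_sub
    # PSTEP 3: compare the normalised values with the threshold.
    wfub = sorted(s for s, (ub, _) in table.items() if ub / tsmw >= min_wsup)
    ws = sorted(s for s, (_, w) in table.items() if w / tsmw >= min_wsup)
    return wfub, ws


# Hypothetical projected sequences of prefix <D>; tsmw is assumed to be 2.0.
weights = {"A": 0.2, "D": 0.3, "E": 0.4, "F": 0.5}
projected = ["ADEF", "ADEF", "ADE", "DEF"]
wfub, ws = extend_prefix("D", projected, weights, 2.0, 0.65)
print(wfub)  # ['DE', 'DF']
print(ws)    # ['DE']
```

Here <DF> has a normalised upper bound of about 0.75 but an actual weighted support of about 0.6, so it remains a candidate prefix without being a weighted sequential pattern at this threshold.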

PSTEP 4: Gather the items that appear in the set WFUB_(r+1),x of x, and put them in the set of possible items, PI_(r+1),x.

PSTEP 5: Set r = r+1, where r represents the number of items in the processed subsequences.

PSTEP 6: For each y-th sequence Seq_y in sdb_x, do the following substeps.

(a) Check whether each item i in Seq_y appears in PI_r,x. If it does, keep item i in Seq_y; otherwise, remove item i from Seq_y.

(b) If the number of items kept in the modified sequence Seq_y is less than r+1, remove the modified sequence Seq_y from sdb_x; otherwise, keep it in sdb_x.

PSTEP 7: Process each pattern S’ in the set WFUB_r,x in alphabetical order by the following substeps.

(a) Find the sequences in sdb_x that include S’, and put them in the set of projected sequences sdb_S’ of S’.

(b) For each item i located after S’ in each sequence of sdb_S’, if the relationship between the r-th item of S’ and i is a weighted frequent upper bound, keep the item i in the projected sequence; otherwise, remove the item i from the projected sequences with pattern S’ as prefix.

(d) Calculate the new sequence maximum weight smw_y of each Seq_y in sdb_S’ as:

$smw_y = \max\{w_{y1}, w_{y2}, \ldots, w_{yj}\}$, where w_yj is the weight value w_i of the j-th item in Seq_y.

(e) Find all weighted sequential patterns with S’ as their prefix pattern by the Finding-WS(S’, sdb_S’, r+1) procedure. Let the set of returned weighted sequential patterns be WS_S’.

PSTEP 8: Return the set of weighted sequential patterns in all WS_x.

In document: A Study of Effective Weighted Data Mining Methods (pages 81-88)