An Example of PWSI - Projection-based Weighted Sequential Pattern Mining with Improved Strategies

4.2 Projection-based Weighted Sequential Pattern Mining with Improved Strategies

4.2.6 An Example of PWSI

In this section, a simple example is given to illustrate how the proposed PWSI algorithm finds weighted sequential patterns in a sequence database. Consider the five sequences in the sequence database shown in Table 4.1, which contains eight items denoted A to H. The individual weights of the eight items are given in Table 4.2. In this example, the minimum weighted support threshold λ is set to 30%. The detailed process of the proposed algorithm is described below.

STEP 1: The sequence maximum weight of each sequence in SDB is first found. Take the first sequence Seq₁ in Table 4.1 as an example. The sequence Seq₁ includes three items, B, C, and B, whose weights are 0.15, 0.20, and 0.15, respectively. The maximum weight is 0.20, which is regarded as the sequence maximum weight of Seq₁. The same process is used for the other four sequences in Table 4.1. The sequence maximum weights of all sequences are shown in Table 4.3.

Table 4.3: Sequence maximum weights of the five sequences in the example.

SID     Sequence       smw_y
Seq₁    <BCB>          0.20
Seq₂    <HFD>          0.95
Seq₃    <ACF(DE)F>     0.55
Seq₄    <AGFH>         0.95
Seq₅    <AFDEC>        0.55

STEP 2: According to sequence maximum weight (smw) of each sequence in Table 4.3, the total sequence maximum weight (tsmw) can be calculated as 0.20 + 0.95 + 0.55 + 0.95 + 0.55 = 3.20.
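STEP 1 and STEP 2 can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function name `sequence_max_weight` is hypothetical, only the two item weights quoted above for Seq₁ are used, and the remaining smw values are copied from Table 4.3.

```python
# Item weights quoted in the text for the items of Seq1.
weights = {"B": 0.15, "C": 0.20}


def sequence_max_weight(sequence, weights):
    """Return the largest item weight in a sequence (STEP 1)."""
    return max(weights[item] for item in sequence)


smw_seq1 = sequence_max_weight(["B", "C", "B"], weights)

# smw values of Seq1..Seq5 as listed in Table 4.3.
smw = [0.20, 0.95, 0.55, 0.95, 0.55]

# Total sequence maximum weight (STEP 2), rounded to absorb float error.
tsmw = round(sum(smw), 2)
```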

STEP 3: The sequence-weighted upper bound (swub) and weighted support (wsup) of each possible item in SDB are found simultaneously. Take item A in Table 4.3 as an example.

Item A appears in the sequences Seq₃, Seq₄, and Seq₅, whose sequence maximum weights are 0.55, 0.95 and 0.55. In addition, the weight of item A in Table 4.2 is 0.10, and the total sequence maximum weight tsmw is 3.20. The sequence-weighted upper bound swub_A of item A can then be calculated as (0.55 + 0.95 + 0.55) / 3.2, which is 64.06%, and its weighted support wsup_A can be calculated as (0.10 + 0.10 + 0.10) / 3.2, which is 9.37%. The other possible items in SDB can be processed in the same fashion. The sequence-weighted upper bounds and the weighted supports of all possible 1-subsequences in SDB are shown in Table 4.4.
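The arithmetic for item A can be checked with the short sketch below. The variable names are illustrative only; the smw values come from Table 4.3 and the weight of A from the text.

```python
# Sequence maximum weights from Table 4.3.
smw = {"Seq1": 0.20, "Seq2": 0.95, "Seq3": 0.55, "Seq4": 0.95, "Seq5": 0.55}
tsmw = sum(smw.values())

# Item A occurs in Seq3, Seq4 and Seq5 and has weight 0.10.
sequences_with_a = ["Seq3", "Seq4", "Seq5"]
weight_a = 0.10

# Upper bound: smw of every containing sequence, divided by tsmw.
swub_a = sum(smw[sid] for sid in sequences_with_a) / tsmw

# Actual weighted support: the item's own weight, counted once per containing sequence.
wsup_a = weight_a * len(sequences_with_a) / tsmw
```

The two ratios evaluate to approximately 0.640625 and 0.09375, which the text reports as 64.06% and 9.37%.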

Table 4.4: Sequence-weighted upper bounds and weighted supports of all 1-subsequences in the example.

STEP 4: […] support threshold λ (= 30%), <D> is a weighted frequent upper-bound 1-pattern. However, <D> is not a weighted sequential pattern because its weighted support (= 28.12%) is below the threshold. The other seven 1-subsequences in Table 4.4 can be processed in the same way. After this step, the set of weighted frequent upper-bound 1-patterns (WFUB₁) includes <A>, <C>, <D>, <E>, <F>, and <H>. The two 1-subsequences <F> and <H> are put in the set of weighted sequential 1-patterns (WS₁), as shown in Table 4.5 and Table 4.6.

Table 4.5: Set of the weighted frequent upper-bound 1-patterns in the example.

Table 4.6: Set of weighted sequential 1-patterns in the example.

Subsequence    wsup
<F>            68.75%
<H>            59.37%
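The threshold check that separates weighted frequent upper-bound patterns from weighted sequential patterns can be sketched as follows. The function `classify` is a hypothetical helper, not part of the thesis; the values for <A> are those derived in STEP 3, and the swub of <F> is recomputed here from Table 4.3 instead of being quoted from Table 4.4.

```python
THRESHOLD = 0.30  # minimum weighted support threshold (30%)


def classify(swub, wsup, threshold=THRESHOLD):
    """Return (in_wfub, in_ws) for one subsequence."""
    in_wfub = swub >= threshold
    # A pattern can only be a weighted sequential pattern if it passed the upper bound.
    in_ws = in_wfub and wsup >= threshold
    return in_wfub, in_ws


tsmw = 3.20

# <A>: swub and wsup as computed in STEP 3.
a_flags = classify(swub=2.05 / tsmw, wsup=0.30 / tsmw)

# <F>: occurs in Seq2..Seq5 (smw 0.95, 0.55, 0.95, 0.55); wsup taken from Table 4.6.
f_flags = classify(swub=(0.95 + 0.55 + 0.95 + 0.55) / tsmw, wsup=0.6875)
```

Under these values <A> passes only the upper-bound check, while <F> passes both, which agrees with the sets WFUB₁ and WS₁ listed in the text.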

STEP 5: The variable r is initially set to 1, where r represents the number of items in the subsequences to be processed.

STEP 6: In this example, the six items A, C, D, E, F, and H are collected from the six 1-patterns in Table 4.5. The possible items are denoted as PI1.

STEP 7: For each sequence in Table 4.3, the items not appearing in the set of PI1 are removed from the sequence. Take the first sequence Seq₁ in Table 4.3 as an example. The first sequence includes the three items B, C, and B. The sequence maximum weight of Seq₁ is 0.20. In this example, since the first and third items in Seq₁ do not appear in the set of PI1, only item C in Seq₁ can be kept. The sequence is thus modified as <C>. The sequence maximum weight of the modified sequence is still 0.20. However, the modified sequence <C> can be removed from Table 4.3 because no 2-subsequences can be generated from it. The other four sequences in Table 4.3 can be processed similarly. The results for all the modified sequences and their sequence maximum weights are shown in Table 4.7.

Table 4.7: All modified sequences and their sequence maximum weights in the example.

Sequence       smw_y
<HFD>          0.95
<ACF(DE)F>     0.55
<AFH>          0.95
<AFDEC>        0.55
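The pruning in STEP 7 can be sketched as below. The representation is an assumption made for illustration: a sequence is a list of itemsets, and each itemset is a string of single-letter items (so <ACF(DE)F> becomes `["A", "C", "F", "DE", "F"]`). The helper name `prune` is hypothetical.

```python
PI1 = set("ACDEFH")  # possible items collected in STEP 6


def prune(sequence, allowed, min_items):
    """Drop items outside `allowed`; return None if too few items remain."""
    pruned = []
    for itemset in sequence:
        kept = "".join(item for item in itemset if item in allowed)
        if kept:  # skip itemsets that became empty
            pruned.append(kept)
    remaining = sum(len(itemset) for itemset in pruned)
    return pruned if remaining >= min_items else None


# At least two items are needed to generate a 2-subsequence.
seq1 = prune(["B", "C", "B"], PI1, min_items=2)
seq3 = prune(["A", "C", "F", "DE", "F"], PI1, min_items=2)
seq4 = prune(["A", "G", "F", "H"], PI1, min_items=2)
```

Here Seq₁ is discarded because only <C> survives, Seq₃ is unchanged, and Seq₄ loses item G, matching Table 4.7.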

STEP 8: Each 1-pattern in the set of WFUB₁ is processed in turn, from the last one to the first one in alphabetical order. The 1-pattern <H> is thus processed first. To simplify the description of the example, assume that the patterns with the five prefix patterns <C>, <D>, <E>, <F> and <H> (that is, all patterns in WFUB₁ except <A>) have already been found by using the Finding-WS(x, sdb_x, r) procedure. The resulting weighted sequential patterns are shown in Table 4.8. In addition, since the filtering strategy requires them, the corresponding weighted frequent upper-bound patterns are shown in Table 4.9.

Table 4.8: Set of weighted sequential patterns with the five patterns as their prefix patterns in the example.

Table 4.9: Set of weighted frequent upper-bound patterns with the five patterns as their prefix patterns in the example.

[…] projected and put in the projected sequences sdb_<A> of <A>. Note that, for each sequence in sdb_<A>, only the items located after pattern <A> are kept. Take the third sequence <AFDEC> in sdb_<A> as an example. Since the four items C, D, E and F are located after pattern <A> in the sequence, the projected sequence is <AFDEC>. After this, sdb_<A> includes the following three projected sequences, <ACF(DE)F>, <AFH> and <AFDEC>, whose sequence maximum weights are 0.55, 0.95 and 0.55, respectively.
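A projection on a single-item prefix can be sketched as follows. Because the start of the passage describing the projection is missing here, the rule used (keep the prefix item and everything after its first occurrence) is an assumption chosen to reproduce the three projected sequences listed above; `project` is a hypothetical helper, and sequences are written as lists of itemset strings.

```python
def project(sequence, item):
    """Return the part of `sequence` from the first itemset containing `item`."""
    for position, itemset in enumerate(sequence):
        if item in itemset:
            return sequence[position:]
    return None  # the sequence does not contain the prefix item


# Modified sequences and their smw values from Table 4.7.
table_4_7 = [
    (["H", "F", "D"], 0.95),
    (["A", "C", "F", "DE", "F"], 0.55),
    (["A", "F", "H"], 0.95),
    (["A", "F", "D", "E", "C"], 0.55),
]

# Projected database of <A>: each entry keeps the smw of its source sequence.
sdb_a = [
    (project(seq, "A"), smw)
    for seq, smw in table_4_7
    if project(seq, "A") is not None
]
```

The sequence <HFD> contains no A and is therefore left out, so the projected database holds three sequences.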

Next, all the weighted sequential patterns with the prefix <A> are found by using the Finding-WS(x, sdb_x, r) procedure with the parameters x = <A> and r = 1. The details of the Finding-WS(x, sdb_x, r) procedure are described below.

PSTEP 1: The temporary subsequence table, TS_<A>, is initialized as an empty table, in which each tuple consists of three fields: subsequence, sequence-weighted upper bound (swub) of the subsequence, and actual weighted support (wsup) of the subsequence.

PSTEP 2: For each projected sequence in sdb_<A>, all possible 2-subsequences with the prefix <A> in it are produced. Take the second projected sequence <AFH> as an example. Two unique 2-subsequences, <AF> and <AH>, can be generated from the sequence <AFH>. These subsequences are put in the TS_<A> table. The weights of the two subsequences (0.325 and 0.525, respectively) and the sequence maximum weight (= 0.95) of the sequence are put in the corresponding fields of the 2-subsequences in the TS_<A> table. The other two sequences in sdb_<A> can be processed similarly. The results for all 2-subsequences with the prefix <A> in sdb_<A> are shown in Table 4.10.
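The generation of 2-subsequences and their weights can be sketched as below. Two assumptions are made and should be read as such: the weight of a subsequence is taken to be the average of its item weights, and the weights F = 0.55 and H = 0.95 are inferred, since together with the stated weight A = 0.10 they reproduce the values 0.325 and 0.525 quoted above. The helper names are hypothetical, and only sequence extensions of a single-item prefix are handled.

```python
# A = 0.10 is stated in the text; F and H are assumed values (see above).
weights = {"A": 0.10, "F": 0.55, "H": 0.95}


def pattern_weight(pattern):
    """Average item weight of a pattern (an assumption that matches the text)."""
    items = [item for itemset in pattern for item in itemset]
    return sum(weights[item] for item in items) / len(items)


def extend_prefix(projected, prefix_item):
    """List the unique 2-subsequences <prefix, x> in one projected sequence."""
    found = []
    for itemset in projected[1:]:  # itemsets located after the prefix
        for item in itemset:
            candidate = [prefix_item, item]
            if candidate not in found:
                found.append(candidate)
    return found


candidates = extend_prefix(["A", "F", "H"], "A")
candidate_weights = [pattern_weight(c) for c in candidates]
```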

Table 4.10: Sequence-weighted upper-bound and actual weighted support values of all 2-subsequences with prefix <A> in the example.

Subsequence swub wsup

PSTEP 3: As mentioned in STEP 4, the weighted frequent upper-bound 2-patterns (WFUB2,<A>) and the weighted sequential 2-patterns (WS<A>) with the prefix <A> in Table 4.10 can be found simultaneously. After this step, the four 2-subsequences <AC>, <AD>, <AE>, and <AF> are put in the set of WFUB2,<A>. Only <AF> in Table 4.10 is put in the set of WS<A>.

PSTEP 4: In this example, only the five items A, C, D, E and F are collected from the set of WFUB2,<A> with the prefix <A>. They are denoted as PI2,<A>.

PSTEP 5: The value of the variable r is updated as 2.

PSTEP 6: The items not appearing in PI2,<A> in each sequence in sdb_<A> are removed from the sequence, as mentioned in STEP 7. The modified sequences in sdb_<A> are shown in Table 4.11.

Table 4.11: All modified sequences in sdb_<A> in the example.

SID     Sequence       smw_y
Seq₁    <ACF(DE)F>     0.55
Seq₂    <AF>           0.95
Seq₃    <AFDEC>        0.55

PSTEP 7: Each 2-pattern in the set of WFUB2,<A> is processed in alphabetical order. The 2-pattern <AC> in the set of WFUB2,<A> is thus processed first. The first and last sequences in sdb_<A> contain <AC>, and their projections, <ACF(DE)F> and <AC>, are put in sdb_<AC>. Since no 2-patterns with the prefix pattern <C> exist in Table 4.9, the three items D, E and F can be removed from the projected database with prefix pattern <AC>, and the sequences are then modified as <AC> and <AC>. Because the number of items kept in each of these sequences is less than 3, no 3-subsequences can be generated from them, and the sequences are removed from the set of projected sequences sdb_<AC>. Next, the weighted sequential patterns with the prefix <AC> are found by recursively invoking the Finding-WS(x, sdb_x, r) procedure with the parameters x = <AC>, sdb_x = sdb_<AC> and r = 2.
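The filtering applied to sdb_<AC> can be sketched as follows. The helper `keep_allowed` is hypothetical, and the set of allowed items is written out by hand from the reasoning above (D, E and F are dropped because Table 4.9 contains no 2-pattern with prefix <C>). Sequences are again lists of itemset strings.

```python
def keep_allowed(sequence, allowed):
    """Remove every item that is not in `allowed`, dropping empty itemsets."""
    result = []
    for itemset in sequence:
        kept = "".join(item for item in itemset if item in allowed)
        if kept:
            result.append(kept)
    return result


r = 2                                   # current pattern length
sdb_ac = [["A", "C", "F", "DE", "F"], ["A", "C"]]
allowed = {"A", "C"}                    # D, E and F are filtered out

filtered = [keep_allowed(seq, allowed) for seq in sdb_ac]

# A sequence must hold at least r + 1 items to yield an (r + 1)-subsequence.
surviving = [seq for seq in filtered if sum(len(s) for s in seq) >= r + 1]
```

Both sequences shrink to <AC>, neither can produce a 3-subsequence, and the projected database is left empty before the recursive call.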

The other patterns in WFUB1 can be recursively processed in the same way until all the 1-patterns in the set of WFUB1 have been done. All the weighted sequential patterns in this example are then found, as shown in Table 4.12.

Table 4.12: Final set of all weighted sequential patterns (WS) in the example.

Pattern    wsup
<F>        68.75%
<H>        59.37%
<AF>       30.46%
<FD>       39.84%
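The weighted supports of <F> and <AF> can be re-derived from the sequences in Table 4.3 with the sketch below. As before, averaging the item weights of a pattern and the weights F = 0.55 and H = 0.95 are assumptions that are consistent with the figures in the text; `contains` and `wsup` are hypothetical helpers, and sequences are lists of itemset strings.

```python
TSMW = 3.20
# A = 0.10 is stated in the text; F and H are assumed values.
weights = {"A": 0.10, "F": 0.55, "H": 0.95}

# The five sequences of Table 4.3.
SDB = [
    ["B", "C", "B"],
    ["H", "F", "D"],
    ["A", "C", "F", "DE", "F"],
    ["A", "G", "F", "H"],
    ["A", "F", "D", "E", "C"],
]


def contains(sequence, pattern):
    """Greedy check that `pattern` is a subsequence of `sequence`."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def wsup(pattern):
    """Pattern weight times the number of containing sequences, over tsmw."""
    items = "".join(pattern)
    weight = sum(weights[item] for item in items) / len(items)
    count = sum(contains(seq, pattern) for seq in SDB)
    return weight * count / TSMW


wsup_f = wsup(["F"])         # reported as 68.75%
wsup_af = wsup(["A", "F"])   # reported as 30.46%
wsup_h = wsup(["H"])         # reported as 59.37% in Table 4.6
```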

STEP 9: In this example, the weighted sequential patterns in Table 4.12 are output to users as auxiliary information for decision-making.

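As this example shows, PWSI removes unpromising items and projected sequences (for instance, all sequences in sdb_<AC>) before further subsequences are generated, which is how it improves the execution efficiency of finding weighted sequential patterns.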

4.3 Experimental Evaluation

A series of experiments was conducted to compare the performance of the proposed projection-based weighted sequential pattern mining algorithm (PWS), the projection-based weighted sequential pattern mining algorithm with improved strategies (PWSI), and the traditional weighted sequential pattern mining approach (WSpan) for various parameter values. All algorithms were implemented in J2SDK 1.6.0 and executed on a PC with a 3.30 GHz CPU and 4 GB of memory. The default values of the parameters used in the experiments are listed in Table 4.13.

Table 4.13: Parameters used in the experiment.

Parameter    Description                                                      Default Value
S            The average length of a sequence                                 6
T            The average number of items per transaction                      6
I            The average length of maximal potentially frequent itemsets      2
N            The total number of items                                        2,000
D            The total number of transactions                                 200,000
min_wsup     The minimum weighted support threshold                           0.18%

4.3.1 Experimental Datasets

As mentioned in Section 3.4, real datasets were difficult to obtain, so the public IBM data generator was again used to generate the required experimental databases. The weight-value distribution of all items in the weight table for weighted sequential pattern mining is shown in Figure 4.1.

Figure 4.1: Weight-value distribution of generated transaction datasets.

Moreover, the real database Foodmart was again adopted to evaluate the practical performance of the proposed algorithms. Since each transaction in Foodmart carries the information of the customer who made the purchase, the transactions of each customer could be listed in time order to form a sequence. The transformed sequences were then used to evaluate the performance of the two proposed algorithms under various parameter settings.
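The transformation from customer-tagged transactions into time-ordered sequences can be sketched as below. The record layout (customer identifier, timestamp, set of items) and the sample values are hypothetical and do not reflect the actual Foodmart schema; the function name is likewise illustrative.

```python
from collections import defaultdict


def to_customer_sequences(transactions):
    """Group (customer_id, timestamp, items) records into time-ordered sequences."""
    by_customer = defaultdict(list)
    for customer_id, timestamp, items in transactions:
        by_customer[customer_id].append((timestamp, tuple(sorted(items))))
    # Sorting the (timestamp, itemset) pairs puts each customer's itemsets in time order.
    return {
        customer_id: [items for _, items in sorted(records)]
        for customer_id, records in by_customer.items()
    }


# Hypothetical records, deliberately out of time order.
records = [
    ("c1", 3, {"milk"}),
    ("c2", 1, {"bread", "eggs"}),
    ("c1", 1, {"bread"}),
]
sequences = to_customer_sequences(records)
```

Each customer then corresponds to one sequence whose elements are the itemsets of that customer's transactions, in the order in which they occurred.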

From the document: A Study of Effective Weighted Data Mining Methods (pp. 88-98)