Construction of the In-memory Summary Data Structure

Chapter 5 Online Mining of Path Traversal Patterns over Web Click-Streams

5.3 The Proposed Algorithm: DSM-PLW

5.3.1 Construction of the In-memory Summary Data Structure

In this section, we propose a new in-memory summary data structure, called SP-forest (Summary Path traversal pattern forest), to store the essential information about the path traversal patterns of each incoming basic window, together with an efficient algorithm for constructing it. A running example then illustrates the construction.

Definition 5-1 A Summary Path traversal pattern forest (abbreviated as SP-forest) is a prefix tree-based summary data structure defined below.

1. SP-forest consists of a list of frequent references (denoted by FR-list), r_1, r_2, …, r_k, where r_i.esup ≥ s⋅N, and a set of Path traversal pattern trees (abbreviated as Path-trees), one for each reference r_i, denoted by r_i.Path-tree, ∀i = 1, 2, …, k.

2. Each node in the r_i.Path-tree, ∀i = 1, 2, …, k, consists of four fields: fr_id, esup, mfr_id, and node-link, where fr_id is the identifier of the incoming forward reference, esup registers the number of maximal forward references represented by the portion of the path reaching the node with that fr_id, mfr_id is the identifier of the maximal forward reference that was current when the node was created, and node-link links the node to the next node with the same fr_id in the SP-forest, or is null if there is none.

3. Each entry ri, ∀i = 1, 2, …, k, in the FR-list consists of four fields: fr_id, esup, mfr_id, and head-link, where fr_id registers the forward reference identifier the entry represents, esup records the number of maximal forward references in the stream so far containing the reference with identifier fr_id, mfr_id assigned to a new entry is the identifier of the current maximal forward reference, and head-link is a pointer pointing to the root node of the fr_id.Path-tree.
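The two record types in Definition 5-1 can be sketched as follows. This is only an illustrative sketch: the class names `PathNode` and `FREntry` are hypothetical, and the `children` mapping is an assumption added so that a node can reach its child nodes, which the definition leaves implicit in the tree structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PathNode:
    """One node of an r_i.Path-tree (the four fields of item 2)."""
    fr_id: str                               # identifier of the forward reference
    esup: int                                # estimated support of the path reaching this node
    mfr_id: int                              # MFR identifier at the time the node was created
    node_link: Optional["PathNode"] = None   # next node with the same fr_id, or None
    # Assumption: child pointers keyed by fr_id (not listed in the definition).
    children: Dict[str, "PathNode"] = field(default_factory=dict)


@dataclass
class FREntry:
    """One entry of the FR-list (the four fields of item 3)."""
    fr_id: str            # reference the entry represents
    esup: int             # number of MFRs seen so far that contain fr_id
    mfr_id: int           # MFR identifier at the time the entry was created
    head_link: PathNode   # root node of the fr_id.Path-tree


# Usage: the entry and root node created when reference "a" first appears in MFR 1.
root_a = PathNode(fr_id="a", esup=1, mfr_id=1)
entry_a = FREntry(fr_id="a", esup=1, mfr_id=1, head_link=root_a)
```

Keeping `children` as a dictionary keyed by fr_id is one possible design choice; it makes the prefix lookup during insertion a single dictionary access.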

Figure 5-2 gives the SP-forest construction algorithm. First, the DSM-PLW algorithm reads a maximal forward reference MFR_i = <r_1, r_2, …, r_j, …, r_m> from the buffer and maintains the SP-forest using MFR-projection(MFR_i). The maintenance process is as follows. For each reference r_j in MFR_i, if r_j already exists in the current FR-list, its estimated support, r_j.esup, is increased by one. Otherwise, a new entry of the form (r_j, 1, i, r_j) is created in the FR-list, where the last field r_j denotes the head-link of r_j and i is the identifier of the current MFR. Next, MFR_i is projected into m reference-suffix maximal forward references (denoted by rs-MFRs) according to the order of the references in MFR_i. This step is called maximal forward reference projection and is denoted by MFR-projection(MFR_i) = {r_1|MFR_i, r_2|MFR_i, …, r_j|MFR_i, …, r_m|MFR_i}, where r_j|MFR_i = <r_j r_{j+1} … r_m>, ∀j = 1, 2, …, m.
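The projection step can be sketched in a few lines. This is a minimal sketch, assuming an MFR is represented as a sequence of one-character reference identifiers; the function name `mfr_projection` is hypothetical.

```python
from typing import List, Sequence, Tuple


def mfr_projection(mfr: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return the m reference-suffix MFRs r_j|MFR_i for j = 1, ..., m."""
    # The j-th rs-MFR is the suffix of the MFR that starts at reference r_j.
    return [tuple(mfr[j:]) for j in range(len(mfr))]


# Usage: project the three-reference MFR <abe>.
for rs_mfr in mfr_projection("abe"):
    print("<" + "".join(rs_mfr) + ">")
# prints <abe>, <be>, <e> on separate lines
```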

For example, the maximal forward reference <acdef> is projected into five reference-suffix maximal forward references: <acdef>, <cdef>, <def>, <ef>, and <f>. Note that the cost of projecting a maximal forward reference of m references is (m²+m)/2, i.e., m + (m−1) + … + 1. Next, each rs-MFR with prefix r_j, ∀j = 1, 2, …, m, is inserted into the corresponding r_j.Path-tree as a branch. If an rs-MFR shares a prefix with an rs-MFR already in the Path-tree, the new rs-MFR shares the prefix of the branch representing the existing one. In addition, an estimated support counter is associated with each node in the Path-tree. The counter of an existing node is incremented each time an inserted rs-MFR passes through it, and a newly created node starts with a counter of one. Figure 5-3 shows the subroutines of SP-forest construction and maintenance.
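The insertion of one rs-MFR into its Path-tree, with prefix sharing, might look like the sketch below. It is an illustration under stated assumptions rather than the thesis implementation: the `Node` class, its `children` dictionary, the `trees` mapping from a reference to the root of its Path-tree, and the function name `insert_rs_mfr` are all hypothetical, and node-links are omitted for brevity.

```python
from typing import Dict, Sequence


class Node:
    """Path-tree node holding fr_id, esup, and mfr_id (node-link omitted)."""

    def __init__(self, fr_id: str, mfr_id: int) -> None:
        self.fr_id = fr_id
        self.esup = 1            # a new node starts with estimated support 1
        self.mfr_id = mfr_id     # identifier of the MFR that created the node
        self.children: Dict[str, "Node"] = {}  # assumption: children keyed by fr_id


def insert_rs_mfr(trees: Dict[str, Node], rs_mfr: Sequence[str], mfr_id: int) -> None:
    """Insert the rs-MFR <r_j ... r_m> into r_j.Path-tree as a branch."""
    head = rs_mfr[0]
    node = trees.get(head)
    if node is None:
        node = trees[head] = Node(head, mfr_id)   # first branch of a new Path-tree
    else:
        node.esup += 1                            # shared root of an existing tree
    for ref in rs_mfr[1:]:
        child = node.children.get(ref)
        if child is None:
            child = node.children[ref] = Node(ref, mfr_id)  # branch diverges here
        else:
            child.esup += 1                                 # shared prefix node
        node = child


# Usage: <acdef> (MFR 1) followed by <abe> (MFR 2) share only the root of a.Path-tree.
trees: Dict[str, Node] = {}
insert_rs_mfr(trees, "acdef", 1)
insert_rs_mfr(trees, "abe", 2)
print(trees["a"].esup, sorted(trees["a"].children))  # prints: 2 ['b', 'c']
```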

Algorithm SP-forest construction

Input: A stream of maximal forward references, MFR_1, MFR_2, …, MFR_N, and a user-defined minimum support threshold s ∈ (0, 1).

Output: An SP-forest so far.

1. FR-list = {}; /* initialize the FR-list to empty */
2. foreach MFR_i = <r_1, r_2, …, r_m> do /* ∀i = 1, 2, …, N, where N is the identifier of the current MFR */
3.   foreach reference r_j ∈ MFR_i do /* ∀j = 1, 2, …, m */
4.     if r_j ∉ FR-list then
5.       create a new entry of form (r_j, 1, i, r_j) in the FR-list;
6.     else
7.       r_j.esup = r_j.esup + 1;
8.     end if
9.   end for
10.  call MFR-projection(MFR_i);
11. end for
12. call SP-pruning(SP-forest, N, s);

Figure 5-2. Algorithm SP-forest construction

Subroutine MFR-projection

Input: A maximal forward reference MFR_i = <r_1, r_2, …, r_j, …, r_m>.

Output: r_j.Path-tree, ∀j = 1, 2, …, m.

1. foreach reference r_j, ∀j = 1, 2, …, m, in MFR_i do
2.   call Path-tree-maintenance(r_j|MFR_i, r_j.Path-tree, i);
3. end for

Subroutine Path-tree-maintenance

Input: A reference-suffix maximal forward reference r_j|MFR_i = <r_j r_{j+1} … r_m>, r_j.Path-tree, and the identifier i of the current maximal forward reference.

Output: A modified r_j.Path-tree, ∀j = 1, 2, …, m.

1. foreach reference r_l, ∀l = j, j+1, …, m, in r_j|MFR_i do
2.   if r_j.Path-tree has a child node y on the current branch such that y.fr_id = r_l.fr_id then
3.     y.esup = y.esup + 1;
4.   else
5.     create a new node of form (r_l, 1, i) in the r_j.Path-tree;
6.   end if
7. end for

Subroutine SP-pruning

Input: An SP-forest, a user-defined minimum support threshold s ∈ (0, 1), and the identifier N of the current maximal forward reference.

Output: An SP-forest containing the set of all path traversal patterns.

1. foreach entry r_j ∈ FR-list do
2.   if r_j.esup < s⋅N then
3.     delete r_j.Path-tree;
4.     delete r_j from FR-list;
5.     delete the sub-trees of every node whose fr_id is r_j in the other r_l.Path-trees (l ≠ j) by traversing the node-links in the SP-forest;
6.   end if
7. end for

Figure 5-3. Subroutines of SP-forest construction algorithm
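The pruning subroutine can be illustrated with the following sketch. Note one deliberate substitution: the subroutine locates nodes of an infrequent reference by following node-links, whereas this sketch uses a plain recursive traversal of every remaining tree, which is simpler to show but slower. The `Node` class, the `fr_esup` dictionary standing in for the FR-list, and the function names are assumptions made for illustration.

```python
from typing import Dict, List


class Node:
    """Path-tree node reduced to the fields needed for pruning."""

    def __init__(self, fr_id: str, esup: int) -> None:
        self.fr_id = fr_id
        self.esup = esup
        self.children: Dict[str, "Node"] = {}  # assumption: children keyed by fr_id


def drop_subtrees(node: Node, fr_id: str) -> None:
    """Delete every sub-tree rooted at a node labelled fr_id below `node`."""
    node.children.pop(fr_id, None)
    for child in node.children.values():
        drop_subtrees(child, fr_id)


def sp_pruning(fr_esup: Dict[str, int], trees: Dict[str, Node], n: int, s: float) -> List[str]:
    """Remove references with esup < s * n from the FR-list and all Path-trees."""
    infrequent = [r for r, esup in fr_esup.items() if esup < s * n]
    for r in infrequent:
        del fr_esup[r]                 # delete r from the FR-list
        trees.pop(r, None)             # delete r.Path-tree
        for root in trees.values():    # delete sub-trees of r in the other trees
            drop_subtrees(root, r)
    return infrequent
```

With the traversal variant, each pruned reference costs one pass over the remaining trees; the node-link chain described in Definition 5-1 avoids that pass by jumping directly from one node of the reference to the next.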

Figure 5-4. SP-forest after processing the first maximal forward reference <acdef>

Figure 5-5. SP-forest after processing the second maximal forward reference <abe>

Figure 5-6. SP-forest after processing the first six maximal forward references

Example 5-1 Let the first six maximal forward references in the stream of Web click-sequences be <acdef>, <abe>, <cef>, <acdf>, <cef>, and <df>, where a, b, c, d, e, and f are Web references. The SP-forests constructed by the DSM-PLW algorithm after processing the first MFR <acdef> and the second MFR <abe> are shown in Figure 5-4 and Figure 5-5, respectively.

Note that the dotted-line arrows in Figure 5-4 are node-links, each of which links a node to the next node with the same fr_id in the current SP-forest. In the subsequent steps, shown in Figure 5-5 through Figure 5-7, the node-links are omitted for conciseness.

First, the DSM-PLW algorithm reads the first maximal forward reference <acdef> from the buffer and projects it into five reference-suffix maximal forward references: <acdef>, <cdef>, <def>, <ef>, and <f>. Next, the algorithm inserts <acdef>, <cdef>, <def>, <ef>, and <f> into the empty trees a.Path-tree, c.Path-tree, d.Path-tree, e.Path-tree, and f.Path-tree, respectively. This step results in a single path in each Path-tree: root(a:1:1) → (a:1:1) → (c:1:1) → (d:1:1) → (e:1:1) → (f:1:1), root(c:1:1) → (c:1:1) → (d:1:1) → (e:1:1) → (f:1:1), root(d:1:1) → (d:1:1) → (e:1:1) → (f:1:1), root(e:1:1) → (e:1:1) → (f:1:1), and root(f:1:1) → (f:1:1). The projected result is shown in Figure 5-4.

Then, DSM-PLW inserts the result of MFR-projection(<abe>), namely <abe>, <be>, and <e>, into a.Path-tree, b.Path-tree, and e.Path-tree, respectively. Hence, <abe> leads to two paths in a.Path-tree with a as the common prefix: root(a:2:1) → (a:2:1) → (c:1:1) → (d:1:1) → (e:1:1) → (f:1:1) and root(a:2:1) → (a:2:1) → (b:1:2) → (e:1:2). Then, <be> results in a single path in b.Path-tree: root(b:1:2) → (b:1:2) → (e:1:2). Finally, the DSM-PLW algorithm inserts <e> into the SP-forest. At this point no new node is created, but the first path of e.Path-tree is changed to root(e:2:1) → (e:2:1) → (f:1:1). The result after processing the second maximal forward reference <abe> is shown in Figure 5-5. The SP-forest after processing all six maximal forward references is given in Figure 5-6.
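The first two steps of Example 5-1 can be replayed with a compact sketch of the construction procedure (without pruning and without node-links). The class and method names (`SPForest`, `add_mfr`, `paths`) and the dictionary-based representation are assumptions made for illustration. Each node is printed as (fr_id:esup:mfr_id), following the labels used in the example.

```python
from typing import Dict, List, Sequence


class Node:
    def __init__(self, fr_id: str, mfr_id: int) -> None:
        self.fr_id, self.esup, self.mfr_id = fr_id, 1, mfr_id
        self.children: Dict[str, "Node"] = {}  # assumption: children keyed by fr_id


class SPForest:
    def __init__(self) -> None:
        self.fr_list: Dict[str, List[int]] = {}  # fr_id -> [esup, mfr_id]
        self.trees: Dict[str, Node] = {}         # fr_id -> root of its Path-tree

    def add_mfr(self, mfr: Sequence[str], mfr_id: int) -> None:
        # Update the FR-list: one count per reference of the MFR.
        for ref in mfr:
            if ref in self.fr_list:
                self.fr_list[ref][0] += 1
            else:
                self.fr_list[ref] = [1, mfr_id]
        # MFR-projection: insert every suffix into the tree of its first reference.
        for j in range(len(mfr)):
            self._insert(mfr[j:], mfr_id)

    def _insert(self, rs_mfr: Sequence[str], mfr_id: int) -> None:
        head = rs_mfr[0]
        node = self.trees.get(head)
        if node is None:
            node = self.trees[head] = Node(head, mfr_id)
        else:
            node.esup += 1
        for ref in rs_mfr[1:]:
            child = node.children.get(ref)
            if child is None:
                child = node.children[ref] = Node(ref, mfr_id)
            else:
                child.esup += 1
            node = child

    def paths(self, fr_id: str) -> List[str]:
        """Root-to-leaf paths of fr_id.Path-tree, one string per path."""
        out: List[str] = []

        def walk(node: Node, prefix: List[str]) -> None:
            prefix = prefix + [f"({node.fr_id}:{node.esup}:{node.mfr_id})"]
            if not node.children:
                out.append(" -> ".join(prefix))
            for child in node.children.values():
                walk(child, prefix)

        walk(self.trees[fr_id], [])
        return out


forest = SPForest()
forest.add_mfr("acdef", 1)
forest.add_mfr("abe", 2)
for line in forest.paths("a") + forest.paths("b") + forest.paths("e"):
    print(line)
# (a:2:1) -> (c:1:1) -> (d:1:1) -> (e:1:1) -> (f:1:1)
# (a:2:1) -> (b:1:2) -> (e:1:2)
# (b:1:2) -> (e:1:2)
# (e:2:1) -> (f:1:1)
```

The printed paths agree with the a.Path-tree, b.Path-tree, and e.Path-tree described in the text for the state after <abe>.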

From the document 在串流資料中高效率頻繁樣式探勘演算法之研究 (A Study of Efficient Algorithms for Mining Frequent Patterns over Data Streams), pages 98-104