Efficient Data Mining for Path Traversal Patterns


Ming-Syan Chen, Senior Member, IEEE, Jong Soo Park, Member, IEEE, and Philip S. Yu, Fellow, IEEE

Abstract—In this paper, we explore a new data mining capability that involves mining path traversal patterns in a distributed information-providing environment where documents or objects are linked together to facilitate interactive access. Our solution procedure consists of two steps. First, we derive an algorithm to convert the original sequence of log data into a set of maximal forward references. By doing so, we can filter out the effect of some backward references, which are mainly made for ease of traveling, and concentrate on mining meaningful user access sequences. Second, we derive algorithms to determine the frequent traversal patterns (i.e., large reference sequences) from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences; one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed. It is shown that the option of selective scan is very advantageous and can lead to prominent performance improvement. Sensitivity analysis on various parameters is conducted.

Index Terms—Data mining, traversal patterns, distributed information system, World Wide Web, performance analysis.


1 INTRODUCTION

Due to the increasing use of computing for various applications, the importance of database mining has been growing at a rapid pace. Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data. Catalog companies can also collect sales data from the orders they receive. It is noted that analysis of past transaction data can provide very valuable information on customer buying behavior, and thus improve the quality of business decisions (such as what to put on sale, which merchandise to place together on shelves, and how to customize marketing programs, to name a few). It is essential to collect a sufficient amount of sales data before any meaningful conclusion can be drawn therefrom. As a result, the amount of these processed data tends to be huge. It is hence important to devise efficient algorithms to conduct mining on these data.

Note that various data mining capabilities have been explored in the literature. One of the most important data mining problems is mining association rules [3], [4], [13], [15]. For example, given a database of sales transactions, it is desirable to discover all associations among items such that the presence of some items in a transaction will imply the presence of other items in the same transaction. Also, mining classification is an approach of trying to develop rules to group data tuples together based on certain common features. This has been explored both in the AI domain [16], [17] and in the context of databases [2], [6], [12]. Mining in spatial databases was conducted in [14]. Another source of data mining is ordered data, such as stock market and point of sales data. Interesting aspects to explore from these ordered data include searching for similar sequences [1], [19], e.g., stocks with similar movement in stock prices, and sequential patterns [5], e.g., grocery items bought over a set of visits in sequence. It is noted that data mining is a very application-dependent issue and different applications will require different mining techniques. Proper problem identification and formulation is therefore a very important part of the whole knowledge discovery process.

In this paper, we shall explore a new data mining capability which involves mining access patterns in a distributed information-providing environment where documents or objects are linked together to facilitate interactive access. Examples of such information-providing environments include the World Wide Web (WWW) [11] and on-line services where users, when seeking information of interest, travel from one object to another via the corresponding facilities (i.e., hyperlinks) provided. Clearly, understanding user access patterns in such environments will not only help improve the system design (e.g., provide efficient access between highly correlated objects, better authoring design for pages, etc.) but also lead to better marketing decisions (e.g., putting advertisements in proper places, better customer/user classification and behavior analysis, etc.). Capturing user access patterns in such environments is referred to as mining traversal patterns in this paper. Note that although some efforts have elaborated upon analyzing user behavior [8], [9], [10], there is little result reported on dealing with the algorithmic aspects to improve the execution of traversal pattern mining. This can be explained in part by the fact that these information-providing services, though with great potential, are mostly in their infancy and their customer analysis may still remain at a coarser level such as user occupation/age study. In addition, it is important to note that, since users are traveling along the information-providing services to search for the desired information, some objects are visited because of their locations rather than their content, showing the very difference between the traversal pattern problem and others which are mainly based on customer transactions. This unique feature of the traversal pattern problem unavoidably increases the difficulty of extracting meaningful information from a sequence of traversal data. However, as these information-providing services are becoming increasingly popular, there is a growing demand for capturing user behavior and improving the quality of such services. As a result, the problem of mining traversal patterns has become too important not to address immediately.

• M.-S. Chen is with the Electrical Engineering Department, National Taiwan University, Taipei, Taiwan, Republic of China. E-mail: mschen@cc.ee.ntu.edu.tw.

• J.S. Park is with the Department of Computer Science, Sungshin Women’s University, Seoul, Korea. E-mail: jpark@cs.sungshin.ac.kr.

• P.S. Yu is with the IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598. E-mail: psyu@watson.ibm.com.

Manuscript received 8 Aug. 1996.

For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number 104467.

Consequently, we shall explore in this paper the problem of mining traversal patterns. Our solution procedure consists of two steps. First, we derive an algorithm, called algorithm MF (standing for maximal forward references), to convert the original sequence of log data into a set of traversal subsequences. As defined in Section 2, each traversal subsequence represents a maximal forward reference from the starting point of a user access. As will be explained later, this step of converting the original log sequence into a set of maximal forward references will filter out the effect of backward references, which are mainly made for ease of traveling, and enable us to concentrate on mining meaningful user access sequences. Second, we derive algorithms to determine the frequent traversal patterns, termed large reference sequences, from the maximal forward references obtained above, where a large reference sequence is a reference sequence that appears a sufficient number of times in the database. Note that the problem of finding large reference sequences is similar to that of finding large itemsets for association rules [3], where a large itemset is a set of items appearing in a sufficient number of transactions. However, they are different from each other in that a reference sequence in mining traversal patterns has to consist of consecutive references in a maximal forward reference, whereas a large itemset in mining association rules is just a combination of items in a transaction. As a consequence, although several schemes for mining association rules have been reported in the literature [3], [4], [15], the very difference between these two problems calls for the design of new algorithms for determining large reference sequences.

Explicitly, we utilize two algorithms for determining large reference sequences. The first one, referred to as the full-scan (FS) algorithm, essentially utilizes some techniques on hashing and pruning while solving the discrepancy between traversal patterns and association rules mentioned above. Although it trims the transaction database as it proceeds to later passes, algorithm FS is required to scan the transaction database in each pass. In contrast, by properly utilizing the candidate reference sequences, the second algorithm devised, referred to as the selective-scan (SS) algorithm, is able to avoid database scans in some passes so as to reduce the disk I/O cost involved. Specifically, algorithm SS has the option of using a candidate reference set to generate subsequent candidate reference sets, and delaying the determination of large reference sets to a later pass when the database is scanned. Since SS does not scan the database to obtain a large reference set in each pass, some database scans are saved. It is noted that, although the concept of selective scan was used in [15] for mining association rules, its implementation and performance implication are different when it is employed for mining path traversal patterns. Experimental studies are conducted by using a synthetic workload that is generated based on referencing some logged traces, and performance of these two methods, FS and SS, is comparatively analyzed. It is shown that the option of selective scan is very advantageous and algorithm SS thereby outperforms algorithm FS in general. Sensitivity analysis on various parameters is also conducted.

This paper is organized as follows. Problem formulation is given in Section 2. Algorithm MF to identify maximal forward references is described in Section 3.1, and two algorithms, FS and SS, for determining large reference sequences are given in Section 3.2. Performance results are presented in Section 4. Section 5 contains the summary.

2 PROBLEM FORMULATION

As pointed out earlier, in an information-providing environment where objects are linked together, users are apt to traverse objects back and forth in accordance with the links and icons provided. As a result, some node might be revisited because of its location, rather than its content. For example, in a WWW environment, to reach a sibling node a user is usually inclined to use the “backward” icon and then a forward selection, instead of opening a new URL. Consequently, to extract meaningful user access patterns from the original log database, we naturally want to take into consideration the effect of such backward traversals and discover the real access patterns of interest. In view of this, we assume in this paper that a backward reference is mainly made for ease of traveling but not for browsing, and concentrate on the discovery of forward reference patterns. Specifically, a backward reference means revisiting a previously visited object by the same user access. When a backward reference occurs, a forward reference path terminates. This resulting forward reference path is termed a maximal forward reference. After a maximal forward reference is obtained, we backtrack to the starting point of the next forward reference and resume another forward reference path.

While deferring the formal description of the algorithm to determine maximal forward references (i.e., algorithm MF) to Section 3.1, we give an illustrative example for maximal forward references below. Suppose the traversal log contains the following traversal path for a user: {A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V}, as shown in Fig. 1. Then, it can be verified by algorithm MF that the set of maximal forward references for this user is {ABCD, ABEGH, ABEGW, AOU, AOV}. After maximal forward references for all users are obtained, we then map the problem of finding frequent traversal patterns into the one of finding frequently occurring consecutive subsequences among all maximal forward references. A large reference sequence is a reference sequence that appears a sufficient number of times. In a set of maximal forward references, the number of times a reference sequence has to appear in order to qualify as a large reference sequence is called the minimal support. A large $k$-reference is a large reference sequence with $k$ elements. We denote the set of large $k$-references as $L_k$ and its candidate set as $C_k$, where $C_k$, as obtained from $L_{k-1}$ [4], contains those $k$-references that may appear in $L_k$. Explicitly, $C_k$ is a superset of $L_k$.

Fig. 1. An illustrative example for traversal patterns.

It is worth mentioning that after large reference sequences are determined, maximal reference sequences can then be obtained in a straightforward manner. A maximal reference sequence is a large reference sequence that is not contained in any other maximal reference sequence. For example, suppose that {AB, BE, AD, CG, GH, BG} is the set of large two-references (i.e., $L_2$) and {ABE, CGH} is the set of large three-references (i.e., $L_3$). Then, the resulting maximal reference sequences are AD, BG, ABE, and CGH. A maximal reference sequence corresponds to a “hot” access pattern in an information-providing service. In all, the entire procedure for mining traversal patterns can be summarized as follows.

Procedure for mining traversal patterns:

Step 1: Determine maximal forward references from the original log data.

Step 2: Determine large reference sequences (i.e., $L_k$, $k \geq 1$) from the set of maximal forward references.

Step 3: Determine maximal reference sequences from large reference sequences.

Since the extraction of maximal reference sequences from large reference sequences (i.e., Step 3) is straightforward, we shall henceforth focus on Steps 1 and 2, and devise algorithms for the efficient determination of large reference sequences.
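As an illustration of why Step 3 is straightforward, the following minimal Python sketch filters out every large reference sequence that is contained, as a consecutive subsequence, in another large reference sequence. The function names are hypothetical, and node IDs are assumed to be single characters so that a sequence can be written as a string. Because containment is transitive, checking against all other large sequences gives the same result as checking against maximal ones only.

```python
def contains(seq, sub):
    """True if `sub` occurs in `seq` as a consecutive subsequence."""
    k = len(sub)
    return any(seq[i:i + k] == sub for i in range(len(seq) - k + 1))


def maximal_reference_sequences(large_sets):
    """Keep the large reference sequences not contained in any other one.

    `large_sets` is an iterable of collections L_1, L_2, ... (hypothetical
    input layout chosen for this sketch).
    """
    all_large = [s for level in large_sets for s in level]
    maximal = [s for s in all_large
               if not any(s != t and contains(t, s) for t in all_large)]
    return sorted(maximal, key=lambda s: (len(s), s))


# Example from the text: L_2 and L_3.
L2 = ["AB", "BE", "AD", "CG", "GH", "BG"]
L3 = ["ABE", "CGH"]
print(maximal_reference_sequences([L2, L3]))  # ['AD', 'BG', 'ABE', 'CGH']
```

The quadratic comparison is adequate for a sketch; the number of large reference sequences is typically far smaller than the log itself.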

3 ALGORITHM FOR TRAVERSAL PATTERN

We shall describe in Section 3.1 algorithm MF, which converts the original traversal sequence into a set of maximal forward references. Then, by mapping the problem of finding frequent traversal patterns into the one of finding frequent consecutive subsequences, we develop two algorithms, called full-scan (FS) and selective-scan (SS), for mining traversal patterns.

3.1 Identifying Maximal Forward References

In general, a traversal log database contains, for each link traversed, a pair of (source, destination). This part of the log database is called the referer log [7]. For the beginning of a new path, which is not linked to the previous traversal, the source node is null. Given a traversal sequence $\{(s_1, d_1), (s_2, d_2), \ldots, (s_n, d_n)\}$ of a user, we shall map it into multiple subsequences, each of which represents a maximal forward reference. The algorithm for finding all maximal forward references is given as follows. First, the traversal log database is sorted by user IDs, resulting in a traversal path, $\{(s_1, d_1), (s_2, d_2), \ldots, (s_n, d_n)\}$, for each user, where the pairs $(s_i, d_i)$ are ordered by time. Algorithm MF is then applied to each user path to determine all of its maximal forward references. Let $D_F$ denote the database to store all the resulting maximal forward references obtained.

Algorithm MF: An algorithm to find maximal forward references

Step 1: Set $i = 1$ and string $Y$ to null for initialization, where string $Y$ is used to store the current forward reference path. Also, set the flag $F = 1$ to indicate a forward traversal.

Step 2: Let $A = s_i$ and $B = d_i$.
    If $A$ is equal to null then /* this is the beginning of a new traversal */
    begin
        Write out the current string $Y$ (if not null) to the database $D_F$;
        Set string $Y = B$;
        Go to Step 5.
    end

Step 3: If $B$ is equal to some reference (say the $j$th reference) in string $Y$ then /* this is a cross-referencing back to a previous reference */
    begin
        If $F$ is equal to 1 then write out string $Y$ to database $D_F$;
        Discard all the references after the $j$th one in string $Y$;
        $F = 0$;
        Go to Step 5.
    end

Step 4: Otherwise, append $B$ to the end of string $Y$. /* we are continuing a forward traversal */
    If $F$ is equal to 0, set $F = 1$.

Step 5: Set $i = i + 1$. If the sequence is not completely scanned then go to Step 2.


Consider the traversal scenario in Fig. 1 for example. It can be verified that the first backward reference is encountered in the fourth move (i.e., from D to C). At that point, the maximal forward reference ABCD is written to $D_F$ (by Step 3). In the next move (i.e., from C to B), although the first conditional statement in Step 3 is again true, nothing is written to $D_F$ since the flag $F = 0$, meaning that it is in a reverse traversal. The subsequent forward references will put ABEGH into the string $Y$, which is then written to $D_F$ when a reverse reference (from H to G) is encountered. The execution scenario by algorithm MF for the input in Fig. 1 is given in Table 1.

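TABLE 1
AN EXAMPLE EXECUTION BY ALGORITHM MF

| move | string Y | output to $D_F$ |
|------|----------|-----------------|
| 1    | AB       | -               |
| 2    | ABC      | -               |
| 3    | ABCD     | -               |
| 4    | ABC      | ABCD            |
| 5    | AB       | -               |
| 6    | ABE      | -               |
| 7    | ABEG     | -               |
| 8    | ABEGH    | -               |
| 9    | ABEG     | ABEGH           |
| 10   | ABEGW    | -               |
| 11   | A        | ABEGW           |
| 12   | AO       | -               |
| 13   | AOU      | -               |
| 14   | AO       | AOU             |
| 15   | AOV      | AOV (end)       |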
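The steps of algorithm MF can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name is hypothetical, node IDs are assumed to be single-character strings, a null source is represented by `None`, and the final flush of string Y at the end of the input is an assumption inferred from the "(end)" row of Table 1, since Step 5 does not state it explicitly.

```python
def maximal_forward_references(pairs):
    """Algorithm MF over one user's time-ordered (source, destination) pairs.

    A source of None marks the beginning of a new traversal path.
    """
    out = []        # plays the role of the database D_F
    y = []          # string Y: the current forward reference path
    forward = True  # flag F (True corresponds to F = 1)
    for src, dst in pairs:
        if src is None:                 # Step 2: a new traversal begins
            if y:
                out.append("".join(y))
            y = [dst]
        elif dst in y:                  # Step 3: backward reference
            if forward:
                out.append("".join(y))
            del y[y.index(dst) + 1:]    # discard references after the jth one
            forward = False
        else:                           # Step 4: forward reference
            y.append(dst)
            forward = True
    # Assumption: write out the last forward path once the input is exhausted.
    if y and forward:
        out.append("".join(y))
    return out


# The traversal path of Fig. 1, expressed as (source, destination) pairs.
visits = "ABCDCBEGHGWAOUOV"
pairs = [(None, visits[0])] + list(zip(visits, visits[1:]))
print(maximal_forward_references(pairs))
# ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']
```

The output matches the third column of Table 1, read from top to bottom.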

It is noted that in some cases, the traversal log record obtained only contains the destination references instead of a pair of references. For example, for WWW browsing, the request message may only contain the destination URL. The traversal sequence will then have the form $\{d_1, d_2, \ldots, d_n\}$ for each user. Even with such an input, we can still convert it into a set of maximal forward references. The only difference is that in this case we cannot identify the breakpoint where the user picks a new URL to begin a new traversal path, meaning that two consecutive maximal forward references, e.g., ABEH and WXYZ, may be treated as one path, i.e., ABEHWXYZ. Certainly, this constraint, i.e., the absence of source node IDs, could increase the computational complexity because the paths considered become longer. However, this constraint should have little effect on identifying frequent reference subsequences. Since there is no logical link between H and W, a subsequence containing HW is unlikely to occur frequently. Hence, a reference containing the pattern HW is unlikely to emerge as a large reference later. Therefore, algorithm MF can in fact be employed for the case when the IDs of source nodes are not available.
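The destination-only case can be sketched as a simplified variant in which Step 2 never fires, because no null source is ever observed. The function name is hypothetical, node IDs are assumed to be single characters, and the final write of the last path is again an assumption rather than something the algorithm text states.

```python
def mf_from_destinations(dests):
    """Maximal forward references from a destination-only sequence d_1..d_n."""
    out, y, forward = [], [], True
    for d in dests:
        if d in y:                      # revisit: a backward reference
            if forward:
                out.append("".join(y))
            del y[y.index(d) + 1:]      # keep the path up to the revisited node
            forward = False
        else:                           # unseen node: extend the forward path
            y.append(d)
            forward = True
    if y and forward:                   # assumption: flush the last path
        out.append("".join(y))
    return out


print(mf_from_destinations("ABCDCBEGHGWAOUOV"))
# ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']
print(mf_from_destinations("ABEHWXYZ"))
# ['ABEHWXYZ']
```

The second call shows the merged path ABEHWXYZ discussed above: without source IDs, the jump from H to W cannot be recognized as the start of a new traversal.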

3.2 Determining Large Reference Sequences

Once the database containing all maximal forward references for all users, $D_F$, is constructed, we can derive the frequent traversal patterns by identifying the frequently occurring reference sequences in $D_F$. A sequence $s_1, \ldots, s_n$ is said to contain $r_1, \ldots, r_k$ as a consecutive subsequence if there exists an $i$ such that $s_{i+j} = r_j$, for $1 \leq j \leq k$. For example, BAHPM is said to contain AHP. A sequence of $k$ references, $r_1, \ldots, r_k$, is called a large $k$-reference sequence if there are a sufficient number of users with maximal forward references in $D_F$ containing $r_1, \ldots, r_k$ as a consecutive subsequence.
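The containment test and the resulting support count can be sketched as follows. The names are hypothetical, and as a simplification the count is taken over maximal forward references (paths) rather than over distinct users, so several paths of the same user each contribute to the count.

```python
def contains(seq, sub):
    """True if there is an offset i with seq[i + j] == sub[j] for all j."""
    k = len(sub)
    return any(seq[i:i + k] == sub for i in range(len(seq) - k + 1))


def support_count(paths, sub):
    """Number of maximal forward references containing `sub` consecutively."""
    return sum(1 for p in paths if contains(p, sub))


print(contains("BAHPM", "AHP"))   # True
print(contains("BAHPM", "AP"))    # False: A and P are not adjacent

d_f = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
print(support_count(d_f, "AB"))   # 3
print(support_count(d_f, "BEG"))  # 2
```

The second `contains` call is the point of difference from itemset mining: AP would count as a subset of the items in BAHPM, but it is not a consecutive subsequence.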

As pointed out before, the problem of finding large reference sequences is different from that of finding large itemsets for association rules and thus calls for the design of new algorithms. Consequently, we shall derive in this paper two algorithms for mining traversal patterns. The first one, called the full-scan (FS) algorithm, essentially utilizes the concept of DHP [15] (i.e., hashing and pruning) while solving the discrepancy between traversal patterns and association rules. DHP has two major features in determining association rules: one is efficient generation of large itemsets and the other is effective reduction of the transaction database size after each scan. Although it trims the database as it proceeds to later passes, FS is required to scan the database in each pass. In contrast, by properly utilizing the candidate reference sequences, the second algorithm, referred to as the selective-scan (SS) algorithm, is improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required.

3.2.1 Algorithm on Full Scan (FS)

Algorithm FS utilizes key ideas of the DHP algorithm. The details of DHP can be found in [15]. An example scenario for determining large itemsets and candidate itemsets is given in the Appendix.¹ As shown in [15], by utilizing a hash technique, DHP is very efficient for the generation of candidate itemsets, in particular for the large two-itemsets, thus greatly improving the performance bottleneck of the whole process. In addition, DHP employs effective pruning techniques to progressively reduce the transaction database size.

Recall that $L_k$ represents the set of all large $k$-references and $C_k$ is a set of candidate $k$-references. $C_k$ is in general a superset of $L_k$. By scanning through $D_F$, FS gets $L_1$ and makes a hash table (i.e., $H_2$) to count the number of occurrences of each two-reference. Similarly to DHP, starting with $k = 2$, FS generates $C_k$ based on the hash table count obtained in the previous pass, determines the set of large $k$-references, reduces the size of the database for the next pass, and makes a hash table to determine the candidate $(k + 1)$-references. Note that as in mining association rules, a set of candidate references, $C_k$, can be generated from joining $L_{k-1}$ with itself, denoted by $L_{k-1} * L_{k-1}$.² However, due to the difference between traversal patterns and association rules, we modify this approach as follows. For any two distinct reference sequences in $L_{k-1}$, say $r_1, \ldots, r_{k-1}$ and $s_1, \ldots, s_{k-1}$, we join them together to form a $k$-reference sequence only if either $r_1, \ldots, r_{k-1}$ contains $s_1, \ldots, s_{k-2}$ or $s_1, \ldots, s_{k-1}$ contains $r_1, \ldots, r_{k-2}$ (i.e., after dropping the first element in one sequence and the last element in the other sequence, the remaining two $(k - 2)$-references are identical). We note that when $k$ is small (especially for the case of $k = 2$), deriving $C_k$ by joining $L_{k-1}$ with itself will result in a very large number of candidate references and the hashing technique is thus very helpful for such a case. As $k$ increases, the size of $L_{k-1} * L_{k-1}$ can decrease significantly. As in [15], we found that it is generally beneficial for FS to generate $C_k$ directly from $L_{k-1} * L_{k-1}$ (i.e., without using hashing) once $k \geq 3$.

1. In this example, the technique of hashing, which is employed by DHP to reduce the number of candidate itemsets, is not shown.

2. This approach of generating $C_k$ directly from $L_{k-1}$ is proposed by algorithm Apriori in [4] in generating candidate itemsets for association rules.
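The modified join can be sketched in a few lines of Python. The function name is hypothetical and sequences are assumed to be strings of single-character node IDs (tuples would work equally well with the same slicing). Two distinct $(k-1)$-references r and s are joined when r without its first element equals s without its last element.

```python
def join_references(large_prev):
    """Candidate k-references obtained by joining L_{k-1} with itself."""
    candidates = set()
    for r in large_prev:
        for s in large_prev:
            # Overlap test: drop the first element of r and the last of s.
            if r != s and r[1:] == s[:-1]:
                candidates.add(r + s[-1:])
    return candidates


L2 = {"AB", "BE", "AD", "CG", "GH", "BG"}
print(sorted(join_references(L2)))  # ['ABE', 'ABG', 'BGH', 'CGH']
```

With the $L_2$ used earlier in Section 2, the join yields four candidate three-references, of which only ABE and CGH were large in that example. For $k = 2$ the overlap is empty, so every ordered pair of distinct large one-references becomes a candidate, which is the case where the text notes that hashing helps most.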

To count the occurrences of each $k$-reference in $C_k$ to determine $L_k$, we need to scan through a trimmed version of database $D_F$. From the set of maximal forward references, we determine, among the $k$-references in $C_k$, the large $k$-references. After the scan of the entire database, those $k$-references in $C_k$ with count exceeding the threshold become $L_k$. If $L_k$ is nonempty, the iteration continues for the next pass, i.e., pass $k + 1$. As in DHP, every time the database is scanned, the database is trimmed by FS to improve the efficiency of future scans.
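The level-wise structure of FS, one database scan per pass, can be sketched as below. This is a deliberately simplified illustration and not the FS algorithm itself: the DHP-style hash table and the database trimming are both omitted, the function names are hypothetical, support is counted per path, and a count greater than or equal to `min_count` is assumed to qualify.

```python
from collections import Counter


def contains(path, sub):
    k = len(sub)
    return any(path[i:i + k] == sub for i in range(len(path) - k + 1))


def join_references(large_prev):
    return {r + s[-1:] for r in large_prev for s in large_prev
            if r != s and r[1:] == s[:-1]}


def full_scan(paths, min_count):
    """Return ({k: L_k}, number of database scans); no hashing, no trimming."""
    scans = 1                                   # first scan determines L_1
    ones = Counter(n for p in paths
                   for n in {p[i:i + 1] for i in range(len(p))})
    large = {1: {n for n, c in ones.items() if c >= min_count}}
    k = 2
    while True:
        candidates = join_references(large[k - 1])      # C_k
        if not candidates:
            break
        scans += 1                              # one scan per pass
        counts = Counter(c for p in paths for c in candidates
                         if contains(p, c))
        large_k = {c for c in candidates if counts[c] >= min_count}
        if not large_k:
            break
        large[k] = large_k
        k += 1
    return large, scans


d_f = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
large, scans = full_scan(d_f, min_count=2)
print(large[4], scans)  # {'ABEG'} 4
```

On the five maximal forward references of Fig. 1 with a minimum count of 2, the sketch needs four scans to find large references up to length four.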

3.2.2 Algorithm on Selective Scan (SS)

Algorithm SS is similar to algorithm FS in that it also employs hashing and pruning techniques to reduce both CPU and I/O costs, but is different from the latter in that algorithm SS, by properly utilizing the information in candidate references in prior passes, is able to avoid database scans in some passes, thus further reducing the disk I/O cost. The method for SS to avoid some database scans and reduce disk I/O cost is described below. Recall that algorithm FS generates a small number of candidate two-references by using a hashing technique. In fact, this small $C_2$ can be used to generate the candidate three-references. Clearly, a $C'_3$ generated from $C_2 * C_2$, instead of from $L_2 * L_2$, will have a size greater than $|C_3|$, where $C_3$ is generated from $L_2 * L_2$. However, if $|C'_3|$ is not much larger than $|C_3|$, and both $C_2$ and $C'_3$ can be stored in the main memory, we can find $L_2$ and $L_3$ together when the next scan of the database is performed, thereby saving one round of database scan. It can be seen that using this concept, one can determine all $L_k$s by as few as two scans of the database (i.e., one initial scan to determine $L_1$ and a final scan to determine all other large reference sequences), assuming that $C'_k$ for $k \geq 3$ is generated from $C'_{k-1}$ and all $C'_k$s for $k > 2$ can be kept in the memory.

Note that when the minimum support is relatively small or potentially large references are long, $C_k$ and $L_k$ could become large. With $C'_{k+1}$ being generated from $C'_k * C'_k$, if $|C'_{k+1}| > |C'_k|$ for $k \geq 2$, then it may cost too much CPU time to generate all subsequent $C'_j$, $j > k + 1$, from candidate sets of large references since the size of $C'_j$ may become huge quickly, thus compromising all the benefit from saving disk I/O cost. For the illustrative example in the Appendix, if $C_3$ was determined from $C_2 * C_2$, instead of from $L_2 * L_2$, then $C_3$ would be {{ABC}, {ABE}, {ACE}, {BCE}}. This fact suggests that a timely database scan to determine large reference sequences will in fact pay off. After a database scan, one can obtain the large reference sequences which are not determined thus far (say, up to $L_m$) and then construct the set of candidate $(m + 1)$-references, $C_{m+1}$, based on $L_m$ from that point. According to our experiments, we found that if $|C'_{k+1}| > |C'_k|$ for some $k \geq 2$, it is usually beneficial to have a database scan to obtain $L_{k+1}$ before the set of candidate references becomes too big. (As in FS, each time the database is scanned, the database is trimmed by SS to improve the efficiency of future scans.) We then derive $C'_{k+2}$ from $L_{k+1}$. (We note that $C'_{k+2}$ is in fact equal to $C_{k+2}$ here.) After that, we again use $C'_j$ to derive $C'_{j+1}$ for $j \geq k + 2$. The process continues until the set of candidate $(j + 1)$-references becomes empty.
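The scan-skipping rule can be sketched as follows. As with any short illustration, this is a simplification under stated assumptions rather than the SS algorithm itself: hashing and database trimming are omitted, all pending candidate sets are assumed to fit in memory, support is counted per path with `>= min_count`, and all function names are hypothetical. Candidate sets are chained ($C'_{k+1}$ from $C'_k * C'_k$) until the next set is empty or larger than its predecessor; one scan then resolves every pending level.

```python
from collections import Counter


def contains(path, sub):
    k = len(sub)
    return any(path[i:i + k] == sub for i in range(len(path) - k + 1))


def join_references(prev):
    return {r + s[-1:] for r in prev for s in prev
            if r != s and r[1:] == s[:-1]}


def selective_scan(paths, min_count):
    """Return ({k: L_k}, number of database scans)."""
    scans = 1                                     # initial scan gives L_1
    ones = Counter(n for p in paths
                   for n in {p[i:i + 1] for i in range(len(p))})
    large = {1: {n for n, c in ones.items() if c >= min_count}}
    k = 2                                         # length stored in pending[0]
    pending = [join_references(large[1])]         # C_2 (no hashing here)
    while pending[0]:
        while True:                               # chain C'_{j+1} from C'_j
            nxt = join_references(pending[-1])
            if not nxt:
                break
            pending.append(nxt)
            if len(nxt) > len(pending[-2]):       # growth: time to scan
                break
        scans += 1                                # one scan, all pending levels
        union = [c for level in pending for c in level]
        counts = Counter(c for p in paths for c in union if contains(p, c))
        last = set()
        for offset, level in enumerate(pending):
            last = {c for c in level if counts[c] >= min_count}
            if not last:
                break
            large[k + offset] = last
        if not nxt or not last:                   # nothing longer can be large
            break
        k += len(pending)
        pending = [join_references(last)]         # restart from the newest L
    return large, scans


d_f = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
large, scans = selective_scan(d_f, min_count=2)
print(large[4], scans)  # {'ABEG'} 3
```

Because each $C'_k$ is a superset of the corresponding $C_k$, the large reference sequences found are the same as those of a pass-by-pass scan; on this small input the sketch uses three scans where one scan per pass would need four.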

Illustrative examples for FS and SS are given in Table 2, where the number of reference paths $|D|$ = 200,000 and the minimum support $s$ = 0.75 percent. Extensive experiments are conducted in Section 4. In this example run, FS performs a database scan in each pass to determine the corresponding large reference sequences, resulting in six database scans. On the other hand, SS scans the database only three times (skipping database scans in passes 2, 4, and 5), and is able to obtain the same result. The CPU and disk I/O times for FS are 19.48 seconds and 30.8 seconds, respectively, whereas those for SS are 18.75 seconds and 17.8 seconds, respectively. Considering both CPU and I/O times, the execution time ratio for SS to FS is 0.73, showing that the concept of selective scan is useful not only for mining association rules [15] but also for mining path traversal patterns.

4 PERFORMANCE RESULTS

To assess the performance of FS and SS, we conducted several experiments to determine large reference sequences by using an RS/6000 workstation, model 560. The methods used to generate synthetic data are described in Section 4.1. Performance comparison of these two methods is given in Section 4.2. Sensitivity analysis is conducted in Section 4.3.

TABLE 2
RESULTS FROM AN EXAMPLE RUN BY FS AND SS

| Algorithm | k     | 1       | 2       | 3       | 4      | 5      | 6       | time (sec)  |
|-----------|-------|---------|---------|---------|--------|--------|---------|-------------|
| FS        | $C_k$ |         | 121     | 84      | 58     | 22     | 3       |             |
| FS        | $L_k$ | 94      | 91      | 84      | 58     | 21     | 3       | 19.48 (CPU) |
| FS        | $D_k$ | 12.8 MB | 12.8 MB | 12.2 MB | 5.3 MB | 1.9 MB | 0.26 MB | 30.80 (I/O) |
| SS        | $C_k$ |         | 121     | 144     | 58     | 22     | 3       |             |
| SS        | $L_k$ | 94      | 91      | 84      | 58     | 21     | 3       | 18.75 (CPU) |
| SS        | $D_k$ | 12.8 MB | -       | 12.8 MB | -      | -      | 5.3 MB  | 17.80 (I/O) |

4.1 Generation of Synthetic Traversal Paths

In our experiment, the browsing scenario in a World Wide Web (WWW) environment is simulated. To generate a synthetic workload and determine the values of parameters, we referenced some logged traces which were collected from a gateway in our work location [18]. First, a traversal tree is constructed to mimic the WWW structure, whose starting position is the root node of the tree. The traversal tree consists of internal nodes and leaf nodes. Fig. 2a shows an example of the traversal tree. The number of child nodes at each internal node, referred to as fanout, is determined from a uniform distribution within a given range. The height of a subtree whose subroot is a child node of the root node is determined from a Poisson distribution with mean $m_h$. Then, the height of a subtree whose subroot is a child of an internal node $N_i$ is determined from a Poisson distribution with mean equal to a fraction of the maximum height of the internal node $N_i$. As such, the height of a tree is controlled by the value of $m_h$.

A traversal path consists of nodes accessed by a user. The size of each traversal path is picked from a Poisson distribution with mean equal to $|P|$. With the first node being the root node, a traversal path is generated probabilistically within the traversal tree as follows. For each internal node, we determine which is the next hop according to some predetermined probabilities. Essentially, each edge connecting to an internal node is assigned a weight. This weight corresponds to the probability that the edge will be accessed next by the user. As shown in Fig. 2b, the weight to its parent node is assigned $p_0$, which is generally $\frac{1}{n+1}$, where $n$ is the number of child nodes. The probability of traveling to each child node, $p_i$, is determined from an exponential distribution with unit mean, and is so normalized that the sum of the weights for all child nodes is equal to $1 - p_0$. Some internal nodes in the tree allow internal jumps which can go to any other nodes. If an internal node has an internal jump and the weight for this jump is $p_j$, then $p_0$ is changed to $p_0(1 - p_j)$ and the corresponding probability for each child node is changed to $p_i(1 - p_j)$ such that the sum of all the probabilities associated with this node remains one. When the path arrives at a leaf node, the next move would be either backward to its parent node (with a probability 0.25) or to any internal node (with an aggregate probability 0.75). The number of internal nodes with internal jumps is denoted by $N_J$, which is set to 3 percent of all the internal nodes in general cases. The sensitivity of varying $N_J$ will also be analyzed. Those nodes with internal jumps are decided randomly among all the internal nodes. Table 3 summarizes the meaning of various parameters used in our simulations.
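The weight assignment at a single internal node can be sketched as follows. The function name and its parameters are hypothetical; the sketch only covers the probabilities described above ($p_0$, the normalized child weights $p_i$, and the rescaling by $1 - p_j$ when the node has an internal jump), not the full path generator.

```python
import random


def edge_weights(n_children, p_jump=0.0, rng=random):
    """Return (p0, child_weights, p_jump) for one internal node.

    p0 is the backward weight to the parent, child_weights are the p_i,
    and p_jump is the weight of the internal jump (0 if the node has none).
    """
    p0 = 1.0 / (n_children + 1)
    # Draw raw child weights from an exponential distribution with unit mean.
    raw = [rng.expovariate(1.0) for _ in range(n_children)]
    # Normalize so that the child weights sum to 1 - p0.
    scale = (1.0 - p0) / sum(raw)
    children = [w * scale for w in raw]
    # With an internal jump, shrink p0 and every p_i by the factor (1 - p_jump).
    p0 *= (1.0 - p_jump)
    children = [w * (1.0 - p_jump) for w in children]
    return p0, children, p_jump


p0, children, pj = edge_weights(4, p_jump=0.1, rng=random.Random(0))
print(round(p0, 2), round(p0 + sum(children) + pj, 6))  # 0.18 1.0
```

For a node with four children and $p_j = 0.1$, the backward weight becomes $0.2 \times 0.9 = 0.18$, and the backward, child, and jump weights together sum to one.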

4.2 Performance Comparison between FS and SS

Fig. 3 represents execution times of the two methods, FS and SS, when $|D|$ = 200,000, $N_J$ = 3 percent, and $p_j$ = 0.1. HxPy means that x is the height of a tree and y is the average size of the reference paths. D200K means that the number of reference paths is 200,000. A tree for H10 was obtained when the height of a tree is 10 and the fanout at each internal node is between 4 and 7. The root node has seven child nodes. The number of internal nodes is 16,200 and the number of leaf nodes is 73,006. The number of internal nodes with internal jumps is thus 16,200 × $N_J$ = 486. Note that the total number of nodes increases as the height of a tree increases. To make the experiment tractable, we reduced the fanout to between 2 and 5 for the tree of H20 with the height of 20. This tree contained 616,595 internal nodes and 1,541,693 leaves. In Fig. 3, the left graph of each HxPy.D200K represents the CPU time to find all the large reference sequences, and the right graph shows the I/O time to find them, where the disk I/O rate is set to 2 MB/sec and a 1 MB buffer is used in main memory. It can be seen from Fig. 3 that algorithm SS in general outperforms FS, and their performance difference becomes prominent when the I/O cost is taken into account.

Fig. 3. Execution times for FS and SS.

To provide more insights into their performance, in addition to Table 2 in Section 3, we have Table 4, which shows the results by these two methods when |D| = 200,000 and s = 0.75 percent. In Table 4, FS scans the database eight times to find all the large reference sequences, whereas SS only involves three database scans. Note that after initial scans, disk I/O involved by FS and SS will include both disk read and disk write (i.e., writing the trimmed version of the database back to the disk). The I/O time for these two methods is shown in Fig. 4. Considering both CPU and I/O times, the total execution time of FS is 143.94 seconds, and that of SS is 100.89 seconds. Note that the execution time ratio of SS to FS is 0.70 in this case, which is slightly better than the one associated with Table 2.
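As a quick check of the totals quoted above, the following Python sketch adds the CPU and I/O times listed in Table 4 and forms the ratio of the total time of SS to that of FS. The variable names are illustrative assumptions; the four input values are the ones given in the table.

```python
# CPU and I/O times in seconds, copied from Table 4.
cpu = {"FS": 58.94, "SS": 57.89}
io = {"FS": 85.00, "SS": 43.00}

# Total execution time per algorithm.
total = {alg: round(cpu[alg] + io[alg], 2) for alg in cpu}

# Ratio of the total time of SS to that of FS (smaller is better for SS).
ratio_ss_to_fs = total["SS"] / total["FS"]

# Share of the total time spent on I/O, per algorithm.
io_share = {alg: io[alg] / total[alg] for alg in cpu}

print(total, round(ratio_ss_to_fs, 2))
```

The CPU times of the two methods are nearly equal, so the gap in total time comes almost entirely from the I/O saved by the smaller number of database scans in SS.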

Fig. 5 shows scale-up experiments, where both the CPU and I/O times of each method increase linearly as the database size increases. For this experiment, the traversal tree has 10 levels, the fanout of internal nodes is between 4 and 7, and the minimum support is set to 0.75 percent. It can be seen that SS consistently outperforms FS as the database size increases.

4.3 Sensitivity Analysis

Since algorithm SS in general outperforms FS, we conduct the sensitivity analysis on various parameters for algorithm SS only in this section. Performance evaluation was carried out under the condition that the database size is 200,000, the average size of traversal paths is 10 (i.e., |P| = 10), and the minimum support is 0.75 percent.

TABLE 3
MEANING OF VARIOUS PARAMETERS

H       The height of a traversal tree.
F       The number of child nodes, fanout.
N_J     The number of internal nodes with an internal jump.
p_0     Backward weight in probability to its parent node.
p_j     Jump weight in probability to its internal jump.
q       A parameter of a Zipf-like distribution.
HxPy    x is the height of a tree and y = |P|.
|D|     The number of reference paths (size of database).
D_k     Set of forward references for L_k.
C_k     Set of candidate k-reference sequences.
L_k     Set of large k-reference sequences.
|P|     Average size of the reference paths.

TABLE 4
NUMBER OF LARGE REFERENCE SEQUENCES AND EXECUTION TIMES FOR H20P20

k       1         2         3         4         5         6         7         8         Time (sec)

Algorithm FS
C_k               206       146       106       75        37        15        4
L_k     141       139       124       103       70        36        15        4         58.94
D_k     29 MB     29 MB     27.3 MB   13.8 MB   10.1 MB   5.3 MB    1.9 MB    0.6 MB    85.00

Algorithm SS
C_k               206       370       106       75        37        15        4
L_k     141       139       124       106       70        36        15        4         57.89
D_k     29 MB     -         29 MB     -         -         -         -         14.1 MB   43.00

Fig. 6 shows the number of large reference sequences when the probability of moving backward at an internal node, p_0, varies from 0.1 to 0.5. As the probability increases, the number of large reference sequences decreases because the possibility of having forward traveling becomes smaller. Fig. 7 shows the number of large reference sequences when the number of child nodes of internal nodes, i.e., fanout F, varies. The three corresponding traversal trees all have the same height, 8. The tree with a fanout between 2 and 4 consists of 483 internal nodes and 1,267 leaf nodes. The tree for the second bar consists of 11,377 internal nodes and 62,674 leaf nodes, and the one for the third bar consists of 74,632 internal nodes and 634,538 leaf nodes. The results show that the number of large reference sequences decreases as the degree of fanout increases, because with a larger fanout the traversal paths are more likely to be dispersed to several branches, thus resulting in fewer large reference sequences. Clearly, when the number of large reference sequences decreases, the execution time to find them also decreases.

Fig. 5. Execution time of FS and SS when database size increases.

Fig. 6. Number of large reference sequences when backward weight p_0 is varied.

Fig. 8 gives the number of large reference sequences when the probability of traveling to each child node from an internal node is determined from a Zipf-like distribution. Different values of parameter q for the Zipf-like distribution are considered. The Zipf-like distribution of branching probabilities to child nodes is generated as follows. The probability p_i that the ith child node is accessed by a traversal path is

$$p_i = \frac{c}{i^{(1-q)}}, \qquad c = \frac{1}{\sum_{i=1}^{n} \left( 1 / i^{(1-q)} \right)},$$

where c is a normalization constant and n is the number of child nodes at an internal node. After we get each p_i, it is then normalized so that

$$p_0 + \sum_{i=1}^{n} p_i + p_j = 1$$

as in Section 4.1. Setting the parameter q = 0 corresponds to the pure Zipf distribution, which is highly skewed, whereas q = 1 corresponds to the uniform distribution. The results show that the number of large reference sequences increases when the corresponding probabilities are more skewed.
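The Zipf-like branching probabilities can be sketched in a few lines of Python. The function names are hypothetical, and the rescaling in node_probabilities assumes the same rule as in Section 4.1 (child weights summing to 1 - p_0, then everything scaled by 1 - p_j when a jump exists); it is one reading of the normalization, not a transcription of the simulator.

```python
def zipf_like(n, q):
    """p_i = c / i**(1 - q) for i = 1..n, with c chosen so the p_i sum to 1."""
    weights = [1.0 / i ** (1.0 - q) for i in range(1, n + 1)]
    c = 1.0 / sum(weights)  # normalization constant
    return [c * w for w in weights]

def node_probabilities(n, q, p0, pj=0.0):
    """Return (backward, children, jump) so that the three parts sum to 1.

    Assumed rule: children share (1 - p0), then p0 and the children are
    scaled by (1 - pj) to make room for an internal jump of weight pj.
    """
    children = [(1.0 - p0) * (1.0 - pj) * p for p in zipf_like(n, q)]
    return p0 * (1.0 - pj), children, pj

skewed = zipf_like(4, 0.0)    # pure Zipf: weights proportional to 1, 1/2, 1/3, 1/4
uniform = zipf_like(4, 1.0)   # q = 1: every child gets 1/4
back, kids, jump = node_probabilities(4, 0.5, p0=0.1, pj=0.1)
```

For n = 4 and q = 0 the first child receives 12/25 = 0.48 of the probability, against 0.25 for each child when q = 1, which illustrates how a smaller q concentrates traversals on a few branches.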

Table 5 shows the performance results of SS when the number of internal nodes with internal jumps, N_J, varies from 3 percent to 27 percent of the total internal nodes. The number of large reference sequences decreases slightly as N_J increases, meaning that it is less likely to have large reference sequences when we have more jumps in traversal paths. It is noted that the performance of SS is less sensitive to this parameter than to others.

Fig. 7. Number of large reference sequences when the fanout F is varied.

Fig. 8. Number of large reference sequences when parameter q of a Zipf-like distribution is varied.

TABLE 5
NUMBER OF LARGE REFERENCE SEQUENCES WHEN THE PERCENTAGE OF INTERNAL JUMPS N_J IS VARIED

N_J [%]   k       1     2     3     4     5     6     Time (sec)
3         L_k     94    91    84    58    21    3     18.76
9         L_k     94    92    83    56    22    2     18.70
15        L_k     93    90    83    55    22    3     18.88
21        L_k     93    90    82    55    22    3     18.95
27        L_k     90    87    80    53    20    2     18.69

Table 6 shows the results of SS when the height of a traversal tree varies. The fanout is between 2 and 5. As the height increases, the numbers of internal nodes and leaf nodes increase exponentially. The height of the traversal tree is increased from 3 to 20. As the height increases, the number of candidate nodes for L_1 increases, and the execution time to find L_1 thus increases. On the other hand, |L_1| can decrease as the height of the tree increases, since the average number of visits to each node decreases. The number of large reference sequences slightly decreases, for 1 ≤ k ≤ 3, when the height of the tree increases from 5 to 20.

5 CONCLUSION

In this paper, we have explored a new data mining capability which involves mining traversal patterns in an information-providing environment where documents or objects are linked together to facilitate interactive access. (This data mining capability is now incorporated into a Web usage mining tool, SpeedTracer [20].) Our solution procedure consisted of two steps. First, we derived algorithm MF to convert the original sequence of log data into a set of maximal forward references. By doing so, we filtered out the effect of some backward references and concentrated on mining meaningful user access sequences. Second, we developed algorithms to determine large reference sequences from the maximal forward references obtained. Two algorithms were devised for determining large reference sequences: one was based on some hashing and pruning techniques, and the other was further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods has been comparatively analyzed. It is shown that the option of selective scan is very advantageous, and algorithm SS thus in general outperformed algorithm FS. Sensitivity analysis on various parameters was conducted.

TABLE 6
NUMBER OF LARGE REFERENCE SEQUENCES WHEN THE HEIGHT OF A TRAVERSAL TREE H IS VARIED

H     k       1     2     3     4     5     6     7     Time (sec)
3     L_k     64    93    60    42    9                 15.52
5     L_k     157   136   103   76    41    11          17.90
10    L_k     116   111   100   80    48    20    4     19.68
15    L_k     111   110   100   81    43    14    1     20.39
20    L_k     98    97    92    73    46    19    4     21.01

APPENDIX

Generation of Large Itemsets and Candidate Itemsets

Given an example transaction database D, as shown in Fig. 9, the large itemsets and candidate itemsets can be determined as follows. In essence, large 1-itemsets are generated first and are then used to construct candidate itemsets in the next pass. With the minimal support equal to two, after each database scan, large itemsets are determined from candidate itemsets with the number of occurrences greater than or equal to two. A detailed algorithm can be found in [5].
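The level-wise procedure described above can be sketched in Python as follows. This is a minimal Apriori-style illustration under stated assumptions: the function name and the four transactions are hypothetical and are not taken from Fig. 9, and the sketch omits the hashing and trimming techniques discussed in the paper.

```python
from itertools import combinations

def large_itemsets(transactions, min_support):
    """Level-wise generation: L_1 -> C_2 -> L_2 -> ... until no itemset is large."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One database scan: count the transactions containing each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    large = {c: n for c, n in count(items).items() if n >= min_support}
    result = {}
    k = 1
    while large:
        result[k] = large
        k += 1
        # Join step: unions of two large (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in large for s in combinations(c, k - 1))}
        large = {c: n for c, n in count(candidates).items() if n >= min_support}
    return result

# Hypothetical database with four transactions; minimal support of two.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = large_itemsets(db, min_support=2)
```

On this hypothetical input the large 1-itemsets are {A}, {B}, {C}, and {E}, the large 2-itemsets are {A, C}, {B, C}, {B, E}, and {C, E}, and the only large 3-itemset is {B, C, E}.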

ACKNOWLEDGMENTS

Ming-Syan Chen is supported, in part, by Project No. NSC 86-2621-E-002-023-T of the National Science Council, Taiwan, Republic of China. Jong Soo Park is supported by the 1997 Grants for Professors of Sungshin Women’s University in Korea.

REFERENCES

[1] R. Agrawal, C. Faloutsos, and A. Swami, “Efficient Similarity Search in Sequence Databases,” Proc. Fourth Int’l Conf. Foundations of Data Organization and Algorithms, Oct. 1993.

[2] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami, “An Interval Classifier for Database Mining Applications,” Proc. 18th Int’l Conf. Very Large Data Bases, pp. 560–573, Aug. 1992.

[3] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” Proc. ACM SIGMOD, pp. 207–216, May 1993.

[4] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases,” Proc. 20th Int’l Conf. Very Large Data Bases, pp. 487–499, Sept. 1994.

[5] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 11th Int’l Conf. Data Eng., pp. 3–14, Mar. 1995.

[6] T.M. Anwar, H.W. Beck, and S.B. Navathe, “Knowledge Mining by Imprecise Querying: A Classification-Based Approach,” Proc. Eighth Int’l Conf. Data Eng., pp. 622–630, Feb. 1992.

[7] T. Berners-Lee, R. Fielding, and H. Frystyk, “Hypertext Transfer Protocol-HTTP/1.0,” Internet Draft, Feb. 1996.

[8] M. Bieber and J. Wan, “Backtracking in a Multiple-Window Hypertext Environment,” ACM European Conf. Hypermedia Technology, pp. 158–166, 1994.

[9] E. Carmel, S. Crawford, and H. Chen, “Browsing in Hypertext: A Cognitive Study,” IEEE Trans. Systems, Man, and Cybernetics, vol. 22, no. 5, pp. 865–883, Sept. 1992.

[10] L.D. Catledge and J.E. Pitkow, “Characterizing Browsing Strategies in the World-Wide Web,” Proc. Third WWW Conf., Apr. 1995.

[11] J. December and N. Randall, The World Wide Web Unleashed, SAMS Publishing, 1994.

[12] J. Han, Y. Cai, and N. Cercone, “Knowledge Discovery in Databases: An Attribute-Oriented Approach,” Proc. 18th Int’l Conf. Very Large Data Bases, pp. 547–559, Aug. 1992.

[13] J. Han and Y. Fu, “Discovery of Multiple-Level Association Rules from Large Databases,” Proc. 21st Int’l Conf. Very Large Data Bases, pp. 420–431, Sept. 1995.

[14] R.T. Ng and J. Han, “Efficient and Effective Clustering Methods for Spatial Data Mining,” Proc. 20th Int’l Conf. Very Large Data Bases, pp. 144–155, Sept. 1994.

[15] J.-S. Park, M.-S. Chen, and P.S. Yu, “Using a Hash-Based Method with Transaction Trimming for Mining Association Rules,” IEEE Trans. Knowledge and Data Eng., vol. 9, no. 5, pp. 813–825, Sept./Oct. 1997.

[16] G. Piatetsky-Shapiro, “Discovery, Analysis, and Presentation of Strong Rules,” Knowledge Discovery in Databases, pp. 229–248, 1991.

[17] J.R. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, pp. 81–106, 1986.

[18] N.R. Trio, personal communication, May 1995.

[19] J.T.-L. Wang, G.-W. Chirn, T.G. Marr, B. Shapiro, D. Shasha, and K. Zhang, “Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results,” Proc. ACM SIGMOD, Minneapolis, pp. 115–125, May 1994.

[20] K.-L. Wu, P.S. Yu, and A. Ballman, “SpeedTracer: A Web Usage Mining and Analysis Tool,” IBM Systems J., vol. 37, no. 1, pp. 89–105, Jan. 1998.

Ming-Syan Chen received the BS degree in electrical engineering from National Taiwan University, Taipei, Taiwan, Republic of China, in 1982; and the MS and PhD degrees in computer information and control engineering from the University of Michigan, Ann Arbor, in 1985 and 1988, respectively. Dr. Chen is now a professor in the Electrical Engineering Department at National Taiwan University. His research interests include database systems, Internet technologies, and multimedia applications. He was a research staff member at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, from 1988 to 1996, primarily involved in projects related to parallel databases, multimedia systems, and data mining. He has published more than 75 refereed international journal/conference papers in these research areas, and more than 30 of the papers have appeared in ACM and IEEE journals and transactions. Dr. Chen is currently an editor of IEEE Transactions on Knowledge and Data Engineering and also served as a guest co-editor for a special issue of IEEE Transactions on Knowledge and Data Engineering on mining of databases in December 1996. He holds many international patents in the areas of interactive video playout, video server design, interconnection networks, and concurrency and coherency control protocols. He received the Outstanding Innovation Award from IBM in 1994 for his contribution to parallel transaction design for a major database product, and numerous other awards for his inventions and patent applications. Dr. Chen is a senior member of the IEEE and a member of the ACM.

Jong Soo Park received the BS degree in electrical engineering (with honors) from Pusan National University, Pusan, Korea, in 1981; and the MS and PhD degrees in electrical engineering from the Korea Advanced Institute of Science and Technology, Seoul, Korea, in 1983 and 1990, respectively. From 1983 to 1986, he served as an engineer at the Korean Ministry of National Defense. He was a visiting researcher at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York, from July 1994 to July 1995. He is currently an associate professor in the Department of Computer Science at Sungshin Women’s University, Seoul, Korea. His research interests include data mining, geographic information systems, and digital libraries. He is a member of the ACM, the IEEE, and the Korea Information Science Society (KISS).


Philip S. Yu (S’76-M’78-SM’87-F’93) received the BS degree in electrical engineering from National Taiwan University, Taipei, Taiwan, Republic of China, in 1972; the MS and PhD degrees in electrical engineering from Stanford University in 1976 and 1978, respectively; and the MBA degree from New York University in 1982. He has been with the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, since 1978, and he is currently manager of the Software Tools and Techniques group there. His current research interests include database systems, data mining, multimedia systems, transaction and query processing, parallel and distributed systems, disk arrays, computer architecture, performance modeling, and workload analysis. He has published more than 220 papers in refereed journals and conferences, and more than 140 research reports, and 90 invention disclosures. He holds, or has applied for, 56 U.S. patents. Dr. Yu is a fellow of the IEEE and the ACM. He was an editor of IEEE Transactions on Knowledge and Data Engineering. In addition to serving as a program committee member for various conferences, he served as the program chair of the Second International Workshop on Research Issues on Data Engineering: Transaction and Query Processing, and as program co-chair of the 11th International Conference on Data Engineering. He has received several IBM and other industrial honors, including awards for best paper, IBM Outstanding Innovation, Outstanding Technical Achievement, 21 Invention Achievement plateaus, and two Research Division awards.