Using Pattern-Join and Purchase-Combination for Mining Web Transaction
Patterns in an Electronic Commerce Environment
Ching-Huang Yun and Ming-Syan Chen
Department of Electrical Engineering
National Taiwan University
Taipei, Taiwan, ROC
E-mail: [email protected], [email protected]
Abstract
In this paper, we explore the data mining capability which involves mining Web transaction patterns for an electronic commerce (EC) environment. To better reflect the customer usage patterns in the EC environment, we propose a mining model that takes both the traveling patterns and purchasing behavior of customers into consideration. We devise two efficient algorithms (MTS_PJ and MTS_PC) for determining the frequent transaction patterns, which are termed large transaction patterns in this paper. In addition, algorithm WTM, devised in our prior work, is used for comparison purposes. By utilizing the path-trimming technique, which is developed to exploit the relationship between traveling and purchasing behaviors, MTS_PJ and MTS_PC are able to generate the large transaction patterns very efficiently. A simulation model for the EC environment is developed and a synthetic workload is generated for performance studies.
1 Introduction
With the rapid growth of the information sources available on the World Wide Web, it has become increasingly important to efficiently analyze usage patterns in various emerging Web applications [4][8]. In some existing electronic commerce environments, Web pages are usually designed as shop-windows, and customers can visit these Web pages and make Web transactions through the Web interface. One example scenario for a Web transaction is shown in Figure 1, where a customer travels along the EC system and purchases a set of items in the corresponding nodes (i.e., Web pages) of his/her traversal path. Figure 1(a) shows the traveling pattern of this customer and the Web transaction data is shown in Figure 1(b), where, for example, item i1 was purchased when the customer visited node B. Note that mining information from such an EC system can provide very valuable information on customer buying behavior, and the quality of business strategies can then be improved [2][6].

Figure 1. An illustrative example for a Web transaction where nodes are marked gray if items are purchased there.
Mining of databases has attracted a growing amount of attention in database communities due to its wide applicability in industry for improving marketing strategies [1][3]. Recently, data mining for Web usage has been drawing an increasing amount of attention from both research and industrial communities. Web usage mining is the process of discovering interesting patterns in Web usage data. A study on efficient mining of path traversal patterns for capturing Web user behavior was conducted in [4]. WEBMINER [5] applies Apriori-related algorithms for mining Web usage association rules and sequential patterns. WebLogMiner [8] proposes techniques for using OLAP and data mining to discover Web access patterns in Web access logs.
In this paper, we shall explore the data mining capability which involves mining Web transaction patterns for an EC environment where customers can seek items of interest for possible purchases [7]. For the measurement of customer purchasing behavior association, a novel form of knowledge (called Web transaction rules) can be derived from the Web transaction patterns. Clearly, the distinctive characteristics of mining Web transaction patterns increase the difficulty of extracting information from the Web transaction data. The design and development of efficient algorithms for mining Web transaction patterns is taken as the objective of this paper.

Consequently, to better reflect the customer usage patterns in the EC environment, we propose a mining model that takes both the traveling patterns and purchasing behavior of customers into consideration. First, for each Web transaction, we develop the BTB (Backward-Then-Branching) algorithm to extract meaningful Web transaction records from the given Web transaction. After all the Web transaction records are derived from Web transactions, three algorithms are developed for determining the large transaction patterns from the Web transaction records, where a large transaction pattern is a transaction pattern that appears in a sufficient number of Web transactions. In our prior work [7], algorithm WTM (Web Transaction Mining) is devised by directly extending the schemes of prior work on mining path traversal patterns [4] to mine Web transactions. However, without utilizing the paths of large transaction patterns, WTM may generate a lot of unqualified candidate transaction patterns, thus degrading the performance. In contrast, algorithms MTS_PJ (Maximal-Transaction-Segment with Pattern Join) and MTS_PC (Maximal-Transaction-Segment with Purchase Combination) are developed based on the path-trimming technique, which exploits the fact that one can trim the generation of the candidate transaction patterns according to the paths traversed and keep, in the same maximal transaction segment, the related information which appears in the same path. A simulation model for the EC environment is developed and a synthetic workload is generated for performance studies. By utilizing the maximal transaction segment for path-trimming, MTS_PJ and MTS_PC are shown to outperform WTM in execution efficiency.

This paper is organized as follows. Preliminaries are given in Section 2. In Section 3, three algorithms, WTM, MTS_PJ, and MTS_PC, are devised for mining Web transaction patterns. Experimental studies are conducted in Section 4. This paper concludes with Section 5.

2 Preliminaries

Let $N = \{n_1, n_2, \ldots, n_q\}$ be a set of nodes in the EC environment and $I = \{i_1, i_2, \ldots, i_h\}$ be a set of items sold in the EC system. We then have the following definitions:

Definition 1. Let $\{s_1 s_2 \ldots s_y\}$ be a path sequence, where $\{s_1, s_2, \ldots, s_y\} \subseteq N$. $\{s_1 s_2 \ldots s_y\}$ is said to path-contain a path $\{r_1 r_2 \ldots r_j\}$ as a consecutive subsequence if there exists a $z$ such that $s_{z+x} = r_x$ for $1 \le x \le j$.

Definition 2. Let $\langle s_1 s_2 \ldots s_y : n_1\{i_1\}, n_2\{i_2\}, \ldots, n_x\{i_x\} \rangle$ be a transaction pattern, where $i_m \subseteq I$ for $1 \le m \le x$, and $\{n_1, n_2, \ldots, n_x\} \subseteq \{s_1, s_2, \ldots, s_y\} \subseteq N$. Then, $\langle s_1 s_2 \ldots s_y : n_1\{i_1\}, n_2\{i_2\}, \ldots, n_x\{i_x\} \rangle$ is said to pattern-contain a transaction pattern $\langle w_1 w_2 \ldots w_q : r_1\{t_1\}, r_2\{t_2\}, \ldots, r_p\{t_p\} \rangle$ if and only if $\{s_1 s_2 \ldots s_y\}$ path-contains $\{w_1 w_2 \ldots w_q\}$ and $\{n_1\{i_1\}, n_2\{i_2\}, \ldots, n_x\{i_x\}\}$ contains $\{r_1\{t_1\}, r_2\{t_2\}, \ldots, r_p\{t_p\}\}$.

Definition 3. A Web transaction is said to pattern-contain $\langle w_1 w_2 \ldots w_q : r_1\{t_1\}, r_2\{t_2\}, \ldots, r_p\{t_p\} \rangle$ if one of its Web transaction records pattern-contains $\langle w_1 w_2 \ldots w_q : r_1\{t_1\}, r_2\{t_2\}, \ldots, r_p\{t_p\} \rangle$.

Note that a Web transaction consists of a set of purchases along the corresponding nodes in its traversal path, such as the example shown in Figure 1. Unlike the work on path traversal patterns [4], the mining of Web transaction patterns takes both traveling patterns and purchasing behavior into consideration. The Web transaction rule, derived from Web transaction patterns in this paper, is an implication of the form $\langle n_1 n_2 \ldots n_y : X \Longrightarrow Z \rangle$, where X and Z are both sets of purchases, $X \cap Z = \emptyset$, and $\{n_1, n_2, \ldots, n_y\} \subseteq N$. The rule $\langle n_1 n_2 \ldots n_y : X \Longrightarrow Z \rangle$ has support s in the Web transaction database D if s% of the Web transactions in D pattern-contain $\langle n_1 n_2 \ldots n_y : X \cup Z \rangle$. Also, the rule $\langle n_1 n_2 \ldots n_y : X \Longrightarrow Z \rangle$ holds with confidence c if c% of the Web transactions in D that contain X also contain Z along the path $\{n_1 n_2 \ldots n_y\}$.

Before devising algorithms for determining large transaction patterns in Section 3, we shall first present algorithm BTB, devised for deriving meaningful Web transaction records from each Web transaction. Explicitly, given a Web transaction of a customer, BTB proceeds as follows. First, BTB traces the Web transaction to output the customer transaction records, where a customer transaction record consists of a path and the items purchased in the corresponding nodes of that path. Each customer transaction record is used as a branch for constructing the customer transaction tree. Algorithm BTB, i.e., Backward-Then-Branching, is so named because it outputs the record stored in the buffer as a branch of the customer transaction tree whenever the customer makes a backward movement first and a branching movement later. After the customer transaction tree is constructed, BTB traverses the tree in a depth-first manner to output Web transaction records. When traveling to a leaf node, BTB outputs a Web transaction record which consists of a path from the root node to that leaf node and a set of purchases made along the path. Finally, BTB stores all Web transaction records, according to the corresponding Web transaction identifiers, into the database.
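The containment relations of Definitions 1 to 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `make_pattern`, `path_contains`, `pattern_contains`, and `transaction_contains` are hypothetical, nodes are single-character labels, and each purchase is modeled as a (node, item) pair, which simplifies the itemset-valued purchases of Definition 2.

```python
from typing import FrozenSet, List, Sequence, Tuple

Purchase = Tuple[str, str]                       # (node, item), e.g. ("B", "i1")
Pattern = Tuple[Tuple[str, ...], FrozenSet[Purchase]]


def make_pattern(path: str, *purchases: Purchase) -> Pattern:
    """Build a pattern from a path string such as "ABCE" (assumed encoding)."""
    return tuple(path), frozenset(purchases)


def path_contains(path: Sequence[str], sub: Sequence[str]) -> bool:
    """Definition 1: `sub` appears in `path` as a consecutive subsequence."""
    n, m = len(path), len(sub)
    return any(tuple(path[z:z + m]) == tuple(sub) for z in range(n - m + 1))


def pattern_contains(big: Pattern, small: Pattern) -> bool:
    """Definition 2: the path is path-contained and the purchases are contained."""
    return path_contains(big[0], small[0]) and small[1] <= big[1]


def transaction_contains(records: List[Pattern], pattern: Pattern) -> bool:
    """Definition 3: at least one record of the transaction pattern-contains it."""
    return any(pattern_contains(record, pattern) for record in records)


# The two records quoted later in the text for the Web transaction with WT-ID = 100.
wt100 = [
    make_pattern("ABCE", ("B", "i1"), ("C", "i2"), ("E", "i4")),
    make_pattern("ABFGH", ("B", "i1"), ("H", "i6")),
]
print(transaction_contains(wt100, make_pattern("AB", ("B", "i1"))))   # True
print(transaction_contains(wt100, make_pattern("AS", ("S", "i7"))))   # False
```

Representing purchases as a frozen set keeps the containment test of Definition 2 a plain subset check.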
3 Algorithms for Mining Web Transaction Patterns
Section 3.1 reviews algorithm WTM, described in [7], which is a procedure for mining large transaction patterns. By utilizing the maximal transaction segment for path-trimming, we devise two algorithms, MTS_PJ and MTS_PC, for efficiently determining large transaction patterns. Let $C_k$ be the set of candidate k-transaction patterns and $T_k$ represent the set of large k-transaction patterns. Because both MTS_PJ and MTS_PC generate $T_k$ along with the generation of $C_{k+1}$, we use round k to refer to the procedure executed to obtain ($T_k$, $C_{k+1}$). In Section 3.2, MTS_PJ utilizes the pattern join scheme for the generation of candidate transaction patterns. In Section 3.3, algorithm MTS_PC improves on this by employing the purchase combination scheme, and is able to generate fewer uncertain candidate transaction patterns than MTS_PJ in each round, thus reducing the computational overhead.
3.1 Algorithm WTM
Similarly to scheme FS in [4], WTM [7] joins the purchased itemsets for generating candidate transaction patterns. However, unlike FS, WTM employs a two-level hash tree, called the Web transaction tree, to store candidate transaction patterns. Figure 2 is an illustrative example of mining Web transaction patterns with WTM, and Figure 3 shows one part of the Web transaction tree storing the $C_2$ in Figure 2.

According to each Web transaction record of a Web transaction, the support of a candidate transaction pattern is determined by the number of Web transactions that pattern-contain this candidate transaction pattern. WTM then obtains large transaction patterns during the procedure of destructing the Web transaction tree. Each large transaction pattern is generated when its support exceeds the minimum support. For example, one can destruct the Web transaction tree in Figure 3 to determine the $T_2$ in Figure 2. When traversing to node E, two large 2-transaction patterns {<ABCE: B{i1}, E{i4}>, <ABCE: C{i2}, E{i4}>} are generated.
Consider the example scenario in Figure 2. In the first round, WTM constructs the Web transaction tree by hashing each Web transaction record and counts the support of individual purchases. Then, WTM destructs the Web transaction tree for deriving $T_1$, the set of large 1-transaction patterns, and utilizes the purchased itemsets in $T_1$ for generating $C_2$, the set of candidate 2-transaction patterns to be stored in a Web transaction tree. In each subsequent round, WTM starts with the candidate transaction patterns found in the previous round for the counting of supports and identifies the large transaction patterns. Then, WTM proceeds to the generation of new candidate transaction patterns and stores them in the Web transaction tree.

Figure 2. An illustrative example for mining Web transaction patterns with algorithm WTM.

Figure 3. The Web transaction tree for storing candidate 2-transaction patterns.

To illustrate the operations of WTM, it can be seen from Figure 2 that both the Web transactions with WT-ID = 100 and WT-ID = 200 contain one Web transaction record <ABCE: B{i1}, C{i2}, E{i4}> that pattern-contains <ABCE: B{i1}, E{i4}>. Also, in the Web transaction with WT-ID = 300, one Web transaction record <ABCE: B{i1}, E{i4}> pattern-contains <ABCE: B{i1}, E{i4}>. Thus, the occurrence count of <ABCE: B{i1}, E{i4}> is 3. In addition, consider the counting of the candidate set $C_1$, for example. In the Web transaction with WT-ID = 100, two Web transaction records <ABCE: B{i1}, C{i2}, E{i4}> and <ABFGH: B{i1}, H{i6}> pattern-contain <AB: B{i1}>. Although both pattern-contain <AB: B{i1}>, these two Web transaction records account for only one support count for <AB: B{i1}> since they are from the same Web transaction 100, thus avoiding duplicate counting across the different Web transaction records of the same Web transaction. Hence, the final support of <AB: B{i1}> is 3.

Figure 4. Path-trimming technique by using the maximal transaction segment: (a) candidate transaction patterns (unqualified patterns are marked gray); (b) candidate transaction patterns after being trimmed by path-trimming.

After all large transaction patterns are obtained, one can derive the Web transaction rules from the large transaction patterns. In this example, <ABCE: B{i1}, C{i2}, E{i4}> is one large 3-transaction pattern with support = 2 and <AB: B{i1}> is one large 1-transaction pattern with support = 3. As a result, we can derive one Web transaction rule <ABCE: B{i1} => C{i2}, E{i4}> with the support equal to support(<ABCE: B{i1}, C{i2}, E{i4}>) = 2 and the confidence equal to

$$\frac{\mathrm{support}(\langle ABCE : B\{i_1\}, C\{i_2\}, E\{i_4\} \rangle)}{\mathrm{support}(\langle AB : B\{i_1\} \rangle)} = \frac{2}{3} \approx 67\%.$$
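The support counting and the confidence of the derived rule can be reproduced with a short Python sketch. It is an illustration under assumptions, not the WTM implementation: the toy database holds only the Web transaction records quoted in the text (the full contents of Figure 2 are not reproduced), purchases are modeled as (node, item) pairs, node labels are single characters so that a substring test stands in for path containment, and the helper names `pattern`, `pattern_contains`, and `support` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Purchase = Tuple[str, str]                 # (node, item)
Pattern = Tuple[str, FrozenSet[Purchase]]  # (path, purchases)


def pattern(path: str, *purchases: Purchase) -> Pattern:
    return path, frozenset(purchases)


def pattern_contains(record: Pattern, cand: Pattern) -> bool:
    # With single-character node labels, a substring test checks that the
    # candidate path is a consecutive subsequence of the record path.
    return cand[0] in record[0] and cand[1] <= record[1]


def support(db: Dict[int, List[Pattern]], cand: Pattern) -> int:
    # Each Web transaction counts at most once, however many of its
    # records pattern-contain the candidate.
    return sum(
        1 for records in db.values()
        if any(pattern_contains(r, cand) for r in records)
    )


B1, C2, E4, H6 = ("B", "i1"), ("C", "i2"), ("E", "i4"), ("H", "i6")

# Only the records quoted in the text; other records of Figure 2 are omitted.
db = {
    100: [pattern("ABCE", B1, C2, E4), pattern("ABFGH", B1, H6)],
    200: [pattern("ABCE", B1, C2, E4)],
    300: [pattern("ABCE", B1, E4)],
}

sup_ab = support(db, pattern("AB", B1))              # 3, not 4
sup_be = support(db, pattern("ABCE", B1, E4))        # 3
sup_bce = support(db, pattern("ABCE", B1, C2, E4))   # 2
confidence = sup_bce / sup_ab
print(sup_ab, sup_be, sup_bce, f"{confidence:.0%}")  # 3 3 2 67%
```

Iterating over transactions rather than over records is what prevents the two records of Web transaction 100 from being counted twice for <AB: B{i1}>.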
3.2 Algorithm MTS_PJ
As can be seen from Figure 2, without utilizing the paths of large transaction patterns, algorithm WTM generates a lot of unqualified candidate transaction patterns, thus degrading the performance. For example, one candidate 2-transaction pattern <AS: B{i1}, S{i7}> is generated by joining the large 1-transaction patterns <AB: B{i1}> and <AS: S{i7}>. However, <AS: B{i1}, S{i7}> is never counted during the support counting because the nodes of the path AS do not even contain the node B of the purchase B{i1}. This problem is called the unqualified candidate transaction pattern problem. This implies that one can trim the generation of the candidate transaction patterns according to the paths traversed. In light of this concept of path-trimming, algorithm MTS_PJ is designed to solve the unqualified candidate transaction pattern problem. Explicitly, during the generation of large transaction patterns by destructing the Web transaction tree, MTS_PJ not only determines large transaction patterns but also uses a buffer to keep a segment that contains large transaction patterns and the maximal path so as to properly classify the patterns, where the maximal path corresponds to a path from the root node to the leaf node of the Web transaction tree. This segment is called the maximal transaction segment in that MTS_PJ joins large transaction patterns for generating candidate transaction patterns only when the leaf node of the Web transaction tree is reached. The purpose of classifying the patterns is that patterns whose paths do not path-contain each other need not be considered together when generating candidate transaction patterns.
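The two observations above can be stated as simple checks. The following Python sketch is illustrative only: `is_qualified` and `paths_related` are hypothetical helper names, nodes are assumed to be single-character labels, and the code does not reproduce how MTS_PJ maintains its buffer or the Web transaction tree.

```python
from typing import Iterable, Tuple

Purchase = Tuple[str, str]  # (node, item)


def is_qualified(path: str, purchases: Iterable[Purchase]) -> bool:
    """A candidate can only be counted if every purchase node lies on its path."""
    return all(node in path for node, _item in purchases)


def paths_related(path_a: str, path_b: str) -> bool:
    """True if one path contains the other as a consecutive subsequence."""
    return path_a in path_b or path_b in path_a


# The unqualified candidate from the text: node B is not on the path AS.
print(is_qualified("AS", [("B", "i1"), ("S", "i7")]))    # False
print(is_qualified("ABCE", [("B", "i1"), ("E", "i4")]))  # True

# Patterns on AB and AS need not be joined; patterns on AB and ABCE may be.
print(paths_related("AB", "AS"))    # False
print(paths_related("AB", "ABCE"))  # True
```

Restricting joins to patterns for which `paths_related` holds is the intuition behind classifying patterns by maximal transaction segment.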
Explicitly, MTS_PJ applies the pattern join scheme to the large (k-1)-transaction patterns in each segment (denoted as $T^s_{k-1}$) for the generation of candidate k-transaction patterns in the same segment (denoted as $C^s_k$). Also, $C'_k$ is the set of uncertain candidate k-transaction patterns generated from the pattern joins. In round k-1, we can derive

$$T_{k-1} = \bigcup_s T^s_{k-1} \quad \text{and} \quad C'_k = \bigcup_s C^s_k = \bigcup_s \left( T^s_{k-1} * T^s_{k-1} \right),$$

where $T^s_{k-1} * T^s_{k-1}$ means joining the large (k-1)-transaction patterns in each segment. $C_k$ is derived after the (k-1)-subpattern identifications in $C'_k$. Note that the candidate transaction patterns generated in different segments may overlap, but the redundant ones will be detected and deleted when they are inserted into the Web transaction tree. For the example shown in Figure 2, WTM generates 28 candidate 2-transaction patterns, shown in Figure 4(a), whereas MTS_PJ generates 10 candidate 2-transaction patterns, shown in Figure 4(b), which is a significant performance improvement of MTS_PJ over WTM. This demonstrates the advantage of the path-trimming technique that MTS_PJ employs.

3.3 Algorithm MTS_PC
Algorithm MTS_PC is similar to algorithm MTS_PJ in that it also employs the path-trimming concept of the maximal transaction segment to reduce the computational overhead, but differs from the latter in that MTS_PC, by utilizing the information in purchases, is able to reduce the number of uncertain candidate transaction patterns, thus further reducing the corresponding computational overhead and memory consumption. The method by which MTS_PC reduces the number of uncertain candidate transaction patterns is described below. Based on the information in the purchases of large transaction patterns in the maximal transaction segment, MTS_PC computes, for each purchase, the large count (LC), which refers to the number of large transaction patterns in which the purchase appears, and utilizes the LC to further prune unqualified purchases (referred to as LC filtering). LC filtering is devised in light of the observation that for each purchase that can be qualified as a purchase in one of $T_{k+1}$, that purchase must appear in at least k large k-transaction patterns. In round k, when reaching the leaf node of the Web transaction tree, MTS_PC discards those purchases whose LCs are smaller than k by this technique of LC filtering. For the example shown in Figure 2, if there exists one large 3-transaction pattern <ABCE: B{i1}, C{i2}, E{i4}>, then we know that there must exist three large 2-transaction patterns, which are <ABC: B{i1}, C{i2}>, <ABCE: B{i1}, E{i4}>, and <ABCE: C{i2}, E{i4}>. In this example, purchases B{i1}, C{i2}, and E{i4} in <ABCE: B{i1}, C{i2}, E{i4}> all appear in two large 2-transaction patterns. Recall that $C'_k$ is the set of uncertain candidate k-transaction patterns and $C'_k = \bigcup_s C^s_k$. By utilizing the purchase combination scheme, MTS_PC generates $C^s_k$ by combining k purchases from those purchases satisfying the LC threshold. As such, the number of uncertain candidate transaction patterns, which in turn determines the number of subpattern identifications, can be reduced. After $C'_k$ is obtained, the subpattern identification proceeds. In the subpattern identification of each uncertain candidate k-transaction pattern, its (k-1)-subpatterns are evaluated to see if they are large (k-1)-transaction patterns.

In Figure 3, when reaching the leaf node E, the scenario for MTS_PC to utilize LC filtering and the purchase combination technique is shown in Figure 5(a). Similarly, when reaching node Q, MTS_PC also examines the maximal transaction segment, as shown in Figure 5(b). As a result, MTS_PC is able to prune unqualified purchases by scrutinizing their LCs. In addition, MTS_PC utilizes purchase combination to generate fewer uncertain candidate transaction patterns than the pattern join of MTS_PJ. For the same example of generating $C_3$, MTS_PJ generates 4 uncertain candidate transaction patterns and three patterns are pruned in the prune step, leading to 12 subpattern identifications. In contrast, algorithm MTS_PC only generates one uncertain candidate transaction pattern and no pattern needs to be pruned in the prune step, leading to 3 subpattern identifications. This shows the advantage of the LC filtering and the purchase combination technique that MTS_PC employs.

Figure 5. LC filtering and purchase combination techniques utilized by algorithm MTS_PC: (a) when the leaf node E is reached; (b) when the leaf node Q is reached.

4 Experimental Results
The method used by this study for generating synthetic Web transactions is similar to the one used in [4], with the modifications noted below. First, we construct a traversal tree and determine the items sold in the nodes of this tree to simulate the EC environment, whose starting position is the root node of the tree. The traversal tree consists of two types of nodes, namely internal nodes and leaf nodes. The number of child nodes at each internal node is called the fanout and is determined from a uniform distribution. The percentage of nodes with item selling is denoted by $N_I$, and those nodes (i.e., selling nodes) are determined randomly among all the internal nodes. The number of items sold in a selling node is denoted by $n_i$ and the purchasing probability of an item in that node is denoted by $P_b$. When a customer visits the EC system, the Web transaction completed by this customer consists of a traversal path and a set of purchases made in the corresponding nodes. A traversal path consists of the nodes accessed by a user. The size of each traversal path is determined from a Poisson distribution with mean equal to $|P|$. With the root node being the entrance node, Web transactions are generated probabilistically within the traversal tree as follows. For each node, the next hop is determined according to the probability model taken in [4]. In addition, the percentage of jumping to destination nodes, $N_D$, is also modeled and assigned to 1‰, and those nodes are determined randomly among all the internal nodes.
Figure 6 shows the relative performance of WTM and MTS_PC when $N_I$ = 80%, the fanout is between 4 and 7, the root node has 7 child nodes, the height of the tree is 10, the numbers of internal and leaf nodes are, respectively, 16,848 and 75,632, the number of items sold in those nodes is 13,478, $|D|$ = 200,000, s = 0.5‰, $P_b$ = 0.5, and $|P|$ = 30. Figure 6(a) shows that MTS_PC outperforms WTM in each round and Figure 6(b) lists the patterns generated. Explicitly, in round 1, MTS_PC uses the maximal transaction segment to classify the patterns and thus generates far fewer candidate 2-transaction patterns to construct the Web transaction tree than WTM. Hence, although it spends time in classifying the patterns, MTS_PC incurs a much shorter execution time than WTM. In round 2, MTS_PC saves time by destructing the much smaller Web transaction tree constructed in round 1. In doing so, MTS_PC uses the maximal transaction segment to classify the large 2-transaction patterns and utilizes the LC filtering to prune the unqualified purchases, which in turn results in far fewer uncertain candidate 3-transaction patterns being generated. In the following rounds, MTS_PC also outperforms WTM due to its advantages of path-trimming and purchase combination.

Figure 6. Performance comparison between WTM and MTS_PC in each round.

To provide more insights into the maximal transaction segment devised for path-trimming, Figure 7 shows that the execution times of MTS_PJ and MTS_PC increase linearly as the database size increases, indicating the good scale-up feature of MTS_PJ and MTS_PC.

Figure 7. Execution time for WTM, MTS_PJ, and MTS_PC when the database size increases (x-axis: number of Web transactions, from 200k to 1,000k).

5 Conclusion
In this paper, we examined the issue of mining Web transaction patterns, taking both traveling patterns and purchasing behavior into consideration so that one can have a better model for an EC system and thus capture and exploit the intrinsic relationship between these two customer behaviors. To address this issue, we developed algorithms for effectively mining Web transaction patterns. Two algorithms (MTS_PJ and MTS_PC) are devised to determine large transaction patterns. By utilizing the path-trimming technique, which is developed in light of the relationship between traveling and purchasing behaviors, MTS_PJ and MTS_PC are able to generate the Web transaction patterns very efficiently. A simulation model for the EC environment was developed and a synthetic workload was generated for performance studies.
Acknowledgments
The authors are supported in part by the National Science Council, Project No. NSC 89-2219-E-002-007 and NSC 89-2213-E-002-032, and the Ministry of Education Project No. 89-E-FA06-2-4-7, Taiwan, Republic of China.
References
[1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the 20th International Conference on Very Large Data Bases, pages 487-499, September 1994.
[2] A. G. Buchner and M. Mulvenna. Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining. ACM SIGMOD Record, 27(4):54-61, Dec. 1998.
[3] M.-S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866-883, 1996.
[4] M.-S. Chen, J.-S. Park, and P. S. Yu. Efficient Data Mining for Path Traversal Patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2):209-221, April 1998.
[5] R. Cooley, B. Mobasher, and J. Srivastava. Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems, 1(1), 1999.
[6] C. Schlueter and M. J. Shaw. A Strategic Framework for Developing Electronic Commerce. IEEE Internet Computing, 1(6):20-28, Nov./Dec. 1997.
[7] C.-H. Yun and M.-S. Chen. Mining Web Transaction Patterns in an Electronic Commerce Environment. Proc. of the 4th Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD'2000), pages 216-219, April 2000.
[8] O. R. Zaiane, M. Xin, and J. Han. Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs. Proc. Advances in Digital Libraries Conf. (ADL'98), Santa Barbara, CA, pages 19-29, April 1998.