Chapter 4 Mining Closed Flexible Patterns in Time-Series Databases
4.5 The CFP algorithm
The proposed CFP algorithm consists of three phases as shown in Fig. 4.4. First, we transform each time-series sequence in the database into the SAX representation as described in Section 2.4. Second, we scan the transformed database once to find all frequent 1-patterns and build a projected database for each 1-pattern found. Third, we recursively call the CFP-Extension procedure to extend each 1-pattern found in the second phase to mine closed flexible patterns in a depth-first search manner.
Algorithm: CFP
Input: a time-series database TDB, a minimum support threshold δ, a maximum gap threshold τ
Output: a complete set of closed flexible patterns, CP
1 Transform each sequence in TDB into a symbolic sequence by the SAX representation;
2 Scan the transformed database once to find all frequent 1-patterns. Let P1 be a set of frequent 1-patterns found;
3 Let CP = ∅;
4 for each 1-pattern p in P1 do
5 Construct the projected database of p, PDB;
6 call CFP-Extension (PDB, p, δ, τ, CP);
7 end for
8 return CP;
Fig. 4.4. The CFP algorithm.
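Steps 2 and 5 of Fig. 4.4 can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work. The two-sequence database `db` is hypothetical, and the helper `project` assumes that the projected database of a 1-pattern is the suffix following its first occurrence, which is consistent with the projected databases shown in the examples of this chapter.

```python
from collections import defaultdict

def frequent_1_patterns(db, min_sup):
    """Return {symbol: support} for symbols found in at least min_sup sequences."""
    support = defaultdict(int)
    for seq in db.values():
        for symbol in set(seq):  # a sequence counts once per symbol
            support[symbol] += 1
    return {s: c for s, c in support.items() if c >= min_sup}

def project(db, symbol):
    """Assumed projection: keep the suffix after the first occurrence of symbol."""
    pdb = {}
    for sid, seq in db.items():
        if symbol in seq:
            pdb[sid] = seq[seq.index(symbol) + 1:]
    return pdb

# Hypothetical symbolic database (already SAX-transformed).
db = {"S1": "abecfe", "S2": "adeabc"}
p1 = frequent_1_patterns(db, min_sup=2)
pdb_a = project(db, "a")
```

With δ = 2, the symbols a, b, c, and e are frequent in this toy database, while d and f are not.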
The CFP-Extension procedure is shown in Fig. 4.5. In steps 1-4, if p passes the redundant extension pruning strategy and the closure checking scheme, it is added to the pool of closed flexible patterns (CP). In step 5, we find the nominees of p. In steps 6-21, we combine each nominee with p to generate the closed flexible super-patterns of p. Specifically, in steps 7-15, we combine p and the nominee with every possible gap interval, use the index table to count the support of each combined pattern, and add the combined pattern to NP if it is frequent. In step 16, we use the redundant projection pruning strategy to eliminate unnecessary patterns from NP. In steps 17-20, for each remaining pattern in NP, we construct its projected database and recursively call the CFP-Extension procedure to enumerate the closed flexible patterns.
Let us elaborate on the concept of pattern extension. Once all nominees are found, we combine p and each of these nominees with all possible combinations of gap intervals. Recall that τ is the maximum gap threshold. The number of possible combinations of gap intervals is thus 1+2+…+τ+(τ+1) = (τ+1)(τ+2)/2. We combine p and a nominee ri under each of these gap intervals and represent the result in the form p[xi, yi]ri, where xi and yi are non-negative integers with 0 ≤ xi ≤ yi ≤ τ. When p is a 1-pattern, these newly combined patterns are candidate 2-patterns. Next, we need to check if they are frequent before inserting them into the second level of the tree. Assume p = a, δ = 2, τ = 1, and the projected database of a contains two sequences, {S1:<b e c f e>, S2:<d e a b c>}. We find nominee(p) = {b, e}. Since τ = 1, we have three possible combinations of gap intervals, namely, [0, 0], [0, 1], and [1, 1]. Hence, we combine p with the first nominee b under these gap intervals and obtain three candidate 2-patterns: a[0, 0]b, a[0, 1]b, and a[1, 1]b.
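The enumeration of gap intervals can be sketched as follows, assuming intervals [x, y] with 0 ≤ x ≤ y ≤ τ, which agrees with the count (τ+1)(τ+2)/2 and with the three intervals listed for τ = 1. The function name `gap_intervals` is ours and is used only for illustration.

```python
def gap_intervals(tau):
    """All gap intervals (x, y) with 0 <= x <= y <= tau."""
    return [(x, y) for x in range(tau + 1) for y in range(x, tau + 1)]

# Candidate 2-patterns for p = a, nominee b, and tau = 1.
intervals = gap_intervals(1)
candidates = [f"a[{x}, {y}]b" for x, y in intervals]
```

For τ = 2, the same function yields the six intervals [0, 0], [0, 1], [0, 2], [1, 1], [1, 2], and [2, 2].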
To count the supports of these candidates, we use the index table described in Section 4.3. We first create an index table with two columns (τ+1 = 2). Column 0 stores the ID of every sequence in which b appears immediately after a, while column 1 stores the ID of every sequence in which b appears two positions after a. In this case, we have {S1, S2} in column 0 and ∅ in column 1. Therefore, the supports of a[0, 0]b and a[1, 1]b are 2 and 0, respectively. To count the support of a[0, 1]b, we simply merge the lists in columns 0 and 1 and obtain a list containing S1 and S2. Hence, the support of a[0, 1]b is 2. Since a[0, 0]b and a[0, 1]b have the same support and share the same projected database, a[0, 1]b is pruned by the redundant projection pruning strategy (Lemma 4.3).
Similarly, we combine p with the second nominee e and obtain one frequent 2-pattern: a[1, 1]e.
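The support counting for nominees b and e can be sketched as follows. This is an illustrative reading of the index table, not the data structure of Section 4.3 itself. The full sequences in `db` are hypothetical and were chosen so that projecting them on a gives the two projected sequences used above. The sketch handles only a single-symbol pattern p.

```python
def build_index_table(db, p, r, tau):
    """Column g holds the IDs of sequences in which r occurs with exactly
    g symbols between it and some occurrence of the single symbol p."""
    table = [set() for _ in range(tau + 1)]
    for sid, seq in db.items():
        for i, symbol in enumerate(seq):
            if symbol != p:
                continue
            for gap in range(tau + 1):
                j = i + 1 + gap
                if j < len(seq) and seq[j] == r:
                    table[gap].add(sid)
    return table

def support(table, x, y):
    """Support of p[x, y]r: merge columns x..y and count distinct IDs."""
    return len(set().union(*table[x:y + 1]))

# Hypothetical full sequences whose projections on a match the text.
db = {"S1": "abecfe", "S2": "adeabc"}
table_b = build_index_table(db, "a", "b", tau=1)
table_e = build_index_table(db, "a", "e", tau=1)
```

Here `table_b` has both sequence IDs in column 0 and an empty column 1, while `table_e` has an empty column 0 and both IDs in column 1, so a[0, 1]e and a[1, 1]e are supported by the same sequences.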
After we obtain all frequent 2-patterns, we apply the same procedure of the third phase to them and extend them to frequent 3-patterns. Let us continue with the last example. After finding the frequent 2-patterns a[0, 0]b and a[1, 1]e, we use them to generate frequent 3-patterns. We first find the nominees in the projected database of a[0, 0]b, which gives nominee(a[0, 0]b) = {c}. Thus, we can generate three candidate 3-patterns: a[0, 0]b[0, 0]c, a[0, 0]b[0, 1]c, and a[0, 0]b[1, 1]c. By using an index table to count the supports of these candidate 3-patterns, we find that only a[0, 0]b[0, 1]c is frequent, with support equal to 2. Similarly, we can apply the same procedure to a[1, 1]e and extend it to frequent 3-patterns.
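The supports of the three candidate 3-patterns can be cross-checked with a brute-force matcher. The sketch below does not use the index table; it simply tests every occurrence, so it serves only as a reference check. The sequences in `db` are hypothetical full sequences consistent with the projected database of a given earlier, and the representation of a pattern as a symbol string plus a list of gap intervals is our own convention.

```python
def matches(seq, symbols, gaps):
    """True if seq contains symbols[0] gaps[0] symbols[1] ... with every
    gap (number of skipped positions) inside its interval."""
    def extend(pos, k):
        # symbols[k] has been matched at index pos
        if k == len(symbols) - 1:
            return True
        lo, hi = gaps[k]
        for gap in range(lo, hi + 1):
            nxt = pos + 1 + gap
            if nxt < len(seq) and seq[nxt] == symbols[k + 1] and extend(nxt, k + 1):
                return True
        return False
    return any(seq[i] == symbols[0] and extend(i, 0) for i in range(len(seq)))

def support(db, symbols, gaps):
    """Number of sequences in db that contain the flexible pattern."""
    return sum(matches(seq, symbols, gaps) for seq in db.values())

db = {"S1": "abecfe", "S2": "adeabc"}  # hypothetical full sequences
sup_00 = support(db, "abc", [(0, 0), (0, 0)])  # a[0, 0]b[0, 0]c
sup_01 = support(db, "abc", [(0, 0), (0, 1)])  # a[0, 0]b[0, 1]c
sup_11 = support(db, "abc", [(0, 0), (1, 1)])  # a[0, 0]b[1, 1]c
```

With δ = 2, only the candidate with interval [0, 1] before c reaches the threshold on this data.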
While recursively performing the procedure of the third phase, we integrate the proposed pruning strategies and the closure checking scheme to speed up the mining process. Upon getting a frequent pattern, we first use the redundant extension pruning strategy (Lemma 4.2) to check if the pattern can be pruned. If not, we apply the closure checking scheme (Lemma 4.1) to check whether the pattern is closed. If the pattern is closed, we add it to the pool of closed flexible patterns. In either case, a pattern that is not pruned is extended further. The effort of mining frequent patterns depends on the parameter τ: the number of gap-interval combinations used to form candidate patterns, (τ+1)(τ+2)/2, grows quadratically with τ. However, we may use the redundant projection pruning strategy (Lemma 4.3) to prune unnecessary combinations.
In summary, we use a frequent k-pattern p to generate candidate (k+1)-patterns, each of which is formed by concatenating p, a gap interval, and one nominee in nominee(p). For each candidate (k+1)-pattern generated, we check if it is frequent. Each frequent candidate that survives pruning is added to CP if it is closed, and we recursively perform the CFP-Extension procedure on it to mine its frequent super-patterns in a DFS manner. Accordingly, we mine all closed flexible patterns.
Procedure: CFP-Extension (PDB, p, δ, τ, CP)
Input: a projected database PDB of p, a frequent flexible pattern p, a minimum support threshold δ, a maximum gap threshold τ
Output: a complete set of closed flexible patterns, CP
1 if (p passes the redundant extension pruning strategy) then
2 if (p passes the closure checking scheme) then
3 Add p to CP;
4 end if
5 Find nominee(p) from PDB;
6 for each r in nominee(p) do
7 Let Σ be all possible combinations of gap intervals generated from τ, where 0 ≤ xi ≤ yi ≤ τ;
8 Let NP = ∅;
9 for each [xi, yi] in Σ do
10 Combine p, [xi, yi], and r to form p[xi, yi]r;
11 Use an index table to count the support of p[xi, yi]r;
12 if (the support of p[xi, yi]r is not less than δ) then
13 Add p[xi, yi]r to NP;
14 end if
15 end for
16 Apply the redundant projection pruning strategy to eliminate unnecessary patterns in NP;
17 for each pattern q in NP do
18 Construct q’s projected database, PDB’;
19 call CFP-Extension (PDB’, q, δ, τ, CP);
20 end for
21 end for
22 end if
23 return CP;
Fig. 4.5. The CFP-Extension procedure.
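The recursive structure of steps 5-20 can be sketched as follows. This is a deliberately simplified skeleton and not the CFP-Extension procedure itself: it omits the redundant extension pruning, the closure checking, and the redundant projection pruning, so it enumerates all frequent flexible patterns rather than only the closed ones. Instead of a projected database, it keeps for each sequence the set of positions at which the current pattern ends. The database `db`, the function names, and the treatment of nominees as symbols reachable within τ+1 positions are our assumptions.

```python
def extend_frequent(db, pattern, ends, min_sup, tau, found):
    """Depth-first growth of pattern; ends maps a sequence ID to the set of
    positions where a match of pattern ends in that sequence."""
    found[pattern] = len(ends)
    # Collect, per symbol and sequence, the (gap, position) pairs reachable
    # within tau + 1 positions after some match end.
    reach = {}
    for sid, positions in ends.items():
        seq = db[sid]
        for end in positions:
            for gap in range(tau + 1):
                j = end + 1 + gap
                if j < len(seq):
                    reach.setdefault(seq[j], {}).setdefault(sid, set()).add((gap, j))
    for r in sorted(reach):
        hits = reach[r]
        if len(hits) < min_sup:  # r cannot yield a frequent extension
            continue
        for x in range(tau + 1):
            for y in range(x, tau + 1):
                new_ends = {}
                for sid, pairs in hits.items():
                    js = {j for gap, j in pairs if x <= gap <= y}
                    if js:
                        new_ends[sid] = js
                if len(new_ends) >= min_sup:
                    extend_frequent(db, f"{pattern}[{x}, {y}]{r}",
                                    new_ends, min_sup, tau, found)

def mine_frequent(db, min_sup, tau):
    """Start one depth-first search from every frequent 1-pattern."""
    found = {}
    for s in sorted({c for seq in db.values() for c in seq}):
        ends = {sid: {i for i, c in enumerate(seq) if c == s}
                for sid, seq in db.items() if s in seq}
        if len(ends) >= min_sup:
            extend_frequent(db, s, ends, min_sup, tau, found)
    return found

db = {"S1": "abecfe", "S2": "adeabc"}  # hypothetical symbolic sequences
found = mine_frequent(db, min_sup=2, tau=1)
```

On this toy database with δ = 2 and τ = 1, the result contains, among others, a[0, 0]b, a[1, 1]e, and a[0, 0]b[0, 1]c, each with support 2. Adding the three checks of Fig. 4.5 would reduce this output to the closed patterns.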
Lemma 4.4 Every pattern found by the CFP algorithm is frequent and closed.
Proof. In the second phase, we scan the database once to find all frequent 1-patterns. In the third phase, at each level of extension, we combine a frequent k-pattern p from the previous level with every nominee found in p’s projected database to form candidate (k+1)-patterns, k ≥ 1. Whenever a new pattern is formed, we check its support against the minimum support threshold and eliminate it if it is not frequent. Hence, every pattern found by the CFP algorithm is frequent. Moreover, at each level of extension, we adopt the closure checking scheme so that only closed patterns are added to CP. Therefore, every pattern found by the CFP algorithm is frequent and closed.
Lemma 4.5 Every closed flexible pattern can be found by the CFP algorithm.
Proof. In step 2 of the CFP algorithm, we scan the database once to find all frequent 1-patterns. Thus, every frequent 1-pattern in the database can be found by the CFP algorithm. In steps 6-21 of the CFP-Extension procedure, for each frequent k-pattern p, we combine it with every nominee found in p’s projected database and use the index table to enumerate all frequent super (k+1)-patterns of p, k ≥ 1. Thus, every frequent (k+1)-pattern in the database can be found by the CFP algorithm. Once a frequent k-pattern is found, we adopt the closure checking scheme in step 2 of the CFP-Extension procedure, which excludes only non-closed patterns from CP. Therefore, every closed flexible pattern can be found by the CFP algorithm.
Theorem 4.1 The CFP algorithm enumerates all closed flexible patterns in the database.
Proof. By Lemma 4.4, every pattern found by the CFP algorithm is frequent and closed. By Lemma 4.5, every closed flexible pattern can be found by the CFP algorithm. Therefore, we can conclude that the CFP algorithm enumerates all closed flexible patterns in the database.
Theorem 4.2 The time and space complexities of the CFP algorithm are bounded by O(|N|*(r*g + |D|)) and O(lp*|D| + |CP|), respectively, where |D| is the size of the time-series database, |N| is the number of nodes in the sequence tree, lp is the length of the longest frequent flexible pattern, r is the number of nominees for a frequent flexible pattern, g is the number of possible combinations of gap intervals, and |CP| is the number of closed patterns.
Proof. Since the size of the database is |D|, the time complexity of performing the SAX operation is bounded by O(|D|). Scanning the transformed database to obtain all frequent 1-patterns also requires O(|D|) time. Next, consider a pattern p of various lengths. If p is a frequent 1-pattern, the number of extensions is bounded by r*g. Likewise, if p is a frequent 2-pattern, the number of extensions is bounded by r*g. In general, if p is a frequent k-pattern, the number of extensions is again bounded by r*g. Moreover, for each frequent k-pattern p, the time required to scan p’s projected database to find the nominees of p is bounded by O(|D|). Since the total number of nodes in the sequence tree is |N|, the time complexity of the CFP algorithm is bounded by O(|D| + |D| + |N|*(r*g + |D|)) = O(|N|*(r*g + |D|)). As the CFP algorithm is based on the depth-first search approach, the maximum number of nodes kept in memory is bounded by lp, where each node requires O(|D|) space in the worst case to maintain its projected database. Moreover, the closed pattern pool requires O(|CP|) space. Therefore, the space complexity of the CFP algorithm is bounded by O(|D| + lp*|D| + |CP|) = O(lp*|D| + |CP|).
4.6 An example
Let us demonstrate how the CFP algorithm works. Consider a time-series database containing four sequences, each of which is transformed into a symbolic sequence as shown in Fig. 4.1. Assume that δ = 3 and τ = 2. After scanning the transformed database and computing the support of each 1-pattern, we find four frequent 1-patterns and add them to level 1 of the sequence tree as shown in Fig. 4.2.
Next, we apply the redundant extension pruning strategy to a and find that it should be extended. We then construct a’s projected database, which is {<b c d b a>, <c a>, <b f c c d>}, and scan it to find nominee(a) = {c}. Since c appears in the first 3 (= τ+1) positions of each sequence in a’s projected database, a does not pass the closure checking scheme, and hence it is not closed. The node of a is changed into a dotted circle as shown in Fig. 4.2. Subsequently, a is extended by concatenating it with c and one of the following gap intervals: [0, 0], [0, 1], [0, 2], [1, 1], [1, 2], and [2, 2]. Since δ = 3, the only frequent 2-pattern obtained is a[0, 2]c, with support equal to 3. The projected database of a[0, 2]c is {<d b a>, <a>, <c d>}. Since we cannot find any nominee in this projected database, no more frequent patterns can be generated.
Since pattern b passes the redundant extension pruning strategy, it should be extended. Next, we construct b’s projected database, which is {<c d b a>, <e a c a>, <e c b d>, <f c c d>}. We then scan the projected database and find nominee(b) = {c}. We also perform the closure checking scheme and find that c appears in the first 3 positions of each sequence in b’s projected database, and hence b is not closed. The node of b is changed into a dotted circle as shown in Fig. 4.2. Then, we discover three frequent 2-patterns b[0, 1]c, b[0, 2]c, and b[1, 2]c with supports 3, 4, and 3, respectively. We recursively perform the pattern extension procedure on these frequent 2-patterns to get frequent super-patterns until no more frequent patterns can be generated. As shown in Fig. 4.2, the subtree under node b contains three closed patterns b[0, 2]c, b[1, 2]c, and b[0, 1]c[0, 1]d. Similarly, we can extend patterns c and d to find their frequent super-patterns. Finally, we find all closed flexible patterns in the database, including a[0, 2]c, b[0, 2]c, b[1, 2]c, b[0, 1]c[0, 1]d, as shown in Fig. 4.2.
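The supports quoted in this example can be reproduced directly from the projected databases listed above. The sketch below is illustrative only. It assumes that, for these data, it suffices to look at the offset of the nominee from the start of each projected sequence, and it implements only the forward check described in this example (a symbol occurring within the first τ+1 positions of every projected sequence), which may not cover every condition of Lemma 4.1. The function names are ours.

```python
def index_table(pdb, r, tau):
    """Column g lists the projected sequences (by index) with r at offset g."""
    return [{i for i, seq in enumerate(pdb) if g < len(seq) and seq[g] == r}
            for g in range(tau + 1)]

def frequent_intervals(pdb, r, tau, min_sup):
    """Map each gap interval (x, y) with support >= min_sup to its support."""
    table = index_table(pdb, r, tau)
    result = {}
    for x in range(tau + 1):
        for y in range(x, tau + 1):
            sup = len(set().union(*table[x:y + 1]))
            if sup >= min_sup:
                result[(x, y)] = sup
    return result

def has_forward_extension(pdb, tau):
    """True if some symbol occurs in the first tau + 1 positions of every
    projected sequence, in which case the pattern is not closed."""
    windows = [set(seq[:tau + 1]) for seq in pdb]
    return bool(windows) and bool(set.intersection(*windows))

pdb_a = ["bcdba", "ca", "bfccd"]         # projected database of a
pdb_b = ["cdba", "eaca", "ecbd", "fccd"]  # projected database of b
pdb_ac = ["dba", "a", "cd"]               # projected database of a[0, 2]c

freq_a = frequent_intervals(pdb_a, "c", tau=2, min_sup=3)
freq_b = frequent_intervals(pdb_b, "c", tau=2, min_sup=3)
```

The forward check reports an extension for both a and b, in line with the two dotted nodes in Fig. 4.2, and none for a[0, 2]c.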