The CMP algorithm - Mining Closed Patterns in Multi-Sequence Time-Series Databases

Chapter 3 Mining Closed Patterns in Multi-Sequence Time-Series Databases

3.4 The CMP algorithm

[…] right before […] in every transaction of the projected database, we can conclude that […] exist another frequent pattern […] whose support is equal to the support of […] pruned because the pattern […] would be generated later.

Lemma 3.3 (Same projections). A pattern p can be pruned during the process of mining closed patterns if (1) p shares the same parent with a pattern q in the frequent pattern tree, (2) q contains p, and (3) p’s projections are identical to q’s projections.

Proof. Since p and q share the same projections and q contains p, every pattern generated from p is contained in a pattern generated from q, and the two generated patterns have the same support. That is, no pattern generated from p is closed. Thus, p may be pruned during the process of mining closed patterns.

For example, as shown in Fig. 3.2, pattern […] contains pattern […], and both patterns share the same parent in the frequent pattern tree. Moreover, both patterns have the same projections, since both appear at the first and fourth positions of the first transaction, at the fourth position of the second transaction, and at the fourth position of the third transaction. Therefore, by Lemma 3.3, the contained pattern […] can be pruned.
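The pruning condition of Lemma 3.3 can be sketched in Python. The sketch is illustrative only and rests on simplifying assumptions that are not part of the original text: each transaction is a single symbolic string rather than multiple sequences, `_` denotes a gap, a projection is identified by the (transaction, start position) pairs at which a pattern occurs (zero-based), and condition (1), the shared parent, is taken as given. The helper names and the toy database are hypothetical; the database is not the one in Fig. 3.2.

```python
GAP = "_"  # assumed gap symbol; must not occur as a data symbol


def occurrences(pattern, db):
    """Return the set of (transaction id, start position) pairs where pattern matches."""
    hits = set()
    for tid, seq in enumerate(db):
        for start in range(len(seq) - len(pattern) + 1):
            window = seq[start:start + len(pattern)]
            if all(p == GAP or p == s for p, s in zip(pattern, window)):
                hits.add((tid, start))
    return hits


def contains(q, p):
    """True if p matches somewhere inside q, with gaps in p matching any symbol of q."""
    for offset in range(len(q) - len(p) + 1):
        window = q[offset:offset + len(p)]
        if all(a == GAP or a == b for a, b in zip(p, window)):
            return True
    return False


def can_prune(p, q, db):
    """Conditions (2) and (3) of Lemma 3.3; the shared parent (1) is assumed."""
    return p != q and contains(q, p) and occurrences(p, db) == occurrences(q, db)


# Hypothetical toy database: one symbolic string per transaction.
toy_db = ["abcabc", "xyzabc", "zzzabc"]
prune_gapped = can_prune("a_c", "abc", toy_db)
prune_full = can_prune("abc", "a_c", toy_db)
```

Here `a_c` and `abc` both extend the prefix `a` and occur at exactly the same places, so under these assumptions `a_c` would be pruned while `abc` is kept.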

The proposed CMP algorithm is shown in detail in Fig. 3.3. First, we transform every sequence in the time-series database into the SAX representation [37] as described in Section 2.4. That is, the time-series database is transformed into a database where each transaction contains multiple symbolic sequences. Second, we scan the transformed database to find all frequent 1-patterns, and build a projected database for each frequent 1-pattern. Third, for each frequent pattern found, the algorithm calls the CMP-Growth procedure to recursively use a frequent k-pattern and its projected database to generate its frequent super-patterns at the next level in the frequent pattern tree, where k ≥ 1.
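The transformation step can be illustrated with a minimal SAX sketch. Section 2.4 is not reproduced here, so the code follows the commonly described SAX recipe (z-normalization, piecewise aggregate approximation, and equiprobable Gaussian breakpoints) as an assumption rather than the exact procedure used in this work. The function name `sax`, its parameters, and the default four-letter alphabet are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series to a symbolic string (assumed standard SAX recipe).

    Assumes len(series) is a multiple of n_segments.
    """
    mu, sigma = mean(series), pstdev(series)
    # z-normalize; a constant series maps to all zeros
    z = [(x - mu) / sigma if sigma else 0.0 for x in series]
    # piecewise aggregate approximation: mean of each equal-width frame
    width = len(z) // n_segments
    paa = [mean(z[i * width:(i + 1) * width]) for i in range(n_segments)]
    # breakpoints that cut the standard normal into equiprobable regions
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


symbols = sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4)
```

Setting `n_segments` to the length of the series maps every numerical value to its own symbol, which skips the aggregation step.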

Algorithm: CMP

Input: an input database DB, a minimum support threshold δ, a maximum gap threshold τ

Output: all closed patterns CP

1 Transform each sequence in DB into a symbolic sequence by using the SAX representation;

2 Let CP be ∅;

3 Scan the transformed database to find all frequent 1-patterns. Let P1 be all frequent 1-patterns found in DB;

4 for each 1-pattern p in P1 do

5 Let PDB be the projected database of p;

6 call CMP-Growth (PDB, p, δ, τ, CP);

7 end for

8 return CP;

Fig. 3.3. The CMP algorithm.
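Steps 3 to 5 of Fig. 3.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: a transaction is a tuple of equal-length symbolic strings, a 1-pattern is taken to be one column of symbols across those sequences, a projected database is stored as a list of (transaction id, next position) pointers, and the minimum support is an absolute transaction count. All function names and the sample data are hypothetical.

```python
from collections import defaultdict


def columns(transaction):
    """View a multi-sequence transaction as a list of columns, one per time position."""
    return list(zip(*transaction))


def support(projection):
    """Number of distinct transactions referenced by a projection."""
    return len({tid for tid, _ in projection})


def frequent_1_patterns(db, min_sup):
    """Map each frequent column to its projected database of (tid, next position) pointers."""
    projections = defaultdict(list)
    for tid, transaction in enumerate(db):
        for pos, col in enumerate(columns(transaction)):
            projections[col].append((tid, pos + 1))
    return {col: proj for col, proj in projections.items() if support(proj) >= min_sup}


# Hypothetical database: three transactions with two symbolic sequences each.
sample_db = [("abab", "xyxy"), ("abca", "xyzx"), ("ccab", "zzxy")]
P1 = frequent_1_patterns(sample_db, min_sup=3)
```

Storing pointers instead of copied suffixes keeps each projected database proportional to the number of occurrences of the pattern.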

Fig. 3.4 shows the CMP-Growth procedure. In step 1, for each frequent k-pattern p, we scan its projected database to find the candidate frequent 1-patterns that occur within the first τ+1 positions. In steps 3-8, we use the backward checking strategy to check whether p can be pruned. If not, we use the forward checking strategy to check whether p is closed. If so, p is added to the closed pattern pool (CP). In steps 10-16, for each frequent 1-pattern c found in step 1, we concatenate p, g gaps, and c to generate a frequent (k+g+1)-pattern, where 0 ≤ g ≤ τ. Then, we check whether the concatenated pattern is contained by any sibling pattern that shares the same projections. If not, the CMP-Growth procedure is called recursively to enumerate the frequent super-patterns of the newly generated frequent (k+g+1)-pattern. The process is repeated until no more patterns can be generated.

Algorithm: CMP-Growth (PDB, p, δ, τ, CP)

Input: a projected database PDB, a prefix pattern p, a minimum support threshold δ, a maximum gap threshold τ

Output: a set of closed patterns CP

1 Let candidate be a set of frequent 1-patterns within the first τ+1 positions in PDB;

2 Let sup be the support of p;

3 if (sup is not less than δ and p passes the backward checking strategy) then

4 if (p passes the forward checking strategy) then

5 if (p is closed with respect to CP) then

6 Add p to CP;

7 end if

8 end if

9 if (candidate is not empty) then

10 for each 1-pattern c in candidate do

11 for g = 0 to τ do

12 Let PDB’ be the projected database of pΘGΘc where G contains g gaps in each sequence;

13 Check if pΘGΘc is contained by any sibling pattern and both share the same projections;

14 if not, call CMP-Growth (PDB’, pΘGΘc, δ, τ, CP);

15 end for

16 end for

17 end if

18 end if

19 return CP;

Fig. 3.4. The CMP-Growth procedure.
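The growth loop of Fig. 3.4 (steps 1 and 9-16) can be sketched as a projection-based depth-first enumeration with a maximum gap. The sketch deliberately does not implement the backward and forward checking strategies or the sibling check of step 13; a naive post-filter that compares every pair of frequent patterns stands in for them. It also assumes single-sequence transactions given as strings, `_` as the gap symbol, and an absolute minimum support. All names and the toy data are hypothetical.

```python
from collections import defaultdict

GAP = "_"  # assumed gap symbol; must not occur as a data symbol


def support(projection):
    return len({tid for tid, _ in projection})


def grow(db, prefix, projection, min_sup, max_gap, found):
    """Record prefix, then extend it with g gaps and one symbol, for 0 <= g <= max_gap."""
    found[prefix] = support(projection)
    for g in range(max_gap + 1):
        extensions = defaultdict(set)
        for tid, pos in projection:
            seq = db[tid]
            if pos + g < len(seq):
                extensions[seq[pos + g]].add((tid, pos + g + 1))
        for symbol, new_projection in extensions.items():
            if support(new_projection) >= min_sup:
                pattern = prefix + (GAP,) * g + (symbol,)
                grow(db, pattern, new_projection, min_sup, max_gap, found)


def mine_frequent(db, min_sup, max_gap):
    """Enumerate all frequent gapped patterns with their supports."""
    first = defaultdict(set)
    for tid, seq in enumerate(db):
        for pos, symbol in enumerate(seq):
            first[symbol].add((tid, pos + 1))
    found = {}
    for symbol, projection in first.items():
        if support(projection) >= min_sup:
            grow(db, (symbol,), projection, min_sup, max_gap, found)
    return found


def contains(q, p):
    """True if p matches somewhere inside q, with gaps in p matching any symbol of q."""
    return any(
        all(a == GAP or a == b for a, b in zip(p, q[off:off + len(p)]))
        for off in range(len(q) - len(p) + 1)
    )


def closed_only(frequent):
    """Naive stand-in for backward/forward checking: drop patterns that have
    a different containing pattern with the same support."""
    return {
        p: s for p, s in frequent.items()
        if not any(q != p and t == s and contains(q, p) for q, t in frequent.items())
    }


toy_db = ["abcabc", "xyzabc", "zzzabc"]
frequent = mine_frequent(toy_db, min_sup=3, max_gap=1)
closed = closed_only(frequent)
```

The post-filter is quadratic in the number of frequent patterns, which is the cost the checking strategies in Fig. 3.4 are designed to avoid.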

Lemma 3.4 Each pattern found by the CMP algorithm is frequent and closed.

Proof. In step 3 of the CMP algorithm, we generate all frequent 1-patterns by scanning the database once. In step 1 of the CMP-Growth procedure, we find a set of candidate frequent 1-patterns, which appear within the first τ+1 positions in the projected database of a frequent k-pattern p. These candidates are then concatenated with p and with gaps to form frequent (k+g+1)-patterns, k ≥ 1. Since we always check the support of each newly found pattern against the minimum support threshold, each pattern found by the CMP algorithm is frequent. Moreover, at each level of extension, we adopt the forward and backward checking strategies to eliminate non-closed patterns. Therefore, every pattern found by the CMP algorithm is frequent and closed.

Lemma 3.5 Every closed pattern can be found by the CMP algorithm.

Proof. Since we scan the database once to find all frequent 1-patterns, every frequent 1-pattern in the database can be found by the CMP algorithm. To extend each frequent k-pattern p, we combine it with gaps and with all possible candidate frequent 1-patterns found in p’s projected database to generate all frequent super-patterns of p. Thus, every frequent k-pattern in the database can be found by the CMP algorithm. Once a frequent k-pattern is found, we adopt the forward and backward checking strategies to eliminate non-closed patterns. Therefore, every closed pattern can be found by the CMP algorithm.

Theorem 3.1 The CMP algorithm enumerates all closed patterns in the database.

Proof. By Lemma 3.4, every pattern found by the CMP algorithm is frequent and closed. By Lemma 3.5, every closed pattern can be found by the CMP algorithm. Therefore, we can conclude that the CMP algorithm enumerates all closed patterns in the database.

Theorem 3.2 The time and space complexities of the CMP algorithm are bounded by O(|N|*|D|) and O(lp*|D| + |CP|), respectively, where |D| is the size of the time-series database, |N| is the number of nodes in the frequent pattern tree, lp is the length of the longest frequent pattern, and |CP| is the number of closed patterns.

Proof. To transform a time-series database into a symbolic database, the SAX operation is applied to each numerical value, and hence the time complexity of transforming all time-series in the database into symbolic sequences is bounded by O(|D|). By scanning the transformed database, we obtain all frequent 1-patterns and this requires O(|D|) time.

Next, let us consider patterns of various lengths. If p is a frequent 1-pattern, the number of extensions is bounded by |C|, where |C| is the average number of local frequent 1-patterns in a projected database. Likewise, if p is a frequent 2-pattern, the number of extensions is bounded by |C|. In general, if p is a frequent k-pattern, the number of extensions is again bounded by |C|. Moreover, for each frequent k-pattern, its projected database is scanned once to find the candidate frequent 1-patterns that occur within the first τ+1 positions. Thus, the time complexity of scanning a projected database is bounded by O(|D|). Since the total number of nodes in the frequent pattern tree is |N|, the time complexity of the CMP algorithm is bounded by O(|D| + |D| + |N|*(|C| + |D|)) = O(|N|*|D|).

Since the CMP algorithm mines patterns in a DFS manner, the maximum number of nodes kept in memory is bounded by lp, where each node also maintains a projected database that requires O(|D|) space in the worst case. Moreover, the closed pattern pool requires O(|CP|) space. Therefore, the space complexity of the CMP algorithm is bounded by O(|D| + lp*|D| + |CP|) = O(lp*|D| + |CP|).
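The simplification of the two bounds can be written out step by step. The labels T and S, and the side conditions |C| ≤ |D|, |N| ≥ 1, and lp ≥ 1, are assumptions added here to make the absorption of the smaller terms explicit; the proof above does not state them.

```latex
\begin{align*}
T &= O\bigl(|D| + |D| + |N|\,(|C| + |D|)\bigr) \\
  &= O\bigl(2|D| + |N|\,|C| + |N|\,|D|\bigr) \\
  &= O\bigl(|N|\,|D|\bigr)
     \quad \text{if } |C| \le |D| \text{ and } |N| \ge 1, \\[4pt]
S &= O\bigl(|D| + l_p\,|D| + |CP|\bigr) \\
  &= O\bigl(l_p\,|D| + |CP|\bigr)
     \quad \text{if } l_p \ge 1.
\end{align*}
```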
