The CNP algorithm - Mining Multi-Resolution Closed Numerical Patterns in Multi-Sequence

Chapter 5 Mining Multi-Resolution Closed Numerical Patterns in Multi-Sequence

5.7 The CNP algorithm

[…] to be the pivot. By checking the array, we find that index i is not moved, since the first element is similar to itself, whereas index j is moved forward to the fourth position. From i to j−1, there are three candidates which may be similar to the pivot, namely […]. We further check whether these candidates are similar to the pivot and find that all of them are. Thus, the corresponding projections are stored in the projected database of […]. […] check if it is a frequent 1-pattern. We find that i is unchanged while j is moved to the fifth position. Therefore, the corresponding projections are stored in the projected database of […]. We further check the next pivot […] and has the same projected database as […]. This process is continued until the last element […].

The proposed algorithm, called CNP, is illustrated in Fig. 5.6. It contains a procedure, CNP-Growth, which is shown in Fig. 5.7.

The CNP algorithm consists of three phases. The first phase normalizes each time-series sequence in the database and then applies the Haar wavelet transform to convert each sequence into the low resolution (steps 2-3). The second phase scans the transformed database once to find all frequent 1-patterns and constructs a projected database for each 1-pattern found, as described in Section 5.6. If any two frequent 1-patterns have the same projected database, the larger one is merged by the smaller one (step 4). The third phase performs pattern extension and pattern restoring (steps 5-13). That is, for each frequent 1-pattern, we call the CNP-Growth procedure to enumerate the closed patterns. Then, we restore the patterns in the high resolution from the patterns mined in the low resolution. Furthermore, we find all frequent 1-patterns in the high resolution from the original database. Finally, for each closed pattern found, we check if it can be merged by the other patterns.
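The first phase can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not fix: min-max scaling stands in for the unspecified normalization, the low resolution is taken to be the unscaled average of each non-overlapping pair of adjacent points, and a trailing unpaired point is simply dropped. The function names normalize and haar_low_resolution are hypothetical.

```python
from typing import List


def normalize(series: List[float]) -> List[float]:
    """Scale a series to [0, 1] (min-max scaling is an assumed choice)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series carries no shape information; map it to zeros.
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]


def haar_low_resolution(series: List[float]) -> List[float]:
    """Average each non-overlapping pair of adjacent points.

    A trailing unpaired point is dropped in this sketch (an assumption).
    """
    return [(series[i] + series[i + 1]) / 2
            for i in range(0, len(series) - 1, 2)]


if __name__ == "__main__":
    print(haar_low_resolution(normalize([2.0, 4.0, 6.0, 4.0])))
```

Each low-resolution point summarizes two high-resolution points, which is why a k-pattern mined in the low resolution corresponds to candidates of length 2k and 2k+1 in the high resolution.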

Algorithm: CNP

Input: a time-series database DB, a minimum support threshold δ

Output: all closed patterns CP_L and CP_H in the low and high resolutions, respectively

1 Initialize CP_L and CP_H to ∅;

2 Normalize all time-series in DB;

3 Apply the Haar wavelet transform to each time-series in DB and store the transformed time-series in RDB;

4 Let P1 be all frequent 1-patterns found in RDB and merge similar 1-patterns into one if they have the same projected database;

5 for each 1-pattern p ∈ P1 do

6 Let PDB be the projected database of p;

7 call CNP-Growth (p, PDB, δ, P1, CP_L, CP_H);

8 Restore p and each of the patterns merged by p respectively to find the frequent 2-patterns and 3-patterns in the high resolution and add them to CP_H if they are closed;

9 end for

10 Find frequent closed 1-patterns in the high resolution and add them to CP_H;

11 for each closed pattern p in the low and high resolutions do

12 Use its max-min bound to decide if it is a representative pattern;

13 end for

14 return CP_L and CP_H;

Fig. 5.6. The CNP algorithm.

Procedure: CNP-Growth (p, PDB, δ, P1, CP_L, CP_H)

Input: a k-pattern p, a projected database PDB, a minimum support threshold δ, and a set of frequent 1-patterns P1

Output: a set of closed patterns CP_L in the low resolution and a set of closed patterns CP_H in the high resolution

1 if (p is closed) then

2 Add p to CP_L;

3 end if

4 if (p passes the pre-pruning strategy and there exists an adjacent frequent 1-pattern q that appears right after p) then

5 Append q to p to form a candidate (k+1)-pattern p’;

6 if (p does not pass the post-pruning strategy) then

7 Remove q from P1;

8 end if

9 Build a new projected database PDB’ for p’ and derive the merge list of p’;

10 if (p’ is frequent) then

11 call CNP-Growth (p’, PDB’, δ, P1, CP_L, CP_H);

12 Restore p’ and each of the patterns merged by p’ respectively to find the frequent 2k-patterns and (2k+1)-patterns in the high resolution and add them to CP_H if they are closed;

13 end if

14 for each frequent pattern m’ that is extended from p’s merge list but not merged by p’ do

15 Build a projected database PDB_m’ for m’ and derive the merge list of m’;

16 call CNP-Growth (m’, PDB_m’, δ, P1, CP_L, CP_H);

17 Restore m’ and each of the patterns merged by m’ respectively to find the frequent 2k-patterns and (2k+1)-patterns in the high resolution and add them to CP_H if they are closed;

18 end for

19 end if

20 return CP_L and CP_H;

Fig. 5.7. The CNP-Growth procedure.

In the CNP-Growth procedure, upon getting a frequent k-pattern p, we first check it against the already mined patterns in CP_L to determine whether it is closed (steps 1-3). If it is closed, we add it to CP_L and then extend it.

Next, p is checked against the pre-pruning strategy and against the condition that there exists an adjacent frequent 1-pattern q appearing right after p. If both hold, we append q to p to form a candidate pattern p’ (steps 4-5). Otherwise, we stop growing p. Subsequently, we use the post-pruning strategy to prune unnecessary frequent 1-patterns (steps 6-8).

The projected database of p’ can be derived from the projected database of p (step 9). That is, we extend each projection in p’s projected database and check if the newly extended projection is similar to p’. Accordingly, we obtain a new projected database for p’. If the support of p’ is not less than δ, it is a frequent (k+1)-pattern. To further extend p’ to find longer patterns, we recursively call the CNP-Growth procedure in step 11.
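The derivation of a projected database in step 9 can be sketched as follows. This is only an illustration under stated assumptions: a projection is modeled as a hypothetical (sequence id, start, end) triple with inclusive 0-based indices, two s-points are taken to be similar when their absolute difference is at most ε, and support is counted as the number of distinct sequences that contain a projection. None of these conventions is confirmed by this section.

```python
from typing import List, Tuple

# Hypothetical layout: (sequence id, start index, end index), inclusive.
Projection = Tuple[int, int, int]


def extend_projected_db(db: List[List[float]],
                        pdb: List[Projection],
                        q: float,
                        eps: float) -> List[Projection]:
    """Extend every projection of p by one position and keep those whose
    new s-point is within eps of q (assumed similarity test)."""
    extended = []
    for sid, start, end in pdb:
        nxt = end + 1
        if nxt < len(db[sid]) and abs(db[sid][nxt] - q) <= eps:
            extended.append((sid, start, nxt))
    return extended


def support(pdb: List[Projection]) -> int:
    """Count distinct sequences with at least one projection (assumed)."""
    return len({sid for sid, _, _ in pdb})


if __name__ == "__main__":
    db = [[0.5, 0.7, 0.1], [0.52, 0.69], [0.5, 0.2]]
    pdb_p = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
    pdb_p_prime = extend_projected_db(db, pdb_p, q=0.7, eps=0.1)
    print(pdb_p_prime, support(pdb_p_prime))
```

Because only the projections of p are revisited, the cost of one extension is bounded by the size of p’s projected database rather than by a rescan of the whole transformed database.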

In addition, we use the merge list of p to find out which patterns can be merged by p’. For each k-pattern m in p’s merge list, we extend m to find a candidate (k+1)-pattern m’. If m’ is similar to p’ and both of them have the same projected database, m’ is merged by p’ and stored in the merge list of p’. In contrast, if m’ has a different projected database from p’ and its support is not less than δ, it is a frequent (k+1)-pattern. Thus, we build the projected database for m’ and derive its merge list in steps 14-15, and we extend m’ to find frequent super-patterns by calling the CNP-Growth procedure in step 16. The CNP-Growth procedure is recursively called until no more patterns can be generated.

To restore each k-pattern p found in the low resolution to the high resolution, we first recompose the s-points of p. Recall that we compute the average of two adjacent s-points appearing in T[i, i+1] to obtain a transformed s-point located in S[i, i], as shown in Fig. 5.3, where 1 < i < l−1 and l is the length of T. By referring back to the original database, each s-point in the low resolution can be recomposed into two s-points in the high resolution.

Since p has k s-points, it can be restored to a candidate pattern that has 2k s-points. That is, assuming p appears in S[j−2k+2, j], we can obtain a restored pattern q₁ of length 2k which appears in T[j−2k+2, j+1]. Likewise, a pattern p’ of length k+1 in the low resolution can be restored to a candidate (2k+2)-pattern in the high resolution.

However, no candidate of length 2k+1 is restored in this way. Thus, we use p to derive a pattern q₂ of length 2k+1 which appears in T[j−2k+2, j+2]. In other words, we can use p to find two candidate patterns in the high resolution.
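The index arithmetic of the restoring step can be checked with a small sketch. The function name restored_spans is hypothetical, the spans are returned as inclusive (first, last) pairs using the same indexing as the text, and no bounds check against the length of T is performed.

```python
from typing import Tuple

Span = Tuple[int, int]  # inclusive (first, last) positions in T


def restored_spans(j: int, k: int) -> Tuple[Span, Span]:
    """Return the spans of q1 (length 2k) and q2 (length 2k+1) for a
    k-pattern p that appears in S[j-2k+2, j]."""
    first = j - 2 * k + 2
    q1 = (first, j + 1)  # covers 2k s-points
    q2 = (first, j + 2)  # covers 2k+1 s-points
    return q1, q2


def span_length(span: Span) -> int:
    """Number of s-points covered by an inclusive span."""
    return span[1] - span[0] + 1


if __name__ == "__main__":
    q1, q2 = restored_spans(j=6, k=2)
    print(q1, span_length(q1), q2, span_length(q2))
```

For any j and k the two spans have lengths 2k and 2k+1, which matches the two candidate lengths stated above.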

To determine the supports of q₁ and q₂, we first restore each projection in p’s projected database and check if it is similar to q₁. If so, it is stored in q₁’s projected database. Once the projected database of q₁ is built, we use it to derive the projected database of q₂ by extending the projections in q₁’s projected database. As a result, both supports are calculated. If their supports are not less than δ, q₁ is a frequent 2k-pattern and q₂ is a frequent (2k+1)-pattern. We then use the patterns in the closed pattern pool to check whether q₁ and q₂ are closed. For example, assume that δ = 2, ε = 0.1, and an original database contains two transactions: T1 = […]. From the transformed database, we find that p = […] is a frequent 1-pattern in the low resolution. When we restore it to the high resolution, we may obtain two candidate patterns, including q₁ = […]

Note that each k-pattern m merged by p in the low resolution is also restored to generate a candidate 2k-pattern and a candidate (2k+1)-pattern in the high resolution, in order to verify whether they can be merged by the restored patterns generated from p. If not, we compare them with the already mined closed patterns and output them to the closed pattern pool if they are closed.

It is shown that frequent 1-patterns in the low resolution are restored to find frequent 2-patterns and frequent 3-patterns in the high resolution; however, no frequent 1-patterns in the high resolution are retrieved. Thus, we mine frequent closed 1-patterns from the original database and add them to the closed pattern pool in step 10 of the CNP algorithm.

Once we have obtained all closed patterns in both low and high resolutions, we then proceed to select representative patterns among all closed patterns in each resolution in steps 11-13 of the CNP algorithm. Recall that while we build the projected database of a pattern, we also compute its max-min bound. We then use this information to select the representative pattern. For a pair of closed k-patterns, if they have identical max-min bounds, we choose the smaller pattern to be the representative pattern. As a result, we can obtain all representative closed patterns in both low and high resolutions.
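The selection of representative patterns in steps 11-13 can be sketched as a grouping by max-min bound. The representation is an assumption: a pattern is a tuple of values, its max-min bound is a tuple holding one (max, min) pair per s-point, and "smaller" is interpreted as lexicographic order on the pattern values. The function name select_representatives is hypothetical.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[float, ...]
# Assumed layout: one (max, min) pair per s-point of the pattern.
Bound = Tuple[Tuple[float, float], ...]


def select_representatives(bounds: Dict[Pattern, Bound]) -> List[Pattern]:
    """Keep, for each distinct max-min bound, the smallest pattern.

    Bounds of patterns with different lengths have different lengths,
    so only patterns of the same length k can share a bound.
    """
    best: Dict[Bound, Pattern] = {}
    for pattern, bound in bounds.items():
        current = best.get(bound)
        if current is None or pattern < current:
            best[bound] = pattern
    return sorted(best.values())


if __name__ == "__main__":
    shared = ((0.55, 0.45), (0.75, 0.65))
    closed = {
        (0.5, 0.7): shared,
        (0.52, 0.68): shared,
        (0.1,): ((0.1, 0.1),),
    }
    print(select_representatives(closed))
```

This sketch groups patterns through a dictionary instead of comparing every pair, so it differs from the pairwise comparison that the complexity analysis assumes; the selected representatives are the same under the stated ordering.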

Lemma 5.5 Each pattern found by the CNP algorithm is frequent and closed.

Proof. In the CNP algorithm, the second phase finds all frequent 1-patterns by scanning the transformed database once. The third phase uses the CNP-Growth procedure to recursively find longer patterns. That is, we extend a frequent k-pattern p by appending an adjacent frequent 1-pattern q which appears right after p to form a candidate (k+1)-pattern. Then we check the support of this candidate. If its support is not less than δ, it is a frequent (k+1)-pattern. As a result, every pattern found in the low resolution is frequent. On the other hand, every frequent k-pattern p found in the low resolution is restored to two candidate patterns in the high resolution. We further check if these two candidate patterns are frequent. In addition, we mine frequent 1-patterns from the original database. Therefore, every pattern found in the high resolution is frequent. Moreover, upon getting a frequent pattern in the low resolution, we always check it against the already mined patterns in CP_L to determine if it is closed. If it is closed, it is added to CP_L. Similarly, for every restored frequent pattern or frequent 1-pattern found in the high resolution, we check it against the already mined patterns in CP_H and add it to CP_H if it is closed. Therefore, every pattern found by the CNP algorithm is frequent and closed.

Lemma 5.6 Every closed pattern can be found by the CNP algorithm.

Proof. Since we scan the transformed database once to find all frequent 1-patterns, every frequent 1-pattern in the low resolution can be found by the CNP algorithm. For each frequent k-pattern p in the low resolution, we combine it with an adjacent frequent 1-pattern q which appears right after p to generate all frequent super-patterns of p. Thus, every frequent k-pattern in the low resolution can be found by the CNP algorithm. For each frequent k-pattern found in the low resolution, we restore it to find two candidate patterns in the high resolution and eliminate the infrequent ones. Moreover, we scan the original database once to find all frequent 1-patterns in the high resolution. Therefore, every frequent k-pattern in the high resolution can be found by the CNP algorithm. Once a frequent k-pattern is found either in the low resolution or in the high resolution, we use the already mined patterns in the pool to eliminate non-closed patterns. Therefore, every closed pattern can be found by the CNP algorithm.

Theorem 5.1 The CNP algorithm enumerates all closed patterns in both low and high resolutions in the database.

Proof. By Lemma 5.5, every pattern found by the CNP algorithm is frequent and closed. By Lemma 5.6, every closed pattern can be found by the CNP algorithm. Therefore, we can conclude that the CNP algorithm enumerates all closed patterns in the database in both low and high resolutions.

Theorem 5.2 The time and space complexities of the CNP algorithm are O(|D|*log|D| + |N_L|*|D| + |CP_L|² + |CP_H|²) and O(lp*|D| + lv*(|CP_L| + |CP_H|)), respectively, where the size of a time-series database is |D|, the number of nodes in the numerical pattern tree in the low resolution is |N_L|, the numbers of closed patterns in the low and high resolutions are |CP_L| and |CP_H|, respectively, the average length of the frequent patterns is lv, and the length of the longest frequent pattern is lp.

Proof. According to [19], the time complexity of transforming a time-series in the high resolution into a time-series in the low resolution is bounded by O(l), where l is the length of a time-series. Since the size of a time-series database is |D|, the time complexity of the Haar wavelet transform is bounded by O(|D|). To obtain all frequent 1-patterns in the low resolution, we sort all 1-patterns in the transformed database and check whether each 1-pattern is frequent. This task requires O(|D|*log|D|) time. Next, let us consider the mining process in the low resolution. To extend a frequent k-pattern p, we append an adjacent frequent 1-pattern that appears right after p to form its super-pattern and meanwhile derive a new projected database for the super-pattern. This process takes O(|D|) time since the maximum size of a projected database is |D| in the worst case. Since there are |N_L| nodes in the numerical pattern tree in the low resolution, the time complexity of the CNP algorithm in the low resolution is bounded by O(|D| + |D|*log|D| + |N_L|*|D|) = O(|D|*log|D| + |N_L|*|D|). Next, let us consider the restoring operation in the CNP algorithm. Since p can be restored into at most two frequent patterns in the high resolution, the maximum number of frequent patterns in the high resolution is bounded by 2*|N_L|. During the restoring process, p’s projected database is scanned once to check if the restored patterns are frequent, and hence the time complexity of the restoring operation is bounded by O(|N_L|*|D|). The time complexity of checking if a pattern is closed is bounded by O(|CP_L|) and O(|CP_H|) with respect to the low and high resolutions. Moreover, the time complexity of the merge operation in the low resolution is bounded by O(|CP_L|²) because each closed pattern is checked against the other |CP_L| − 1 patterns to determine if it is a representative pattern. Similarly, the time complexity of the merge operation in the high resolution is bounded by O(|CP_H|²). Besides, the time complexity of mining frequent 1-patterns in the high resolution is bounded by O(|D|*log|D|). Therefore, the time complexity of the CNP algorithm is bounded by O(|D| + |D|*log|D| + |N_L|*|D| + 2*|N_L|*|D| + |CP_L| + |CP_H| + |CP_L|² + |CP_H|² + |D|*log|D|) = O(|D|*log|D| + |N_L|*|D| + |CP_L|² + |CP_H|²).

In order to select the representative patterns, we record the max-min bounds for the closed patterns in both low and high resolutions. The memory space used to preserve this information is thus bounded by O(lv*(|CP_L| + |CP_H|)). Since the CNP algorithm proceeds in a DFS manner, the maximum number of nodes kept in memory is bounded by lp. Moreover, each node requires O(|D|) space to maintain its projected database and O(lv) space to store its max-min bounds. To store all closed patterns in both low and high resolutions, we need O(|CP_L| + |CP_H|) space. Therefore, the space complexity of the CNP algorithm is bounded by O(|D| + |D| + lv*(|CP_L| + |CP_H|) + lp*(|D| + lv) + |CP_L| + |CP_H|) = O(lp*|D| + lv*(|CP_L| + |CP_H|)).
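For reference, the two bounds from the proof can be written out in LaTeX. The symbols T_CNP and S_CNP are introduced here only for illustration, and l_p and l_v stand for lp and lv. In the time bound, every dropped term is dominated by a kept term, for example |D| ≤ |D| log|D| and |CP_L| ≤ |CP_L|².

```latex
\begin{align*}
T_{\mathrm{CNP}} &= O\bigl(|D| + |D|\log|D| + |N_L||D| + 2|N_L||D|
      + |CP_L| + |CP_H| + |CP_L|^2 + |CP_H|^2 + |D|\log|D|\bigr) \\
  &= O\bigl(|D|\log|D| + |N_L||D| + |CP_L|^2 + |CP_H|^2\bigr), \\[4pt]
S_{\mathrm{CNP}} &= O\bigl(|D| + |D| + l_v(|CP_L| + |CP_H|)
      + l_p(|D| + l_v) + |CP_L| + |CP_H|\bigr) \\
  &= O\bigl(l_p|D| + l_v(|CP_L| + |CP_H|)\bigr).
\end{align*}
```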

From the document: Closed Pattern Mining in Time-Series Databases (pp. 74-82)