
2.5 Subsequence Similarity Matching

Subsequence similarity matching is the retrieval of similar time series. Given a database D and a query set Q, both containing multiple time series, where the length of each time series in D is greater than the length of each time series in Q, the task is to find, for every time series in Q, similar subsequences within every time series in D.

To complete this task, a measure of subsequence similarity must be defined in advance. For flexibility, the similar subsequences discovered for a query time series may have different lengths. The task is illustrated in Figure 2-11.

  Figure 2-11 Subsequence Similarity Matching

Chapter 3

Proposed Subsequence Similarity Matching Method

3.1 Matching Process Overview

The whole matching process consists of two parts. In the first part, an approximation of the data is created that retains the shape or trend features of the original data, so that the original data are faithfully represented by this approximation. In our work, we propose a representation called Equal Piecewise Linear Representation (EPLR). The goals of EPLR are to reduce dimensionality, to preserve trend information, and to keep the algorithm simple. To begin, both the data in the database and the query data are transformed into EPLR; the process then proceeds to the second part.

In the second part, EPLR similarity retrieval is performed by a two-level shape-based similarity measure. The first level matches subsequences in longer sequences that share the same major trends, which are easily derived from EPLR; this level also acts as an index that filters out many unqualified time series. The second level is a detailed match in which the distance between matched subsequences is computed by a modified DTW to check whether they are really similar. Effectiveness and efficiency are the key issues in this part.

3.2 Equal Piecewise Linear Representation (EPLR)

In this section, we propose a new representation, called Equal Piecewise Linear Representation, to segment a time series effectively and efficiently. EPLR is a special type of PLR, but is much simpler and easier to implement. Section 3.2.1 describes the concept of EPLR, and section 3.2.2 discusses its advantages.

3.2.1 Concept of EPLR

3.2.1.1 Angle Representation

In EPLR, a time series of length n is divided into equal segments, each formed by k points. Unlike PAA, which uses the mean value of each segment as the new representation, EPLR applies linear regression to each equal segment, and the dimension-reduced representation is composed of the angles of the best-fit lines of the segments. An example of EPLR is shown in Figure 3-1, and the difference between PAA and EPLR is easy to see in Figure 3-2. Because the data are represented by angles, normalization can be skipped entirely; the only thing we care about is the trend of the data, not the exact values.

  Figure 3-1 Example of EPLR
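To make the construction concrete, the following is a minimal sketch of the angle representation in Java (the implementation language used in Chapter 4); the class and method names are ours, and we assume the segment length k divides n.

```java
// A minimal sketch of the EPLR angle representation (class and method names
// are ours): each equal segment of length k is fitted by ordinary least
// squares, and the angle of the best-fit line becomes the new data point.
public final class EPLR {

    /** Angle (in degrees) of the least-squares line over series[start .. start+k-1]. */
    static double fitSegmentAngle(double[] series, int start, int k) {
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < k; i++) {
            double x = i, y = series[start + i];
            sumX += x; sumY += y; sumXY += x * y; sumXX += x * x;
        }
        double slope = (k * sumXY - sumX * sumY) / (k * sumXX - sumX * sumX);
        return Math.toDegrees(Math.atan(slope));
    }

    /** Non-overlapping EPLR: reduces a series of length n to m = n / k angles. */
    public static double[] transform(double[] series, int k) {
        int m = series.length / k;
        double[] angles = new double[m];
        for (int s = 0; s < m; s++) {
            angles[s] = fitSegmentAngle(series, s * k, k);
        }
        return angles;
    }
}
```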

3.2.1.2 Data Overlapping

Using the angles of equal-length segments as the new representation of a time series seems a good idea, but on closer inspection there is a problem that we must deal with. The problem is illustrated in Figure 3-3.

  Figure 3-3 Problem of EPLR

In Figure 3-3, we can see that some information is lost, for example when an up trend is followed by another up trend at the end. The trend between the two up trends is uncertain: the data may rise continuously across the two segments, or may rise first, fall sharply near the breakpoint, and then rise again. For this reason, overlapping the data to create additional segments is used to solve the problem. Data overlapping is described in Figure 3-4, and EPLR with 1/2 data overlapping is shown in Figure 3-5.

By compensating for the loss of information, it is easy to discriminate between the two situations above. The range of overlapping data is discussed in Chapter 4.

  Figure 3-4 Data overlapping

  Figure 3-5 EPLR with 1/2 data overlapping
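A minimal sketch of the overlapping variant follows, as an additional method for the EPLR class sketched earlier; we assume a 1/2 overlap means a stride of k/2 points between consecutive segments, so each segment shares half of its points with the next.

```java
// A minimal sketch of EPLR with 1/2 data overlapping, reusing the
// fitSegmentAngle helper from the previous sketch.
public static double[] transformWithOverlap(double[] series, int k) {
    int stride = k / 2;                          // 1/2 overlap between segments
    int m = (series.length - k) / stride + 1;    // number of overlapping segments
    double[] angles = new double[m];
    for (int s = 0; s < m; s++) {
        angles[s] = fitSegmentAngle(series, s * stride, k);
    }
    return angles;
}
```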

3.2.1.3 Formal Definition

Def 1: A time series S of length n is divided into m equal segments, and each segment is represented by the angle of its best-fit line. The resulting EPLR sequence is

S' = {s'1, s'2, …, s'i, …, s'm}

where s'i is the angle of the ith segment. The time series is reduced from dimension n to dimension m, and each new data point represents a set of original data points of equal size.

3.2.2 Advantages of EPLR

EPLR segments a time series equally, with data overlap, and finds the trend of each segment by linear regression. In the following, some advantages of our proposed EPLR are discussed.

1. As mentioned above, a time series is reduced from a higher dimension n to a lower dimension m. Although allowing data overlap increases m, m is still much smaller than n.

2. The effects of time shifting and noise can be reduced by segmentation. As for time shifting, the Euclidean distance between two time series may be huge even if one is merely the other shifted by a single data point to the right or left. After segmentation, this kind of difference can be ignored. The same holds for noise.

3. Although EPLR and PAA are both simple, fast and easy to implement, EPLR captures the trends of the data better. As shown in Figure 3-2, we can intuitively see that the trends of the data are easily identified by EPLR. Beyond this subjective impression, more specific evidence is given in our experimental results.

4. Even though EPLR and PLR both discover trends of time series by linear regression, EPLR is much faster and simpler than PLR. First, EPLR does not need any threshold to determine segments, since the segment length is fixed and the trend of the data can still be represented effectively. Second, existing distance functions can be applied directly to EPLR without aligning segments: two time series of the same length n are transformed by EPLR into new representations of the same length m, so one-to-one matching is available and segment alignment is unnecessary. Finally, the time complexity of EPLR is always Θ(n), because each segment requires only one linear regression. Compared with the PLR algorithms, our proposed method is much better. The comparison between EPLR and the three PLR algorithms is summarized in Table 3-1.

Table 3-1 Comparison between the three PLR algorithms and EPLR

Algorithm            Defines threshold   Worst-case complexity
EPLR                 No                  Θ(n)
Sliding Window PLR   Yes                 Θ(n²)
Top-Down PLR         Yes                 Θ(n³)
Bottom-Up PLR        Yes                 Θ(n²)

5. In some applications we do not need to normalize the raw data, because the transformed data are represented by angles. The trends of the data remain consistent no matter how the data are shifted vertically.

3.3 EPLR Similarity Retrieval

In our similarity retrieval, a sequence is represented as a series of angles of equal segments, and we find the subsequences, likewise represented as angles of equal segments, that are similar to the query. To accomplish this task, a similarity measure is necessary. In this section, a 2-level similarity measure based on EPLR is proposed to define subsequence similarity. Section 3.3.1 introduces the level-1 major trends match, which prunes many non-qualified time series to speed up the retrieval. Section 3.3.2 then discusses the level-2 detailed subsequence match, which computes the distance between subsequences whose major trends match. Lastly, our subsequence similarity is defined in section 3.3.3.

3.3.1 Level-1 Major Trends Match

By intuition and some observations, we believe that two time series with the same shape are similar; conversely, if two time series are similar, their shapes, and especially their major trends, are similar. Figure 3-6 shows the importance of major trends: it contains four time series of different classes in a data set.

Obviously, each time series contains two significant patterns made of major trends, either "Down-Up-Down" or "Up-Down-Up". The combination of these two patterns generates four different kinds of time series. Given a time series, we can easily identify its class from its major trends. Based on this characteristic, a major trends match method built on EPLR is proposed. EPLR segments a time series into a sequence of angles, from which the trends of the data are easily determined. The relationship between EPLR and major trends is described in Figure 3-7. For simplicity, data overlap is not shown in the figures from now on. As shown in Figure 3-7, the segment trends of the two time series are "up-down-down-up" and "up-up-down-up" respectively, but their sequences of major trends are identical.

  Figure 3-6 Importance of Major Trends

  Figure 3-7 The relationship between EPLR and major trends

The concept of the proposed major trends match is shown in Figure 3-8. We merge multiple segment trends into one major trend and then perform subsequence matching on the sequence of major trends. First, each segment is assigned a segment trend: u for up, d for down, f for flat, fu for flat up and fd for flat down.

  Figure 3-8 Concept of Major Trends Match

Corresponding to Figure 3-9, five categories of segment trends are defined as follows:

Def 2:

1. u, if degrees > α
2. d, if degrees < -α
3. f, if β ≥ degrees ≥ -β
4. fu, if α ≥ degrees > β
5. fd, if -α ≤ degrees < -β

α and β are user-defined parameters, and α > β.
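As a sketch, Def 2 can be written in code as follows (the method name is ours); alpha and beta correspond to the thresholds α and β.

```java
// A minimal sketch of Def 2 (method name is ours); alpha and beta are the
// user-defined thresholds with alpha > beta.
static String segmentTrend(double degrees, double alpha, double beta) {
    if (degrees >  alpha) return "u";   // up:        degrees > α
    if (degrees < -alpha) return "d";   // down:      degrees < -α
    if (degrees >  beta)  return "fu";  // flat up:   β < degrees ≤ α
    if (degrees < -beta)  return "fd";  // flat down: -α ≤ degrees < -β
    return "f";                          // flat:      -β ≤ degrees ≤ β
}
```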

 

Before discussing how to define α and β, some characteristics of the datasets need to be mentioned. As described above, a time series is divided into equal segments represented by angles. However, the range of these angles varies a lot from dataset to dataset. For some datasets the angles may be distributed widely, as one would expect, say from -75 to 75 degrees; for other datasets the angles may range only from -1 to 1 degree, a difference too small to see with the naked eye. As the examples in Figure 3-10 show, the range of angles in R1 is large, while in R2 all angles are near 0 degrees. Although the angles of some datasets are extremely small, the differences still exist. For instance, if the range of a dataset is from -2 to 2 degrees, positive angles are still very dissimilar to negative angles, and 0.2 degrees is distinct from 1.5 degrees. Thus the parameters α and β depend on the dataset. We can determine the range of angles by examining one or two time series. Note that the range of angles is almost the same for all time series in a dataset, so there is no need to worry about the range changing. The parameters α and β are specified as follows:

Def 3:

α = maximum degrees × a
β = maximum degrees × b

a and b are user-defined parameters, and 1 > a > b > 0.

Next, we merge multiple segment trends into a major trend. Before introducing how to generate a major trend, some definitions are given as follows:

Def 4:

1. (x1, x2, …, xv)z: Select z items from the v items in ( ), where z ≤ v and items can be selected repeatedly.

Example: (u, d, f)2 has 9 possibilities: uu, dd, ff, ud, du, uf, fu, df or fd.

2. <x>: The item x in < > appears once or not at all.

Example: <u> could be none or u. <du> could also be none or du.

3. [x]: The item x in [ ] appears zero or more times.

Example: [u] could be none, u, uu, or uuuu. [du] could be none, du or dudu.

4. |x|: The item x in | | appears one or more times.

Example: |u| could be u, uu or uuuuu. |du| could be du, dudu or dududu.

With the definitions given above, it is easy to define major trends. There are five kinds of major trends: U, D, F, FU and FD. U is formed by one or more segment trends u, with at most one segment trend f, fu or fd inserted between two u's or appended at the end. D, resembling U, is composed of one or more segment trends d, with at most one f, fu or fd inserted between two d's or appended at the end. Since u is clearly distinct from d, u never merges with d into a U or D. As for F, two or more f, fu or fd can form an F; a single f, fu or fd is not significant enough to qualify.

FU and FD are two special kinds of F: F becomes FU if the number of fu exceeds the number of fd by more than 3, and F becomes FD if the number of fd exceeds the number of fu by more than 3. Every possible combination for major trends is summarized in Table 3-2.

Table 3-2 The summarization of major trends

Major Trend   Definition            Example
U             ||u|<(f, fu, fd)1>|   u, uuu, ufu, uuufuufuu, uufduu
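A minimal sketch of the merging step is given below, using the regular-expression form of the U and D definitions; for brevity the three flat trends f, fu and fd are collapsed to the single symbol f, so the fu/fd counting rule that turns an F into FU or FD is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal sketch of merging segment trends into major trends. The input is
// a string of segment-trend symbols in which f, fu and fd have all been
// collapsed to 'f'.
public final class MajorTrends {
    // U = runs of u with at most one flat between runs or at the end
    //     (the ||u|<(f, fu, fd)1>| pattern of Table 3-2); D is symmetric;
    //     F = two or more consecutive flats. A single leftover flat is
    //     dropped, since one flat alone is not significant.
    private static final Pattern TOKEN =
            Pattern.compile("u+(?:fu+)*f?|d+(?:fd+)*f?|ff+");

    public static List<String> merge(String segmentTrends) {
        List<String> major = new ArrayList<>();
        Matcher m = TOKEN.matcher(segmentTrends);
        while (m.find()) {
            char c = m.group().charAt(0);
            major.add(c == 'u' ? "U" : c == 'd' ? "D" : "F");
        }
        return major;  // e.g. merge("uufuu") -> [U], merge("uudduu") -> [U, D, U]
    }
}
```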

The time series, already transformed by EPLR, is further represented by a sequence of major trends so that a rough match can be performed first. The idea of major trends match was shown in Figure 3-8. Each major trend should match a major trend with the same feature: U matches U, D matches D, and F is not allowed to match U or D. However, matching is not restricted to exactly identical sequences of major trends. F, FU and FD are allowed to match each other, because FU and FD are special kinds of F and should be very similar to it. Besides, FU is allowed to match U and FD is allowed to match D, because we do not want to miss any possibility and both represent upward or downward movement. The matching rules for each major trend are as follows:

Def 5:

1. U: U, FU, Mix of U & FU
2. D: D, FD, Mix of D & FD
3. F: F, FU, FD
4. FU: U, F, FU, FD, Mix of U & FU
5. FD: D, F, FU, FD, Mix of D & FD

For instance, U can match with U, FU and Mix of U & FU, such as UFUU.
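As a sketch, the pairwise compatibility underlying these rules can be written as follows (method name ours; rule 5 for FD is by symmetry with rule 4):

```java
// A minimal sketch of the Def 5 compatibility rules between a query major
// trend and a single candidate major trend; a "Mix" run such as UFUU is
// accepted when every trend in the run is compatible with the query trend.
static boolean compatible(String query, String candidate) {
    switch (query) {
        case "U":  return candidate.equals("U") || candidate.equals("FU");
        case "D":  return candidate.equals("D") || candidate.equals("FD");
        case "F":  return candidate.equals("F") || candidate.equals("FU")
                       || candidate.equals("FD");
        case "FU": return !candidate.equals("D");  // U, F, FU, FD
        case "FD": return !candidate.equals("U");  // D, F, FU, FD (by symmetry)
        default:   return false;
    }
}
```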

Example: Consider a sequence of major trends, DFUDFUFDD, which matches the following sequences of major trends:

1. DFDFUFD
2. DUDFUFD
3. DFDFUD
4. DUDFUD

For simplicity, we make no distinction among F, FU and FD. In the above four sequences of major trends, F can be replaced by FU or FD, so any subsequence identical to one of the four sequences can be matched.

3.3.2 Level-2 Detailed Subsequence Match

As shown in Figure 3-8, each major trend matches exactly one other major trend. For each matched pair of major trends a distance is computed, called the major trend distance, and the final distance between a query and a candidate subsequence is the sum of all major trend distances. The following is the formal definition:

Def 6:

Given two time series Q and S that are major-trends-matched sequences of the same length r, where

Q = {qt1, qt2, …, qti, …, qtr}
S = {st1, st2, …, sti, …, str}

and qti and sti are the ith major trends of Q and S, the ith major trend distance is denoted MTD(qti, sti), and the distance between S and Q, Dist(S, Q), is computed as in Eq. (5):

Dist(S, Q) = Σ (i = 1 to r) MTD(qti, sti)    (5)

An asymmetric distance function modified from DTW is used to compute the major trend distance with respect to the query, because our concern is how similar the subsequences are to the query, not how similar the query is to the subsequences. For two time series Q and S, a distance function is asymmetric if the distance between Q and S, Dist(Q, S), differs from the distance between S and Q, Dist(S, Q). We call this DTW-based distance function Modified DTW (MDTW) to distinguish it from the original DTW. We allow expanding and compressing the time axis to provide a better match, but each element of the query must be matched and is compared exactly once. The only consideration is how similar the shape of the matched subsequence is to the query; we do not want the length or noise of the matched subsequence to affect the result of the similarity matching, so the number of compared elements is fixed to the number of elements in the query. That is, every element in the query must be matched. In our application, running time is not an issue because the data to be processed are localized: the segments in a major trend are far fewer than those in the whole time series. The distance function MDTW is defined as follows:

Def 7:

Given two time series Q and S, of length n and m respectively, where:

Q = {q1, q2, …, qi, …, qn}
S = {s1, s2, …, sj, …, sm}

  Figure 3-11 MDTW Algorithm

An algorithm that computes the MDTW distance by dynamic programming is given in Figure 3-11, and an example is shown in Figure 3-12. Note that although the step patterns are the same for DTW and MDTW, one of the directions has a slightly different meaning: because each query element is compared only once, a move to the right in the warping path means that the query skips one element of the other sequence and compares with the next one. In Figure 3-12, the underlined cells are skipped elements, whose distances are not included in the final distance. The mappings of DTW and MDTW are illustrated in Figure 3-13 and Figure 3-14, respectively, for three time series Q, S1 and S2, where Q = {6, 1, 6, 7, 1}, S1 = {2, 7, 1, 7, 2, 1, 1} and S2 = {4, 3, 6, 1}. Q is obviously more similar to S1 than to S2, yet DTW(S1, Q) = 7 and DTW(S2, Q) = 5; contrary to intuition, the distance between S1 and Q is larger than the distance between S2 and Q because every element must be compared. For MDTW, MDTW(S1, Q) = 2 and MDTW(S2, Q) = 5. This result is more reasonable, because some elements are skipped when checking whether the shape is similar to Q.

  Figure 3-12 Example of MDTW

  Figure 3-13 Mapping of DTW

  Figure 3-14 Mapping of MDTW
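For concreteness, the following is a minimal sketch of an MDTW recurrence reconstructed from the worked example above (Figure 3-11 remains the authoritative statement of the algorithm): diagonal and vertical moves match a query element at its cost, while horizontal moves skip an element of S at no cost.

```java
// A minimal sketch of MDTW reconstructed from the worked example. Each query
// element is matched exactly once (diagonal or vertical move), and elements
// of S may be skipped at no cost (horizontal move), which reproduces
// MDTW(S1, Q) = 2 and MDTW(S2, Q) = 5 from Figures 3-12 to 3-14.
public final class MDTW {
    public static double distance(double[] s, double[] q) {
        int m = s.length, n = q.length;
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = Double.POSITIVE_INFINITY;
        // d[0][j] stays 0: leading elements of S may be skipped for free.
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double cost  = Math.abs(q[i - 1] - s[j - 1]);
                double match = cost + Math.min(d[i - 1][j - 1], d[i - 1][j]);
                d[i][j] = Math.min(match, d[i][j - 1]);  // skip s_j without cost
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        double[] q  = {6, 1, 6, 7, 1};
        double[] s1 = {2, 7, 1, 7, 2, 1, 1};
        double[] s2 = {4, 3, 6, 1};
        System.out.println(distance(s1, q));  // 2.0
        System.out.println(distance(s2, q));  // 5.0
    }
}
```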

3.3.3 Subsequence Similarity

With the major trends match and MDTW defined above, we can now define the subsequence similarity.

Def 8: Given two subsequences Q and S which are both represented by EPLR, Q and S are similar if they satisfy the following two conditions:

1. Major trends of Q and S match with each other.

2. Dist(Q, S) < γ, where Dist(Q, S) is the sum of all major trend distances as computed in Eq. (5) and γ ≥ 0 is a user-specified parameter.

In our definition of subsequence similarity, the first condition serves as an index that quickly checks and filters out many time series that are not qualified. Our experiments show that the major trends match is reliable against the ground truth obtained by DTW. The second condition determines which candidates are truly similar, and Dist(Q, S) can serve as a basis for ranking.
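A minimal sketch of condition 2 follows, assuming condition 1 has already paired up the query's major trends with those of a candidate subsequence; the MDTW sketch above supplies the major trend distance MTD, and gamma corresponds to γ.

```java
// A minimal sketch of condition 2 of Def 8 (names are ours): queryTrends[i]
// and subseqTrends[i] hold the segment angles of the ith matched pair of
// major trends, MDTW.distance plays the role of MTD, and the loop computes
// Dist(Q, S) of Eq. (5) with early abandoning once gamma is exceeded.
static boolean detailedMatch(double[][] queryTrends, double[][] subseqTrends,
                             double gamma) {
    double dist = 0;
    for (int i = 0; i < queryTrends.length; i++) {
        dist += MDTW.distance(subseqTrends[i], queryTrends[i]);
        if (dist >= gamma) return false;  // can no longer satisfy Dist(Q, S) < gamma
    }
    return true;
}
```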

Chapter 4

Experimental Results

In this chapter, we evaluate the performance of the proposed subsequence similarity matching mechanism. As for the implementation environment, all programs are implemented in Java SE 6 and run on an Intel Pentium 4 1.6 GHz CPU with 512 MB of memory under Windows XP Professional Version 2002. The experimental results consist of two parts. Section 4.1 validates EPLR experimentally by clustering and classification, and section 4.2 demonstrates that the similarity retrieval is effective and efficient.

4.1 Experiments of EPLR

Keogh et al. [31] introduced two methods, clustering and classification, to evaluate similarity measures. Lin et al. [21] also utilized these two methods to validate their Symbolic Aggregate approXimation (SAX) representation. Here, we verify the reliability of EPLR with these two methods.

4.1.1 Clustering

Clustering is one of the most common data mining tasks. The Cylinder-Bell-Funnel data set, which consists of three classes, is shown in Figure 4-1.

The goal of clustering is to separate the time series in Figure 4-1 into one cluster per class, for example (1), (2) and (3) in one cluster and (4), (5) and (6) in another. We run clustering experiments on 20 datasets with known class labels (data randomly chosen). Each cluster has a centroid, and each time series in a cluster has a distance to the centroid of its cluster. If the distance between a time series and the centroid of its own cluster is larger than its distance to the centroid of another
