
Chapter 3 Proposed Subsequence Similarity Matching Method

3.3 EPLR Similarity Retrieval

3.3.1 Level-1 Major Trends Match

Intuitively, and from our observations, two time series with the same shape are similar. Conversely, if two time series are similar, their shapes, and especially their major trends, are similar. Figure 3-6 illustrates the importance of major trends: it shows four time series, each from a different class of one data set.

Each time series clearly contains two significant patterns made of major trends, each either “Down-Up-Down” or “Up-Down-Up”. Combining these two patterns yields four different kinds of time series. Given a new time series, we can easily identify its class from its major trends. Based on this characteristic, we propose a major trends match method built on EPLR. EPLR segments a time series into a sequence of angles, from which the trends of the data can be easily determined. The relationship between EPLR and major trends is illustrated in Figure 3-7. For simplicity, data overlap is not shown in the figures from now on. As shown in Figure 3-7, in EPLR the segment trends of the two time series are “up-down-down-up” and “up-up-down-up” respectively, but their sequences of major trends are the same.

Figure 3-6 Importance of Major Trends

Figure 3-7 The relationship between EPLR and major trends

The concept of the proposed major trends match is shown in Figure 3-8. We merge multiple segment trends into a major trend and then perform subsequence matching on the sequence of major trends. First, we assign each segment a segment trend: u for up, d for down, f for flat, f_u for flat up and f_d for flat down.

Figure 3-8 Concept of Major Trends Match

Corresponding to Figure 3-9, five categories of segment trends are defined as follows:

Def 2:

1. u, if degrees > α

2. d, if degrees < -α

3. f, if β ≥ degrees ≥ -β

4. f_u, if α ≥ degrees > β

5. f_d, if -α ≤ degrees < -β

α and β are user-defined parameters, and α > β.
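As a rough illustration of Def 2, the five categories can be written as a small Python function. The function name `segment_trend` and the sample thresholds are hypothetical; only the inequalities come from the definition, and the check that β is non-negative is an added assumption.

```python
def segment_trend(degrees: float, alpha: float, beta: float) -> str:
    """Classify one segment angle (in degrees) following Def 2."""
    # Def 2 requires alpha > beta; beta >= 0 is an extra assumption made here.
    if not alpha > beta >= 0:
        raise ValueError("expected alpha > beta >= 0")
    if degrees > alpha:
        return "u"      # up
    if degrees < -alpha:
        return "d"      # down
    if -beta <= degrees <= beta:
        return "f"      # flat
    if degrees > beta:
        return "f_u"    # flat up: alpha >= degrees > beta
    return "f_d"        # flat down: -alpha <= degrees < -beta


# Hypothetical thresholds, chosen only for this illustration.
angles = [40.0, 12.0, 2.0, -12.0, -40.0]
print([segment_trend(a, alpha=30.0, beta=5.0) for a in angles])
```

Note that an angle exactly equal to α falls into f_u and an angle exactly equal to -α falls into f_d, as the non-strict inequalities in Def 2 require.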

Before discussing how to choose α and β, some characteristics of data sets need to be mentioned. As described above, a time series is divided into equal segments, each represented by an angle. However, the range of angles varies considerably across data sets. For some data sets the angles span a wide range, as one would expect, such as from -75 degrees to 75 degrees; for others, the angles may range only from -1 degree to 1 degree, which is too small for the differences to be seen by the human eye. In the examples shown in Figure 3-10, the range of angles in R1 is large, whereas in R2 all angles are close to 0 degrees. Although the angles of some data sets are extremely small, the differences still exist. For instance, if the angles of a data set range from -2 degrees to 2 degrees, positive angles are quite dissimilar to negative angles, and 0.2 degrees is distinct from 1.5 degrees. Thus, the parameters α and β depend on the data set. The range of angles can be determined by processing one or two time series. Note that the range of angles is almost the same for all time series in a data set, so there is no need to worry about the range changing. Parameters α and β are specified as follows:

Def 3:

α = maximum degrees × a

β = maximum degrees × b

a and b are user-defined parameters, and 1 > a > b > 0.
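A minimal sketch of Def 3 in Python follows. It assumes that "maximum degrees" means the largest absolute angle observed in a sample of the data set; the function name `thresholds` and the sample values are hypothetical.

```python
def thresholds(angles, a, b):
    """Return (alpha, beta) following Def 3."""
    if not 1 > a > b > 0:
        raise ValueError("expected 1 > a > b > 0")
    # Assumption: "maximum degrees" is the largest absolute angle observed.
    max_degrees = max(abs(x) for x in angles)
    return max_degrees * a, max_degrees * b


# Hypothetical sample from a data set whose angles lie between -2 and 2 degrees.
sample = [1.5, -0.2, 0.2, -2.0, 0.9]
alpha, beta = thresholds(sample, a=0.5, b=0.1)
print(alpha, beta)
```

Because α and β scale with the observed range, the same a and b can be reused for a data set with large angles and for one whose angles are all close to 0 degrees.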

Next, we merge multiple segment trends into a major trend. Before introducing how to generate a major trend, some definitions are given as follows:

Def 4:

1. (x₁, x₂, …, x_v)^z: Select z items from the v items in ( ), where z ≤ v; the same item may be selected more than once.

Example: (u, d, f)^2 has 9 possibilities: uu, dd, ff, ud, du, uf, fu, df or fd.

2. <x>: The item x in < > appears once or not at all.

Example: <u> could be none or u. <du> could also be none or du.

3. [x]: The item x in [ ] appears zero or more times.

Example: [u] could be none, u, uu or uuuu. [du] could be none, du or dudu.

4. |x|: The item x in | | appears one or more times.

Example: |u| could be u, uu or uuuuu. |du| could be dudu or dududu.
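The notation of Def 4 is close to standard regular-expression operators, which the following Python sketch uses to check the examples. The regex equivalents are an interpretation made here, not notation taken from the thesis.

```python
import re
from itertools import product

# (u, d, f)^2: choose 2 items from (u, d, f), repetition allowed.
pairs = ["".join(p) for p in product("udf", repeat=2)]
print(len(pairs), pairs)

# Assumed regex equivalents of the other three notations.
optional = re.compile(r"(?:du)?")  # <du>: once or not at all
star = re.compile(r"(?:du)*")      # [du]: zero or more times
plus = re.compile(r"(?:du)+")      # |du|: one or more times

print(bool(optional.fullmatch("")), bool(optional.fullmatch("dudu")))
print(bool(star.fullmatch("dudu")), bool(plus.fullmatch("")))
```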

With the definitions above, major trends are easy to define. There are also five kinds of major trends: U, D, F, F_U and F_D. U is formed by one or more segment trends u, with at most a single segment trend f, f_u or f_d inserted between two u or appended at the end. D, like U, is composed of one or more segment trends d, with at most a single segment trend f, f_u or f_d inserted between two d or appended at the end. Since u is clearly distinct from d, u never merges with d in U or D. As for F, more than two f, f_u or f_d can form an F; a single f, f_u or f_d is not considered significant.

F_U and F_D are two special kinds of F. F becomes F_U if the number of f_u minus the number of f_d is more than 3, and F becomes F_D if the number of f_d minus the number of f_u is more than 3. Every possible combination for major trends is summarized in Table 3-2.

Table 3-2 The summarization of major trends

Major Trend    Definition                    Example

U              | |u| <(f, f_u, f_d)^1> |     u, uuu, ufu, uuufuufuu, uufduu
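The definition of U in Table 3-2 can be checked mechanically. The sketch below is one possible reading: it encodes each segment trend as a single character (an assumption made only so that a regular expression can be applied) and also applies the "difference of more than 3" rule that names a flat run F, F_U or F_D. The names `is_major_u` and `flat_kind` are hypothetical; D would be handled like U with d in place of u.

```python
import re

# Assumption: one character per segment trend; f_u -> "p", f_d -> "m".
CODE = {"u": "u", "d": "d", "f": "f", "f_u": "p", "f_d": "m"}

# One or more u, with at most one flat-type trend between two runs of u
# or at the very end. Same language as (u+ [fpm]?)+ but without nested
# repetition, which keeps matching fast.
U_PATTERN = re.compile(r"u+(?:[fpm]u+)*[fpm]?")


def is_major_u(trends):
    """Return True if the list of segment trends forms a major trend U."""
    encoded = "".join(CODE[t] for t in trends)
    return U_PATTERN.fullmatch(encoded) is not None


def flat_kind(trends):
    """Name a run of flat-type trends: F, F_U or F_D."""
    diff = trends.count("f_u") - trends.count("f_d")
    if diff > 3:
        return "F_U"
    if diff < -3:
        return "F_D"
    return "F"


print(is_major_u(["u", "u", "f_d", "u", "u"]))
print(is_major_u(["u", "f", "f", "u"]))
print(flat_kind(["f_u"] * 5 + ["f_d"]))
```

The second call is rejected because two flat-type trends in a row are not allowed inside U.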

The time series already transformed via EPLR is further represented by a sequence of major trends so that a rough match can be performed first. The idea of the major trends match is shown in Figure 3-8. Each major trend should match another major trend with the same feature: U should match U, D should match D, and F is not allowed to match U or D. However, a sequence of major trends does not have to match exactly the same sequence of major trends. Major trends F, F_U and F_D can match each other, because F_U and F_D are two special kinds of F and should be quite similar to F.

In addition, F_U is allowed to match U and F_D is allowed to match D, because we do not want to miss any possible match and each pair represents the same direction, up or down. The following are our matching rules for each major trend:

Def 5:

1. U: U, F_U, Mix of U & F_U

2. D: D, F_D, Mix of D & F_D

3. F: F, F_U, F_D

4. F_U: U, F, F_U, F_D, Mix of U & F_U

For instance, U can match with U, F_U and Mix of U & F_U, such as UF_UU.
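The matching rules of Def 5 lend themselves to lookup tables. The following Python sketch encodes only the four rules listed above and interprets "Mix of X & Y" as any run of two or more major trends drawn from X and Y; that interpretation and the name `can_match` are assumptions.

```python
# Major trends that a query trend may match one-to-one.
SINGLE = {
    "U": {"U", "F_U"},
    "D": {"D", "F_D"},
    "F": {"F", "F_U", "F_D"},
    "F_U": {"U", "F", "F_U", "F_D"},
}

# Major trends that may appear in a "Mix" matched by a query trend.
MIX = {
    "U": {"U", "F_U"},
    "D": {"D", "F_D"},
    "F_U": {"U", "F_U"},
}


def can_match(query_trend, run):
    """Return True if the query major trend may match the run of major trends."""
    if len(run) == 1:
        return run[0] in SINGLE.get(query_trend, set())
    allowed = MIX.get(query_trend, set())
    # A mix needs at least two trends, all taken from the allowed pair.
    return len(run) > 1 and all(t in allowed for t in run)


print(can_match("U", ["U", "F_U", "U"]))
print(can_match("F", ["U"]))
```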

Example: Consider a sequence of major trends, DFUDFUFDD, which is regarded as the same as each of the following sequences of major trends:

1. DFDFUFD

2. DUDFUFD

3. DFDFUD

4. DUDFUD

For simplicity, we do not distinguish among F, F_U and F_D here. In the four sequences above, F can be replaced by F_U or F_D, so any subsequence that is the same as one of these four sequences can be matched.

3.3.2 Level-2 Detailed Subsequence Match

As shown in Figure 3-8, one major trend matches with only one other major trend.

For each major trend, the distance to the major trend it matches is calculated; this distance is called a major trend distance. The final distance between a query and a subsequence similar to the query is then the sum of all major trend distances.

The following is the formal definition:

Def 6:

Given two time series Q and S that are major-trends-matched sequences of the same length r, where:

Q = {qt_1, qt_2, …, qt_i, …, qt_r}

S = {st_1, st_2, …, st_i, …, st_r}

qt_i and st_i are the ith major trends of Q and S.

The ith major trend distance is denoted MTD(qt_i, st_i), and the distance between S and Q, Dist(S, Q), is computed in Eq. (5):

Dist(S, Q) = \sum_{i=1}^{r} MTD(qt_i, st_i)    (5)

An asymmetric distance function that modifies DTW is used to compute the major trend distance relative to the query, because our concern is how similar the subsequences are to the query, not how similar the query is to the subsequences. Given two time series Q and S, a distance function is asymmetric if the distance between Q and S, Dist(Q, S), can differ from the distance between S and Q, Dist(S, Q). We call this DTW-based distance function Modified DTW (MDTW) to distinguish it from the original DTW. We allow the time axis to be expanded and compressed to provide a better match, but each element of the query must be matched and is compared exactly once. The only consideration is how similar the shape of the matched subsequence is to the query. We do not want the length or noise of the matched subsequence to affect the result of similarity matching, so the number of compared elements is fixed to the number of elements in the query. In our application, computation time is not an issue because the data involved in the calculation is localized: a major trend contains far fewer segments than the whole time series. The definition of the distance function MDTW is as follows:

Def 7:

Given two time series Q and S, of length n and m respectively, where:

Q = {q_1, q_2, …, q_i, …, q_n}

S = {s_1, s_2, …, s_j, …, s_m}

Figure 3-11 MDTW Algorithm

A dynamic programming algorithm for the MDTW distance is given in Figure 3-11, and an example is shown in Figure 3-12. Note that although the step patterns are the same for DTW and MDTW, one of the directions has a slightly different meaning. Because each query element is compared only once, a step to the right in the warping path means that the query skips one element and is compared with the next element. In Figure 3-12, the underlined cells are the skipped elements, whose distances are not included in the final distance. Mapping examples for DTW and MDTW are shown in Figure 3-13 and Figure 3-14, respectively. Both figures use three time series Q, S1 and S2, where Q = {6, 1, 6, 7, 1}, S1 = {2, 7, 1, 7, 2, 1, 1} and S2 = {4, 3, 6, 1}. Intuitively, Q is more similar to S1 than to S2. Consider first the DTW distances: DTW(S1, Q) = 7 and DTW(S2, Q) = 5. Contrary to our intuition, the distance between S1 and Q is larger than the distance between S2 and Q, because every element is compared. With MDTW, MDTW(S1, Q) = 2 and MDTW(S2, Q) = 5. This result is more reasonable because some elements are skipped in order to determine whether the shape is similar to Q.
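The numbers in this example can be reproduced with a short Python sketch. The `dtw` function is the classic recurrence with absolute difference as the local cost. The `mdtw` function is an assumption: the recurrence of Def 7 and Figure 3-11 is not reproduced in the text here, so it is inferred from the description (every query element is compared exactly once, and elements of the other series may be skipped at no cost). Under that reading it yields the four distances quoted above.

```python
import math


def dtw(s, q):
    """Classic DTW with absolute difference as the local cost."""
    n, m = len(q), len(s)
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - s[j - 1])
            table[i][j] = cost + min(
                table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]
            )
    return table[n][m]


def mdtw(s, q):
    """Assumed MDTW: each query element is compared once; elements of s
    may be skipped without adding to the distance."""
    n, m = len(q), len(s)
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0] = [0.0] * (m + 1)  # leading elements of s may be skipped
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - s[j - 1])
            # q[i-1] is compared with s[j-1] after the previous query
            # element was compared with s[j-1] or an earlier element.
            compare = cost + min(table[i - 1][j], table[i - 1][j - 1])
            # s[j-1] is skipped: nothing is added to the distance.
            skip = table[i][j - 1]
            table[i][j] = min(compare, skip)
    return table[n][m]


Q = [6, 1, 6, 7, 1]
S1 = [2, 7, 1, 7, 2, 1, 1]
S2 = [4, 3, 6, 1]
print(dtw(S1, Q), dtw(S2, Q))    # 7.0 5.0
print(mdtw(S1, Q), mdtw(S2, Q))  # 2.0 5.0
```

In the S1 case the two remaining units of distance come from comparing the query value 6 with 7 twice; the elements 2, 2 and the final 1 of S1 are skipped.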

Figure 3-12 Example of MDTW

Figure 3-13 Mapping of DTW

Figure 3-14 Mapping of MDTW

We now define the subsequence similarity.

Def 8: Given two subsequences Q and S which are both represented by EPLR, Q and S are similar if they satisfy the following two conditions:

1. Major trends of Q and S match with each other.

2. Dist(Q, S) < γ, where Dist(Q, S) is the sum of all major trend distances as computed in Eq. (5) and γ ≥ 0 is a user-specified parameter.
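The two conditions of Def 8 amount to a filter followed by a distance test, which can be sketched in Python as below. The predicate `trends_match` and the distance `mtd` are passed in as parameters; the toy versions used in the usage lines (same label, sum of absolute angle differences) are hypothetical stand-ins, not the matching rules or the MDTW distance of this chapter.

```python
def is_similar(q_trends, s_trends, trends_match, mtd, gamma):
    """Def 8: condition 1 is a cheap filter, condition 2 the distance test."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if len(q_trends) != len(s_trends):
        return False
    pairs = list(zip(q_trends, s_trends))
    # Condition 1: the major trends of Q and S must match pairwise.
    if not all(trends_match(qt, st) for qt, st in pairs):
        return False
    # Condition 2: Dist(Q, S), the sum of major trend distances, is below gamma.
    dist = sum(mtd(qt, st) for qt, st in pairs)
    return dist < gamma


# Toy data: each major trend is (label, list of segment angles).
q = [("U", [30, 40]), ("D", [-35])]
s = [("U", [28, 41]), ("D", [-30])]


def same_label(a, b):
    return a[0] == b[0]


def toy_mtd(a, b):
    return sum(abs(x - y) for x, y in zip(a[1], b[1]))


print(is_similar(q, s, same_label, toy_mtd, gamma=10))
```

Checking the cheap condition first means the distance is only computed for candidates that survive the major trends match.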

In our definition of subsequence similarity, the first condition serves as an index that quickly checks and filters out many time series that do not qualify. Our experiments show that the major trends match is reliable with respect to the ground truth obtained by DTW. The second condition determines which subsequences are truly similar, and Dist(Q, S) can be used as a basis for ranking.

Chapter 4 Experimental Results

In this chapter, we evaluate the performance of the proposed subsequence similarity matching mechanism. All programs are implemented in Java SE 6 and run on a machine with a 1.6 GHz Intel Pentium 4 CPU and 512 MB of memory, running Windows XP Professional Version 2002. The experimental results consist of two parts. In Section 4.1, EPLR is validated experimentally by clustering and classification. In Section 4.2, we demonstrate that the similarity retrieval is effective and efficient.

In document: Time Series Data Processing and Similarity Retrieval (pages 35-46)