

Chapter 1 Introduction

1.3 Organization of the Thesis

The rest of this thesis is organized as follows. In Chapter 2, we discuss basic definitions and the terminology used in the rest of the thesis. In Chapter 3, we present our proposed representation, called Equal Piecewise Linear Representation (EPLR), and introduce a subsequence similarity matching mechanism. In Chapter 4, we describe the experiments and the performance evaluations. In Chapter 5, we present the conclusion and future work.

Chapter 2

Problem Definition and Background

In this chapter, we introduce basic definitions and terminology. In Section 2.1, time series data is introduced. Two basic distance functions used as similarity measures are described in Section 2.2. Normalization of time series is described in Section 2.3. Two representations based on segmentation are introduced in Section 2.4. Finally, we describe subsequence similarity matching in Section 2.5.

2.1 Time Series Data

A time series is a sequence of data points, typically measured at a fixed time interval, which can be one minute, one hour, one day, or longer. A time series with n data points is often denoted as

S = {s1, s2, …, si, …, sn}

where si is the state recorded at time i. A time series database is also a sequence database, but a sequence database is not necessarily a time series database. Any database consisting of sequences of ordered events, with or without a concrete notion of time, can be a sequence database. For instance, customer shopping transaction sequences are sequence data, but may not be time series data.

2.2 Basic Distance Functions

2.2.1 Manhattan Distance

Given two time series Q and S, of the same length n, where:

Q = {q1, q2, …, qn} S = {s1, s2, …, sn}

The Manhattan distance D(Q, S) between Q and S is computed in Eq. (1)

D(Q, S) = Σ_{i=1}^{n} |qi − si|    (1)

2.2.2 Euclidean Distance

Given two time series Q and S, of the same length n, where:

Q = {q1, q2, …, qn} S = {s1, s2, …, sn}

The Euclidean distance D(Q, S) between Q and S is computed in Eq. (2)

D(Q, S) = √( Σ_{i=1}^{n} (qi − si)² )    (2)
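To make Eqs. (1) and (2) concrete, the following short Python sketch (an illustration of ours, not taken from any reference) computes both distances for two equal-length sequences.

import math

def manhattan_distance(q, s):
    # Eq. (1): sum of absolute differences of aligned points
    assert len(q) == len(s)
    return sum(abs(qi - si) for qi, si in zip(q, s))

def euclidean_distance(q, s):
    # Eq. (2): square root of the sum of squared differences
    assert len(q) == len(s)
    return math.sqrt(sum((qi - si) ** 2 for qi, si in zip(q, s)))

Q = [1.0, 2.0, 3.0, 4.0]
S = [1.5, 2.5, 2.0, 4.0]
print(manhattan_distance(Q, S))   # 2.0
print(euclidean_distance(Q, S))   # about 1.22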

2.2.3 Dynamic Time Warping (DTW)

Given two time series Q and S, of length n and m respectively, where:

Q = {q1, q2, …, qi, …, qn} S = {s1, s2, …, sj,…, sm}

Let D(i, j) be the minimum distance of matching (q1, q2, …, qi) against (s1, s2, …, sj). The DTW distance can be computed as in Eq. (3)

D(i, j) = dist_dtw(qi, sj) + min{ D(i−1, j−1), D(i−1, j), D(i, j−1) }    (3)

where dist_dtw(qi, sj) can be the Manhattan distance or the Euclidean distance. An algorithm for computing DTW distances by dynamic programming is shown in Figure 2-1, where dist_dtw(qi, sj) is the Manhattan distance. An example of the process is presented in Figure 2-2. The path traced out by the gray cells in Figure 2-2 is called the warping path. The allowed directions of the warping path, known as step patterns, are described in Figure 2-3.

  Figure 2-1 DTW Algorithm

  Figure 2-2 Example of DTW

  Figure 2-3 Step-Patterns
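The figure itself is not reproduced here, but the recurrence of Eq. (3) can be sketched in Python as below. This is a minimal illustration that uses the Manhattan (absolute) difference as dist_dtw; it is not the exact pseudocode of Figure 2-1.

def dtw_distance(q, s):
    # fill an (n+1) x (m+1) table of cumulative costs D(i, j) according to Eq. (3)
    n, m = len(q), len(s)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - s[j - 1])          # Manhattan point distance
            D[i][j] = cost + min(D[i - 1][j - 1],    # diagonal step
                                 D[i - 1][j],        # vertical step
                                 D[i][j - 1])        # horizontal step
    return D[n][m]

print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 4]))   # 0.0: the warping absorbs the repeated value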

2.3 Normalization

Given a time series S = {s1, s2, …, si, …, sn}, it can be normalized by using its mean (μ) and standard deviation (σ) [17]:

Norm(S) = {(s1 − μ)/σ, (s2 − μ)/σ, …, (sn − μ)/σ}

Normalization is highly recommended so that global shifting and amplitude scaling do not affect the distance between two time series. However, from the viewpoint of shape-based similarity, time series that differ only by a global shift still have the same shape, whereas a difference in amplitude scaling may indicate time series that cannot be similar in shape. Therefore, when we compare the shape-based similarity of two time series, we only want to remove the influence of global shifting. In that case a time series only needs to be normalized by its mean (μ):

Norm(S) = {s1 − μ, s2 − μ, …, sn − μ}

In this thesis, the method we propose depends on angles rather than raw values; that is to say, we do not even need to normalize the time series by the mean (μ).
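For concreteness, a small sketch of the two normalizations discussed above follows; it is an illustration of ours, showing full z-normalization on one hand and mean-only centering on the other.

import statistics

def z_normalize(s):
    # subtract the mean and divide by the standard deviation (shift and scale removed)
    mu = statistics.fmean(s)
    sigma = statistics.pstdev(s)
    return [(x - mu) / sigma for x in s]

def mean_center(s):
    # subtract only the mean (shift removed, amplitude kept), as used for shape comparison
    mu = statistics.fmean(s)
    return [x - mu for x in s]

series = [10.0, 12.0, 11.0, 13.0]
print(mean_center(series))    # [-1.5, 0.5, -0.5, 1.5]
print(z_normalize(series))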

2.4 Two Representations by Segmentation

Within a time series, the data points in a period may resemble one another, and such a period can be summarized by a single representative value. Deciding how to divide a time series into such segments is referred to as the segmentation problem. Segmentation problems can be classified into the following three categories.

1. Producing the representation of a time series S in K segments.

2. Producing the representation of a time series S such that the maximum error of any segment does not exceed a threshold.

3. Producing the representation of a time series S such that the total error over all segments is less than a threshold.

In this section, we introduce two kinds of representations based on segmentation. Piecewise Aggregate Approximation is described in Section 2.4.1, and Piecewise Linear Representation in Section 2.4.2.

2.4.1 Piecewise Aggregate Approximation (PAA)

2.4.1.1 Concept of PAA

Keogh et al. [13] proposed PAA as a representation of time series data for dimensionality reduction. As shown in Figure 2-4, PAA transforms a time series from n dimensions to k dimensions, where k is smaller than n. PAA is also a segmentation of the time series: the data is divided into k segments of equal length, and the mean value of each segment forms a new sequence that serves as the dimension-reduced representation.

A time series S of length n is represented in a k-dimensional space by a vector S' = {s'1, s'2, …, s'k}, where each element s'i is the mean of the data points in the i-th equal-length segment.
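A minimal PAA sketch is given below for illustration; it assumes, for simplicity, that n is an exact multiple of k so that every segment has the same number of points.

def paa(series, k):
    # Piecewise Aggregate Approximation: mean of each of k equal-length segments
    n = len(series)
    assert n % k == 0, "this sketch assumes n is a multiple of k"
    seg_len = n // k
    return [sum(series[i * seg_len:(i + 1) * seg_len]) / seg_len for i in range(k)]

print(paa([1, 2, 3, 4, 5, 6, 7, 8], k=4))   # [1.5, 3.5, 5.5, 7.5]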

The advantages of PAA are obvious: it is very fast, easy to implement, reduces dimensionality, and can be used with existing distance functions. However, there are still some disadvantages, which are illustrated in Figure 2-5. As introduced above, PAA reduces dimensionality via the mean values of equal segments, so it is possible to lose important information in some time series. The trends of the sequences S1 and S2 in Figure 2-5 are hard to identify because of the mean values: the two sequences are very similar in PAA, but their trends are quite different. Another problem of PAA is that in some applications the data needs to be normalized before being converted to the PAA representation.

  Figure 2-5 Problem of PAA

2.4.2 Piecewise Linear Representation (PLR)

2.4.2.1 Linear Regression: Method of Least Squares

Regression is a statistical method for modeling and analyzing numerical data. One of its objectives is to find the best-fit line through the data. Linear regression is a form of regression analysis whose model is linear in its parameters,

y = a·x + b    (4)

and it is widely used to determine a trend line, the long-term movement in time series data.

Among the different criteria that can be used to define the best fit, the method of least squares, which is measured by an error value, is commonly used and very powerful. The error value is the sum of squared residuals, where a residual is the difference between an observed value and the value on the fitted line at the same time point. The best-fit line is the one for which this error value is minimal. The concepts of the best-fit line and least squares are shown in Figure 2-6.

  Figure 2-6 Linear Regression
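As an illustration of the method of least squares on one segment, the following sketch fits a line and reports the sum of squared residuals; it uses numpy's polyfit and is an example of ours, not the thesis's own implementation.

import numpy as np

def best_fit_line(segment):
    # return slope, intercept, and sum of squared residuals of the least-squares line
    t = np.arange(len(segment), dtype=float)            # time indices 0, 1, ..., len-1
    (slope, intercept), residuals, *_ = np.polyfit(t, segment, deg=1, full=True)
    sse = float(residuals[0]) if residuals.size else 0.0
    return slope, intercept, sse

slope, intercept, sse = best_fit_line([2.0, 2.9, 4.1, 5.0])
print(slope, intercept, sse)    # roughly 1.02, 1.97, and a small residual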

2.4.2.2 Algorithms for PLR

Although many algorithms exist with different names and slightly different implementation details, they can be grouped into three types [20]. In the following algorithms, linear regression with the method of least squares is applied to decide the segments.

1. Sliding Windows: The Sliding Windows algorithm is simple and widely used in online processing. It first anchors the leftmost data point of the input time series as the start of a potential segment, and the segment then grows by including the following data points as long as the error value of the linear regression does not exceed a threshold defined by the user. Therefore, a segment is formed by a growing subsequence; that is to say, the best segment is computed for a larger and larger window. Figure 2-7 shows the process of the sliding windows algorithm.

  Figure 2-7 Algorithm of Sliding Windows
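A possible rendering of the Sliding Windows idea in Python is sketched below; it is an illustration under our own conventions (disjoint segments, least-squares error as the criterion), not the exact algorithm of Figure 2-7.

import numpy as np

def fit_error(points):
    # sum of squared residuals of the least-squares line through the points
    t = np.arange(len(points), dtype=float)
    _, residuals, *_ = np.polyfit(t, points, deg=1, full=True)
    return float(residuals[0]) if residuals.size else 0.0

def sliding_window_segments(series, max_error):
    # return (start, end) index pairs; each window grows until its error would exceed max_error
    segments, start = [], 0
    while start < len(series) - 1:
        end = start + 2                                  # a segment needs at least 2 points
        while end < len(series) and fit_error(series[start:end + 1]) <= max_error:
            end += 1                                     # grow the window by one point
        segments.append((start, end - 1))
        start = end                                      # next segment starts after this one;
                                                         # a single leftover point is ignored here
    return segments

data = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4]
print(sliding_window_segments(data, max_error=0.5))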

2. Top-Down: The Top-Down algorithm starts by viewing the whole time series as a single segment and recursively partitions it into segments until the error value of every segment is less than a threshold defined by the user. The idea is to find the best segmentation among all possible segmentations. It assumes that each data point may be a breakpoint for partitioning; after the error values of the two segments produced by each candidate breakpoint are computed, the best splitting location can be decided. At that point the time series is separated into two segments, and both segments are checked to see whether their error values are below the threshold. If not, the whole process is repeated on both segments respectively until all segments meet the criterion. The Top-Down algorithm is illustrated in Figure 2-8.

  Figure 2-8 Algorithm of Top-Down
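The Top-Down idea can be sketched recursively as follows; again this is only an illustration, with the breakpoint shared by the two resulting segments.

import numpy as np

def fit_error(points):
    # sum of squared residuals of the least-squares line through the points
    t = np.arange(len(points), dtype=float)
    _, residuals, *_ = np.polyfit(t, points, deg=1, full=True)
    return float(residuals[0]) if residuals.size else 0.0

def top_down(series, lo, hi, max_error, segments):
    # recursively split series[lo..hi] until every piece fits within max_error
    if hi - lo <= 1 or fit_error(series[lo:hi + 1]) <= max_error:
        segments.append((lo, hi))
        return
    # choose the breakpoint that minimizes the combined error of the two halves
    best_split = min(range(lo + 1, hi),
                     key=lambda b: fit_error(series[lo:b + 1]) + fit_error(series[b:hi + 1]))
    top_down(series, lo, best_split, max_error, segments)
    top_down(series, best_split, hi, max_error, segments)

data = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4]
segs = []
top_down(data, 0, len(data) - 1, 0.5, segs)
print(segs)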

3. Bottom-Up: The Bottom-Up algorithm is the opposite of the Top-Down algorithm. In the beginning, the time series of length n is divided into n/2 segments, each formed by only two data points. Since any pair of adjacent segments may be merged to generate a new segment, the error value of each pair is computed and put into an error pool. The process then repeatedly selects the pair with the least error value in the error pool and merges it, until the least error value exceeds a threshold defined by the user. Before deciding which pair to merge next, the error values of the new pairs formed by the merged segment and its adjacent segments must be calculated to update the error pool. The Bottom-Up algorithm is described in Figure 2-9.

  Figure 2-9 Algorithm of Bottom-Up
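A simplified Bottom-Up sketch is given below; for clarity it recomputes all merge costs in each iteration instead of maintaining an incremental error pool, so it is less efficient than the algorithm described above.

import numpy as np

def fit_error(points):
    # sum of squared residuals of the least-squares line through the points
    t = np.arange(len(points), dtype=float)
    _, residuals, *_ = np.polyfit(t, points, deg=1, full=True)
    return float(residuals[0]) if residuals.size else 0.0

def bottom_up(series, max_error):
    # start from 2-point segments and keep merging the cheapest adjacent pair
    bounds = [(i, min(i + 1, len(series) - 1)) for i in range(0, len(series), 2)]
    while len(bounds) > 1:
        # cost of merging each adjacent pair of segments
        costs = [fit_error(series[bounds[i][0]:bounds[i + 1][1] + 1])
                 for i in range(len(bounds) - 1)]
        i = min(range(len(costs)), key=lambda j: costs[j])
        if costs[i] > max_error:                          # cheapest merge already too costly: stop
            break
        bounds[i] = (bounds[i][0], bounds[i + 1][1])      # merge segments i and i+1
        del bounds[i + 1]
    return bounds

data = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4]
print(bottom_up(data, max_error=0.5))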

2.4.2.3 Comparison of Time Complexity

In this part, we discuss the running time of the three algorithms. Since the running time of each method is highly data dependent, only two situations, the best case and the worst case, are considered. For ease of understanding, we assume that the number of data points is n and that the best-fit line of n data points, which serves as a segment, can be computed in θ(n) by linear regression.

1. Sliding Windows:

Best-case: Every time the segment grows by one data point, linear regression has to be applied again. Hence, the fewer data points that form a segment, the less running time is needed. In the best case, every segment of the n-length time series consists of 2 points and the number of segments is n/2. The time for each single segment is θ(2), so the total time complexity is θ(n).

Worst-case: This occurs if all n data points end up in a single segment, which is grown from 2 points up to n points. The total computation time is θ(2) + θ(3) + … + θ(n) = θ(n²).

2. Top-Down:

Best-case: This occurs if every segment is always partitioned at the midpoint. In the first iteration, for each data point i as a breakpoint, the algorithm computes the best-fit line of points 1 to i and the best-fit line of points (i+1) to n. The time for each candidate breakpoint is θ(n), so the time for all n candidates is θ(n²). Because the algorithm is applied recursively to the partitioned segments, the total time complexity for the best case follows the recurrence T(n) = 2T(n/2) + θ(n²), which solves to T(n) = θ(n²).

Worst-case: In contrast to the best case, the breakpoint in the worst case is always near one end rather than at the middle, leaving one side with only 2 points. The total time complexity follows the recurrence T(n) = T(2) + T(n−2) + θ(n²), which terminates after n/2 iterations and gives T(n) = θ(n³).

3. Bottom-Up:

Best-case: This occurs if no segment merges are needed. The first iteration always takes θ(n) to divide the n data points into n/2 segments of 2 points, and computing the error values of all pairs of adjacent segments also takes θ(n). Thus the time complexity is θ(n).

Worst-case: In contrast to the best case with no merges, the worst case must continue merging until a single segment is left. After the first merge of two 2-length segments, every subsequent merge involves a 2-length segment and the growing long segment, until the final segment has length n. Accordingly, the merges increase the length of the merged segment from 2 up to n. The time to compute a merged segment of length i is θ(i), so the total time to reach a length-n segment is θ(2) + θ(4) + … + θ(n) = θ(n²), and the worst-case time complexity is θ(n²).

2.4.2.4 Problems of PLR

PLR not only achieves dimensionality reduction but also approximately captures the trend of the data. However, some weaknesses of PLR are obvious. First of all, the threshold used to decide the segments is difficult to specify; depending on the threshold we choose, the resulting segmentation can be different. Secondly, algorithms for PLR are complicated: linear regression must be applied to the same data point within a segment more than once. Although the number of linear regressions needed is highly data dependent, repeatedly computing linear regression over the same data in a segment is inevitable, so time complexity may be an issue. Last but not least, PLR cannot use existing distance functions directly. Most research on similarity measurement with PLR needs a rule to align segments when comparing two sequences, rather than comparing segments one-to-one, as shown in Figure 2-10.

  Figure 2-10 Need to align segments to compare two time series

2.5 Subsequence Similarity Matching

Subsequence similarity matching is the retrieval of similar time series. Given a database D and a query set Q, where D and Q both include multiple time series, the length of each time series in D is larger than the length of each time series in Q. The task is, for every time series in Q, to find similar subsequences within every time series in D.

To complete this task, a subsequence similarity measure should be defined in advance. For flexibility, each discovered similar subsequence for a query time series may have a different length. The task is illustrated in Figure 2-11.

  Figure 2-11 Subsequence Similarity Matching
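For contrast with the method proposed in Chapter 3, a naive brute-force baseline for this task is sketched below: the query is slid across every position of every database series, and windows whose Euclidean distance falls below a threshold are kept. This is only a reference point, not our proposed method, and it can only compare windows of exactly the query's length.

import math

def euclidean(q, s):
    return math.sqrt(sum((qi - si) ** 2 for qi, si in zip(q, s)))

def brute_force_matches(database, query, threshold):
    # return (series index, start offset, distance) for every qualifying window
    hits = []
    for d_idx, series in enumerate(database):
        for start in range(len(series) - len(query) + 1):
            window = series[start:start + len(query)]
            dist = euclidean(query, window)
            if dist <= threshold:
                hits.append((d_idx, start, dist))
    return hits

db = [[1, 2, 3, 4, 5, 4, 3, 2, 1], [5, 4, 3, 2, 1, 2, 3, 4, 5]]
print(brute_force_matches(db, query=[3, 4, 5], threshold=0.5))   # hits at (0, 2) and (1, 6)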

Chapter 3

Proposed Subsequence Similarity Matching Method

3.1 Matching Process Overview

The whole matching process consists of two parts. In the first part, an approximation of the data is created that retains the features of the shape or trend of the original data; the original data should be faithfully represented by this approximate representation. In our work, we propose a representation called Equal Piecewise Linear Representation (EPLR). The goals of EPLR include reducing dimensionality, preserving trend information, and keeping the algorithm simple. To begin with, both the data in the database and the query data are transformed into EPLR. After this step, the process proceeds to the second part.

In the second part, EPLR similarity retrieval is performed by a two-level shape-based similarity measure. The first level matches subsequences in longer sequences that share the same major trends, which can be derived easily from EPLR; this level also acts as a kind of index that filters out many unqualified time series. The second level is a detailed match, in which the distance between the matched subsequences is calculated by a modified DTW to check whether they are really similar. Effectiveness and efficiency are important issues in this part.

3.2 Equal Piecewise Linear Representation (EPLR)

In this section, we propose a new representation, called Equal Piecewise Linear Representation, to segment a time series effectively and efficiently. EPLR is a special type of PLR, but is much simpler and easier to implement. In Section 3.2.1, we describe the concept of EPLR. In Section 3.2.2, the advantages of EPLR are discussed.

3.2.1 Concept of EPLR

3.2.1.1 Angle Representation

In EPLR, a time series of length n is divided into equal segments, each formed by k points. Unlike PAA, which uses the mean value of each segment as the new representation, EPLR applies linear regression to each equal segment, and the new dimension-reduced sequence is composed of the angles of the best-fit lines of the segments. An example of EPLR is shown in Figure 3-1. The difference between PAA and EPLR can also be seen easily in Figure 3-2. Because the representation consists of angles, normalization can be ignored entirely; the only thing we care about is the trend of the data, not the exact values.

  Figure 3-1 Example of EPLR
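The angle representation can be sketched as follows; the sketch reflects our own illustrative conventions (segments of k points, the slope obtained by least squares, and the angle taken as the arctangent of that slope in degrees).

import math
import numpy as np

def eplr_angles(series, k):
    # angle (in degrees) of the best-fit line of each non-overlapping k-point segment
    t = np.arange(k, dtype=float)
    angles = []
    for start in range(0, len(series) - k + 1, k):
        segment = np.asarray(series[start:start + k], dtype=float)
        slope, _ = np.polyfit(t, segment, deg=1)        # least-squares slope of the segment
        angles.append(math.degrees(math.atan(slope)))   # trend expressed as an angle
    return angles

print(eplr_angles([1, 2, 3, 4, 4, 3, 2, 1], k=4))       # [45.0, -45.0]: one up trend, one down trend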

3.2.1.2 Data Overlapping

Using the angles of equal-length segments as the new representation of a time series seems a good idea, but on closer inspection there is a problem that we must deal with. The problem is illustrated in Figure 3-3.

  Figure 3-3 Problem of EPLR

In Figure 3-3, we can see that some information is lost, for example when an up trend is followed by another up trend. The trend between the two up trends is uncertain: the data may rise continuously across the two segments, or it may rise first, suddenly fall near the breakpoint, and then rise again. For this reason, overlapping the data when creating a new segment is used to solve the problem. The overlapping of data is described in Figure 3-4, and EPLR with 1/2 data overlapping is shown in Figure 3-5.

By compensating for this loss of information, it becomes easy to discriminate between the two situations above. The range of overlapping data is discussed in Chapter 4.

  Figure 3-4 Data overlapping

  Figure 3-5 EPLR with 1/2 data overlapping
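With 1/2 data overlapping as in Figure 3-5, the only change to the previous sketch is the step between segment start positions: the window advances by k/2 instead of k, so consecutive segments share half of their points. A minimal variant, under the same illustrative assumptions as before:

import math
import numpy as np

def eplr_angles_overlap(series, k, step):
    # angles of k-point segments whose starts advance by `step` (step = k // 2 gives 1/2 overlap)
    t = np.arange(k, dtype=float)
    angles = []
    for start in range(0, len(series) - k + 1, step):
        slope, _ = np.polyfit(t, np.asarray(series[start:start + k], dtype=float), deg=1)
        angles.append(math.degrees(math.atan(slope)))
    return angles

data = [1, 2, 3, 4, 4, 3, 2, 1]
print(eplr_angles_overlap(data, k=4, step=2))   # [45.0, 0.0, -45.0]: the extra segment reveals the turn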

3.2.1.3 Formal Definition

Def 1: A time series S of length n is divided into m equal-length segments, and each segment is represented by the angle of the best-fit line of its data points. The resulting sequence is

S' = {s'1, s'2, …, s'i, …, s'm}

The time series is reduced from dimension n to dimension m, and each new data point represents an equal-sized set of original data points.

3.2.2 Advantages of EPLR

EPLR segments a time series into equal-length pieces with data overlap and finds the trend of each segment by linear regression. In the following, some advantages of our proposed EPLR are discussed.

1. As mentioned above, a time series is reduced from higher dimension n to lower dimension m. Although allowing data overlap may increase the size of m, m is still much smaller than n.

2. The effects of time shifting and noise can be reduced by segmentation. As for time shifting, the Euclidean distance between two time series may be huge even if one time series is simply the other shifted by a single data point to the right or left. After segmentation, this kind of difference can be ignored. The situation with noise is the same.

3. Although EPLR and PAA are both simple, fast, and easy to implement, EPLR preserves trend information better. As shown in Figure 3-2, we can intuitively see that the trends of the data are easily identified by EPLR. In addition to this subjective perception, more specific evidence is demonstrated in our experimental results.

4. Even though EPLR and PLR both discover trends of time series by linear regression, EPLR is much faster and simpler than PLR. Firstly, EPLR does not need any threshold to determine segments, since the length of the segments is fixed while the trend of the data can still be represented effectively. Secondly, existing distance functions can be applied to EPLR directly without aligning segments: two time series of the same length n are transformed by EPLR into new representations of the same length m, so one-to-one matching is possible and segment alignment is unnecessary. Finally, the time complexity of EPLR is always θ(n) because each segment only needs one linear regression. Compared to the algorithms of PLR, our proposed method is much better. The comparison between EPLR and the three algorithms of PLR is summarized in Table 3-1.

Table 3-1 Comparison between the three algorithms of PLR and EPLR

Algorithm            Define threshold    Worst-case complexity
EPLR                 No                  θ(n)
Sliding Window PLR   Yes                 θ(n²)
Top-Down PLR         Yes                 θ(n³)
Bottom-Up PLR        Yes                 θ(n²)

5. In some applications we do not need to normalize the raw data at all, because the transformed data are represented by angles. The trends of the data remain consistent no matter how the data is shifted vertically.

3.3 EPLR Similarity Retrieval

In our similarity retrieval, a sequence is represented as a series of angles of equal segments, and we find the subsequences of angles that are similar to the query. To accomplish this task, a similarity measure is necessary. In this section, a 2-level similarity measure based on EPLR is proposed to define subsequence similarity. In Section 3.3.1, the level-1 major trends match is introduced, which prunes away many unqualified time series to speed up the retrieval. Then Section 3.3.2 presents the level-2 detailed match based on a modified DTW distance.
