Chapter 1 Introduction
1.1 Motivations
Chronic liver disease is one of the major health problems in Asia. Most patients routinely take blood tests to measure the levels of liver enzymes, such as alanine aminotransferase (ALT) and aspartate aminotransferase (AST), in order to detect liver cell damage. The levels of these two enzymes are widely recognized as indicators of liver disease [3][34][49].
According to [34], elevated ALT and AST are associated with a significantly increased standardized mortality ratio. Pockros et al. [48] evaluated the effect of oral IDN-6556, an antiapoptotic caspase inhibitor, on patients with elevated ALT and AST, recording the levels of both liver enzymes as two time-series sequences. The results indicated that IDN-6556 progressively lowered the patients’ ALT and AST levels over a treatment period of 14 days.
Many diseases are believed to be associated with different indicators. If we could identify which indicators are likely to relate to a disease, we could routinely track the measurements of these indicators to prevent the illness or to predict the possible risks.
We further consider other real-world applications that call for closed patterns of multi-sequence time-series transactions. In seismology, earthquake prediction remains one of the greatest challenges. Song et al. [51] analyzed groundwater anomalies before and after the Chi-Chi Earthquake in Taiwan and found that groundwater level and groundwater chemistry can be used to forecast earthquakes. This raises the question of whether some pattern corresponds to unusual events before earthquakes. Thus, we may focus on factors such as changes in groundwater levels and changes in the concentration of anions in groundwater. We could envision a database of historical earthquake records where each record contains two sequences, the groundwater level and the anion concentration in groundwater, recorded on a daily basis for a month before an earthquake takes place. We may apply the CMP algorithm to discover meaningful patterns and expect that these patterns could help seismologists predict earthquakes.
In microelectronics, wafers are the critical components in the fabrication of semiconductor devices. To manufacture such components, many factors are taken into consideration, including temperature and humidity [42]. Manufacturers usually rely on past experience to determine the adequate temperature and humidity in order to produce high quality wafers. In order to address the issue of finding the adequate temperature
and humidity, we may apply the CMP algorithm to mine time-series patterns from a database where each record contains both temperature and humidity readings (multiple sequences) during the process of producing a wafer in a semiconductor facility.
Accordingly, based on the frequency of these patterns, we can determine the acceptable range of temperature and humidity to produce the best wafer.
Likewise, weather forecasts are required to predict deteriorating weather conditions, especially approaching tornados or hurricanes. However, making accurate predictions is difficult because many factors are involved, such as temperature, humidity, and pressure [2]. If we could recognize which conditions are likely to cause a tornado or a hurricane, we could take action before the storm arrives, reducing losses and saving lives.
These applications have inspired us to propose an algorithm, called CMP, to efficiently mine closed patterns in time-series databases, where each transaction in the database contains multiple time-series sequences. Fig. 1.1 shows an example database containing three transactions, each of which contains two time-series sequences.
Transaction 1
    Sequence 1:  -2.50   0.33   1.80  -3.90  -0.21  -0.08
    Sequence 2:  -0.41  -0.25  -0.54  -1.39   0.23  -0.88
Transaction 2
    Sequence 1:  -7.80   6.52   0.17  -3.68   0.19   0.02
    Sequence 2:  -0.06  -2.89   0.13   4.30   0.42  -0.47
Transaction 3
    Sequence 1:  -2.42   5.37  -1.84  -3.55   0.32  -4.64
    Sequence 2:  -0.42   3.12   0.11  -0.99   0.39  -5.97
Fig. 1.1. A database containing three multi-sequence transactions.
The CMP algorithm consists of three phases. First, we transform every sequence in the time-series database into the Symbolic Aggregate approXimation (SAX) representation [37]. Second, we scan the transformed database to find all frequent
1-patterns (the patterns of length one), and build a projected database for each frequent 1-pattern, where the projected database of a pattern comprises all transactions containing the pattern. Third, for each frequent pattern found, we recursively use its projected database to generate its frequent super-patterns in a depth-first search manner.
Moreover, we apply some closure checking and pruning strategies to prune frequent but non-closed patterns during the mining process. By using the projected databases, the CMP algorithm can localize support counting, candidate pruning, and closure checking in the projected databases. Therefore, it can efficiently mine the closed patterns from a time-series database.
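The second phase can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not state: a 1-pattern of a multi-sequence transaction is taken to be the column of symbols (one per sequence) observed at a single time point, and the support of a pattern is the number of transactions containing it. The symbolic database is Fig. 1.1 discretized with the breakpoints -0.43 and 0.43 that this chapter later uses as an example; the names `DATABASE`, `frequent_1_patterns`, and `min_sup` are illustrative only, and closure checking and pruning are not shown.

```python
from collections import defaultdict

# Fig. 1.1 after symbolization (assumed breakpoints -0.43 and 0.43:
# a below -0.43, c at or above 0.43, b otherwise).
# Each transaction is a list of equally long symbol sequences.
DATABASE = [
    ["abcabb", "bbaaba"],  # Transaction 1
    ["acbabb", "babcba"],  # Transaction 2
    ["acaaba", "bcbaba"],  # Transaction 3
]


def frequent_1_patterns(database, min_sup):
    """Return {pattern: projected database} for every frequent 1-pattern.

    A 1-pattern is assumed to be a tuple holding one symbol per sequence
    at the same time point. The projected database is kept as a mapping
    {transaction index: [positions where the pattern occurs]}.
    """
    projected = defaultdict(dict)
    for tid, sequences in enumerate(database):
        # zip(*sequences) walks the transaction column by column.
        for pos, column in enumerate(zip(*sequences)):
            projected[column].setdefault(tid, []).append(pos)
    # Support = number of transactions that contain the pattern.
    return {p: occ for p, occ in projected.items() if len(occ) >= min_sup}


if __name__ == "__main__":
    for pattern, occurrences in sorted(frequent_1_patterns(DATABASE, 3).items()):
        print(pattern, occurrences)
```

Under these assumptions and a minimum support of 3, only the columns (a, b) and (b, b) are frequent. Keeping the occurrence positions with each pattern is what allows later growth steps to look only inside the projected database rather than rescanning the whole database.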
Most previous studies [45][61][66] require exact sequence alignments when discovering patterns from sequence databases. Consequently, they cannot deal with noisy environments, such as a database whose sequences deviate from one another. To address this issue, we allow flexible gaps between the items in a pattern. For example, a[0, 3]b is a flexible pattern where the number of gaps between a and b ranges from 0 to 3. Mining such sequential patterns with flexible gaps can give decision-makers a broader view of the data. For example, given a web log recording the URL that each user visits each minute, a web analyst would like to determine how many users spend less than 5 minutes browsing web page A before switching to web page B.
This can be denoted as a flexible pattern A[0, 5]B, which means that users who are browsing web page A are likely to move to web page B within 5 minutes. However, the previously proposed approaches cannot mine such a pattern, since they can only mine patterns showing that users browse web page A before web page B [35].
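To make the gap notation concrete, the following Python sketch checks whether a sequence supports a two-item flexible pattern such as a[0, 3]b or A[0, 5]B. It assumes that a gap is one item lying strictly between the two pattern items, so x[lo, hi]y matches when y occurs with between lo and hi items after x and before y. The function names and the small web log are hypothetical and are not taken from the CFP algorithm itself.

```python
def matches_flexible(sequence, first, second, min_gap, max_gap):
    """Return True if `first` is followed by `second` with between
    min_gap and max_gap items lying strictly between them."""
    for i, item in enumerate(sequence):
        if item != first:
            continue
        # A match at position j leaves exactly j - i - 1 items in between,
        # so the allowed positions are i + 1 + min_gap .. i + 1 + max_gap.
        start = i + 1 + min_gap
        stop = i + 1 + max_gap
        if second in sequence[start:stop + 1]:
            return True
    return False


def support(sequences, first, second, min_gap, max_gap):
    """Count the sequences that contain the flexible pattern."""
    return sum(
        matches_flexible(seq, first, second, min_gap, max_gap)
        for seq in sequences
    )


if __name__ == "__main__":
    # Hypothetical per-minute page visits of four users.
    web_log = [
        ["A", "A", "B", "C"],
        ["A", "C", "C", "C", "C", "C", "C", "B"],
        ["B", "A", "C", "B"],
        ["C", "B", "A"],
    ]
    print(support(web_log, "A", "B", 0, 5))
```

In this toy log the second user reaches page B only after six intervening visits, so that sequence does not support A[0, 5]B, whereas an approach that only checks "A before B" would count it.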
The subprime mortgage crisis in the United States has had a major impact on house prices and stock markets worldwide [5]. How the house price index has varied since the outbreak of the subprime crisis is a major concern for consumers. In this case, we could analyze the house price index data to see whether there is a significant variation in house prices across cities in the United States. Since the timing of house
price declines or rises differs among cities, traditional pattern mining may discover no patterns. By allowing flexibility, we aim to give consumers a big picture of the underlying trend in house prices.
Inspecting whether the crime rate rises or falls in the same trend among different cities in a country gives the police a clue to the major causes, such as unemployment, economic recession, or drug prevalence. A flexible pattern found could be “the rise in crime in southern cities lags behind the rise in the northern cities’ crime rate by several months when the unemployment rate drops sharply”. However, traditional data mining techniques, such as clustering analysis, association rule mining, and sequential pattern mining, are not capable of retrieving such information [9].
Effective inventory control is vital to business success in many industries. For example, predicting the bestselling books at the right time and ensuring sufficient inventory have a major impact on the profitability of a book store. An interesting flexible pattern could be “if a book has been made into a movie, it is likely that the movie will revive the sales of the book; the book would become a bestseller one or two weeks after the movie is in theatres”. Since none of the previously developed algorithms considers flexibility in the mining process, they are not capable of discovering such a pattern.
The problems outlined above have motivated us to extend closed pattern mining in a second direction: our second algorithm, called CFP, is designed to mine closed flexible patterns in a time-series database.
The CFP algorithm mines closed flexible patterns in a time-series database in three phases. Given a time-series database, we first transform it into a symbolic database based on the SAX representation [37]. From the transformed database, we then recursively mine closed flexible patterns in a depth-first search manner. Specifically, we first obtain all frequent patterns of length one (frequent 1-patterns) and, meanwhile, build a projected database for each frequent 1-pattern. Then, we grow each frequent k-pattern p
by joining it to each nominee in p’s projected database, where a k-pattern is a pattern of length k, a nominee is a frequent 1-pattern in the projected database, and the number of gaps between p and the nominee is bounded by a user-specified maximum gap threshold.
The process is repeated until no more closed flexible patterns can be generated.
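One growth step of this kind might look like the following Python sketch. It is only an interpretation of the description above, under several assumptions: each transaction is a single symbolic sequence, the projected database of p is stored as the positions where an occurrence of p ends in each sequence, and a nominee is any symbol that follows such an end position, with at most `max_gap` skipped items, in at least `min_sup` sequences. The function `grow` and its parameters are hypothetical names, and closure checking is omitted.

```python
from collections import defaultdict


def grow(sequences, ends, max_gap, min_sup):
    """Extend a pattern by one symbol.

    `ends` maps a sequence index to the positions where the current
    pattern ends in that sequence (its projected database). Returns
    {nominee symbol: new projected database} for every symbol that
    follows the pattern, with at most `max_gap` skipped items, in at
    least `min_sup` sequences.
    """
    candidates = defaultdict(lambda: defaultdict(set))
    for sid, positions in ends.items():
        seq = sequences[sid]
        for end in positions:
            # Allowed next positions: end + 1 .. end + 1 + max_gap.
            for nxt in range(end + 1, min(end + 2 + max_gap, len(seq))):
                candidates[seq[nxt]][sid].add(nxt)
    return {
        sym: {sid: sorted(pos) for sid, pos in proj.items()}
        for sym, proj in candidates.items()
        if len(proj) >= min_sup
    }


if __name__ == "__main__":
    sequences = ["abcb", "acb", "bca"]
    ends_of_a = {0: [0], 1: [0], 2: [2]}  # where the 1-pattern "a" ends
    print(grow(sequences, ends_of_a, max_gap=1, min_sup=2))
```

Calling `grow` again on each returned projected database gives the depth-first recursion; it stops when a call returns no nominees.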
Both the CMP and CFP algorithms transform time-series sequences into symbolic sequences in the first phase. Although analyzing symbolic sequences reduces noise and eases the mining process, these approaches may lose patterns, and the sequences supporting the same pattern may look quite different.
Fig. 1.2. Multi-sequence time-series X, Y, and Z. Panels (a), (b), and (c): X, Y, and Z in the high resolution; panels (d), (e), and (f): X, Y, and Z in the low resolution.
In the SAX representation, two close values may be mapped to different symbols if a breakpoint lies between them. For example, the values 0.42 and 0.43 are very close to each other, but they are assigned to different symbols if the breakpoint is set to 0.43.
Let the breakpoints be -0.43 and 0.43 in the SAX representation, that is, we assign symbol a to each value less than -0.43, symbol b to each value greater than or equal to -0.43 and less than 0.43, and symbol c to each value greater than or equal to 0.43. As
shown in Figs. 1.2a, 1.2b, and 1.2c, the three multi-sequences X, Y, and Z support the same symbolic pattern. However, multi-sequence Z is quite different from X and Y.
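The symbol assignment described above can be sketched in a few lines of Python. The sketch covers only the value-to-symbol step with the example breakpoints -0.43 and 0.43. The SAX representation [37] also normalizes the series and averages it piecewise before this step, which is omitted here. The names `BREAKPOINTS`, `SYMBOLS`, and `to_symbols` are illustrative.

```python
from bisect import bisect_right

BREAKPOINTS = [-0.43, 0.43]  # example breakpoints from the text
SYMBOLS = "abc"              # a: below -0.43, b: [-0.43, 0.43), c: 0.43 and above


def to_symbols(values, breakpoints=BREAKPOINTS, symbols=SYMBOLS):
    """Map each numerical value to the symbol of the interval it falls in.

    bisect_right counts the breakpoints that are <= the value, so a value
    equal to a breakpoint is assigned to the interval above it.
    """
    return "".join(symbols[bisect_right(breakpoints, v)] for v in values)


if __name__ == "__main__":
    print(to_symbols([0.42, 0.43]))  # two close values, two different symbols
    print(to_symbols([-2.50, 0.33, 1.80, -3.90, -0.21, -0.08]))
```

The first call shows the boundary effect discussed above: 0.42 maps to b while 0.43 maps to c, although the two values differ by only 0.01.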
To overcome this issue with symbolic sequence analysis, we are motivated to mine closed patterns directly from the raw data. That is, we mine the patterns without transforming numerical values into symbols. For the sequences shown in Figs. 1.2a and 1.2b, X and Y support a common numerical pattern since both sequences of X and Y are close to each other.
Visualizing data at multiple resolutions helps decision-makers make high-quality decisions. The multi-resolution views provided by the discrete wavelet transform (DWT), such as the Haar wavelet transform, improve visualization and make patterns, trends, surprises, and relationships easier to identify [50]. Figs. 1.2a, 1.2b, and 1.2c show time-series X, Y, and Z in the high resolution, respectively, whereas Figs. 1.2d, 1.2e, and 1.2f show the transformed time-series X, Y, and Z in the low resolution, respectively. As observed, different resolutions capture different perspectives of the data: the time-series in the low resolution show overall trends, whereas those in the high resolution contain fluctuations and reveal more detailed information. This observation has motivated us to integrate multi-resolution visualization into the mining process.
Therefore, we propose an algorithm, called CNP, for mining multi-resolution closed numerical patterns in a time-series database, where each time-series consists of multiple sequences. Given a time-series database, we first apply the Haar wavelet transform [19] to convert each time-series in the database into a sequence in the low resolution. Second, we find all frequent patterns of length one (frequent 1-patterns) from the transformed database (low resolution). Third, we recursively extend each frequent k-pattern to form frequent (k+1)-patterns. Subsequently, we restore closed patterns back to the high (original) resolution. Finally, we obtain all closed patterns in the low and
high resolutions. The advantage of applying a wavelet transform on data is that we can view data from different perspectives by discovering patterns in different resolutions.
Moreover, we show that the wavelet transform speeds up the mining process while yielding the same final results as mining without it.
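As a rough illustration of moving between resolutions, the Python sketch below performs one level of a Haar transform and its inverse. It assumes the unnormalized averaging variant, in which each pair of neighbouring values is replaced by its average (the low-resolution value) and half its difference (the detail needed to restore the original). The text does not specify which normalization the CNP algorithm uses, and an orthonormal Haar transform would scale by the square root of 2 instead. The function names are hypothetical.

```python
def haar_step(values):
    """One level of the unnormalized Haar transform on an even-length list.

    Returns (averages, details): the averages form the low-resolution
    sequence, and the details allow exact restoration.
    """
    if len(values) % 2:
        raise ValueError("expected an even number of values")
    pairs = list(zip(values[0::2], values[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details


def haar_inverse(averages, details):
    """Restore the high-resolution sequence from averages and details."""
    restored = []
    for avg, det in zip(averages, details):
        restored.extend([avg + det, avg - det])
    return restored


if __name__ == "__main__":
    # Sequence 1 of Transaction 1 in Fig. 1.1.
    series = [-2.50, 0.33, 1.80, -3.90, -0.21, -0.08]
    low, detail = haar_step(series)
    print(low)                        # half as many values, overall trend
    print(haar_inverse(low, detail))  # original values, up to rounding
```

Because the low-resolution sequence is half as long, mining it first involves fewer positions, and keeping the details makes it possible to restore patterns to the original resolution afterwards.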
To motivate the study of the CNP algorithm, let us consider the following examples. Suppose a hospital applies the CNP algorithm to the medical data related to disease X in order to find significant patterns about symptoms at different resolutions. A pattern found in the low resolution could be “body temperature movements appear to be cyclic and meanwhile the blood pressure remains high in the first few weeks”, whereas a pattern found in the high resolution could be “body temperature remains above normal for four consecutive days, drops into the normal range for five consecutive days, and then increases again; meanwhile, the blood pressure fluctuates little throughout the day but shows an increasing trend overall”. To diagnose whether a patient has disease X, a doctor may first use the low-resolution pattern to rule out the disease if the patient’s symptoms do not conform to the pattern. Otherwise, the doctor may take a closer look at the high-resolution pattern to confirm the diagnosis.
Now consider a use of the CNP algorithm in the business world. Corporation W has over one hundred fast-food chain stores. To improve operational efficiency and assist managers in decision making, data analysts apply the CNP algorithm to find patterns in a database where each record represents a branch and contains two sequences, the number of customers and the delay in fulfilling meal orders, both recorded on an hourly basis for each day. Since the CNP algorithm is able to generate patterns at multiple resolutions, decision-makers can focus on the level of detail that interests them. For instance, the chief executive officer (CEO) of the corporation may be interested only in a broader view of the data, and hence he/she focuses on the patterns found in the low resolution,
whereas the branch managers may care about more detailed information in the data and thus concentrate on the patterns found in the high resolution. These patterns may help the corporation make staffing decisions and improve service delivery efficiency. As the above examples show, mining patterns at multiple resolutions can be used in many real-world applications.
The choice between symbolic and numerical methods depends on the visualizations required by the users or by the application area. For instance, we may apply the symbolic method to discover patterns in finance or meteorology because users simply demand an overview of trends, whereas in medical science or disaster forecasting (e.g., hurricanes), the numerical method may be adopted because doctors or experts need rigorous patterns that can help them diagnose diseases or predict disasters.