RELATED WORK - A Study on Discovering Statistically Significant Difference Patterns

According to our survey, no previous research directly addresses the significant difference pattern detection problem. In this chapter, we therefore review four related areas: traditional quantitative research and its limitations; indicators, which are used in data warehouses to highlight differences; deviation detection, which finds patterns that depart from a predicted trend; and feature selection. None of them solves this problem.

2.1. Quantitative research

Quantitative research techniques [11] [23] are part of primary research. The data, which are reported numerically, can be collected through structured interviews, experiments, or surveys.

Quantitative research is all about quantifying relationships between variables.

Variables are things like weight, performance, time, and treatment. You measure variables on a sample of subjects, which can be tissues, cells, animals, or humans.

You express the relationship between variables using effect statistics, such as correlations, relative frequencies, or differences between means.

Hypothesis testing is the most popular statistical method [11] for analyzing the relationship between variables. Researchers are most concerned with differences between means, because such a difference can immediately and effectively suggest causality in the subject under study. For example, if the difference in grades between male and female students is statistically significant at a given 1-α confidence level, then we say that there is a significant difference in grades between the genders.
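As an illustration of testing a difference between means, the following minimal sketch runs an exact two-sided permutation test in Python. The permutation test is only one of several possible tests and is chosen here because it needs nothing beyond the standard library; the grade values, the group names, and the function name `permutation_test` are hypothetical and do not come from the cited work.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test for a difference between two means.

    Returns (observed_difference, p_value).
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = mean(group_a) - mean(group_b)

    extreme = 0
    count = 0
    # Enumerate every way of relabelling n_a of the pooled values as group A.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        # Count relabellings at least as extreme as the observed difference.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
        count += 1
    return observed, extreme / count


# Hypothetical grades for two groups of students (illustrative data only).
grades_female = [88, 92, 79, 85, 91, 87]
grades_male = [75, 80, 72, 78, 83, 70]
alpha = 0.05  # significance level, i.e. a 1 - alpha = 95% confidence level

diff, p_value = permutation_test(grades_female, grades_male)
significant = p_value < alpha
print(f"difference in means = {diff:.2f}, p = {p_value:.4f}, significant = {significant}")
```

With these made-up numbers the null hypothesis of equal means is rejected at the 95% confidence level. Exhaustive enumeration is only practical for small samples; larger samples would normally use a t-test or a randomly sampled set of permutations.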

In traditional quantitative research, researchers first propose several hypotheses about the subject based on their experience, and then test the hypotheses one by one to check whether any of them is statistically significant. Hence, this is called a trial-and-error approach. The quality of the result naturally depends on the hypotheses that the researchers propose.

Furthermore, the aim of the quantitative researcher is to determine the relationship between one thing and another in a population. Quantitative research designs are either descriptive (subjects usually measured once) or experimental (subjects measured before and after a treatment). An experiment establishes causality.

For an accurate estimate of the relationship between variables, a descriptive study usually needs a sample of hundreds or even thousands of subjects; an experiment, especially a crossover, may need only tens of subjects. The estimate of the relationship is unlikely to be biased if researchers have a high participation rate in a sample selected randomly from a population. In experiments, bias is also unlikely if subjects are randomly assigned to treatments, and if subjects and researchers are blind to the identity of the treatments. In all studies, subject characteristics can affect the relationship under investigation. Researchers limit their effect either by using a less heterogeneous sample of subjects or, preferably, by measuring the characteristics and including them in the analysis. In an experiment, researchers try to measure variables that might explain the mechanism of the treatment. In an unblinded experiment, such variables can help define the magnitude of any placebo effect.

2.2. Indicator

An indicator [17] [18] is not used in discovery-based analysis, but it is a useful tool for exploring the data cube of a data warehouse through OLAP. To implement indicators, a complete datacube must be constructed. In practice, building datacubes is very time consuming [1] [2] [4] [5]. Hence, using indicators is a computationally expensive task.

A data warehouse may consist of a single datacube or several datacubes. Each datacube holds a number of records and has a star schema that describes its structure. In other words, the star schema describes the dimensions, with their concept hierarchies, and the measures of the datacube. The data warehouse also supports an analysis tool, On-Line Analytical Processing (OLAP) [2] [19] [22], which helps users explore the datacube. OLAP can organize and present data in various formats in order to accommodate the diverse needs of different analysis approaches. An OLAP server provides several operations for analyzing a multidimensional data cube:

Roll-up: the roll-up operation collapses the dimension hierarchy along one or more dimensions so as to present the remaining dimensions at a coarser level of granularity.

Drill-down: in contrast, the drill-down operation allows users to obtain a more detailed view of a given dimension.

Slice: the objective is to extract a slice of the original cube corresponding to a single value of a given dimension. No aggregation is required with this option. Instead, the server allows the user to focus on the desired values.

Dice: a related operation is the dice. In this case, users can define a sub-cube of the original space. In other words, by specifying value ranges on one or more dimensions, the user can highlight meaningful blocks of aggregated data.

Pivot: the pivot is a simple but effective operation that allows OLAP users to visualize cube values in more natural and intuitive ways.
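The five operations can be sketched on a tiny in-memory cube. This is an illustrative Python sketch only, not the interface of any real OLAP server: the fact records, the dimension names (`year`, `quarter`, `region`), the `sales` measure, and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact records: three dimensions and one measure (sales).
FACTS = [
    {"year": 2023, "quarter": "Q1", "region": "North", "sales": 100},
    {"year": 2023, "quarter": "Q1", "region": "South", "sales": 80},
    {"year": 2023, "quarter": "Q2", "region": "North", "sales": 120},
    {"year": 2023, "quarter": "Q2", "region": "South", "sales": 90},
    {"year": 2024, "quarter": "Q1", "region": "North", "sales": 110},
    {"year": 2024, "quarter": "Q1", "region": "South", "sales": 70},
]


def aggregate(records, dims, measure="sales"):
    """Sum the measure over all records that share the same values on dims."""
    totals = defaultdict(int)
    for record in records:
        totals[tuple(record[d] for d in dims)] += record[measure]
    return dict(totals)


def slice_cube(records, dim, value):
    """Slice: keep the records with a single value on one dimension."""
    return [r for r in records if r[dim] == value]


def dice_cube(records, allowed):
    """Dice: keep the records whose values fall in the allowed set per dimension."""
    return [r for r in records if all(r[d] in vals for d, vals in allowed.items())]


def pivot(records, row_dim, col_dim, measure="sales"):
    """Pivot: arrange aggregated values as a row-by-column table."""
    table = defaultdict(dict)
    for (row, col), total in aggregate(records, [row_dim, col_dim], measure).items():
        table[row][col] = total
    return dict(table)


# Roll-up: aggregate to the coarser "year" level.
by_year = aggregate(FACTS, ["year"])
# Drill-down: return to the finer "year, quarter" level.
by_year_quarter = aggregate(FACTS, ["year", "quarter"])
north_only = slice_cube(FACTS, "region", "North")
sub_cube = dice_cube(FACTS, {"year": {2023}, "quarter": {"Q1", "Q2"}})
region_by_year = pivot(FACTS, "region", "year")

print(by_year)
print(region_by_year)
```

In this sketch roll-up and drill-down are the same aggregation applied at different levels of the hierarchy, which is why a single `aggregate` function serves both.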

The data warehouse also supports another analysis tool: On-Line Analytical Mining (OLAM) [2] [6] [12]. OLAM integrates OLAP with data mining and knowledge discovery on multidimensional database structures. Hence, OLAM is also called OLAP Data Mining. OLAM supports several data mining tasks, such as concept description, association rule mining, classification and prediction, and time-sequence analysis.

Typically, a data mining algorithm is run on a single “data mining” table. This table is produced by applying transformations and aggregations to the base data, and it usually has to be generated specifically for the task. This transformation is a key part of the data mining process. It is often a manual process, and the elapsed time for locating, migrating, and transforming the data is orders of magnitude greater than the computing time involved. It is therefore important that effective tools support this process.
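The transformation from base data to a single mining table can be illustrated with a small Python sketch. The order records, the column names, and the function `build_mining_table` are hypothetical; the sketch only shows the kind of aggregation involved, here one row per customer.

```python
from collections import defaultdict

# Hypothetical base data: one record per order.
orders = [
    {"customer": "c1", "amount": 20.0, "category": "book"},
    {"customer": "c1", "amount": 35.5, "category": "music"},
    {"customer": "c2", "amount": 12.0, "category": "book"},
    {"customer": "c1", "amount": 10.0, "category": "book"},
    {"customer": "c2", "amount": 8.0, "category": "book"},
]


def build_mining_table(records):
    """Aggregate order-level records into one row per customer."""
    rows = defaultdict(lambda: {"n_orders": 0, "total_amount": 0.0, "categories": set()})
    for record in records:
        row = rows[record["customer"]]
        row["n_orders"] += 1
        row["total_amount"] += record["amount"]
        row["categories"].add(record["category"])
    # Flatten to plain rows, sorted by customer for a stable result.
    return [
        {
            "customer": customer,
            "n_orders": row["n_orders"],
            "total_amount": row["total_amount"],
            "n_categories": len(row["categories"]),
        }
        for customer, row in sorted(rows.items())
    ]


mining_table = build_mining_table(orders)
for row in mining_table:
    print(row)
```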

However, neither OLAP nor OLAM is a discovery-based analysis tool, and neither can detect significant difference patterns automatically or semi-automatically.

2.3. Deviation Detection

Deviation detection [13] is a line of research that aims to detect patterns that differ from a predicted pattern. A mathematical model is used to predict the trend of the measures, and the difference between the predicted trend and the observed measure is then used to determine the deviation. For example, if we predict that a man is taller than a woman, then deviation detection will detect the pattern that there exists a woman who is taller than a man. Similarly, if we predict that the profit of products may be $10,000, then deviation detection will detect the pattern that there exists a product, such as the cell phone, whose profit deviates strongly from the prediction. It may be $15,000 or $5,000.
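The profit example can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the prediction is a constant $10,000 for every product, the threshold of $2,000 is arbitrary, and the product names other than the cell phone, the profit figures, and the function name `detect_deviations` are hypothetical. The methods in [13] may use a more elaborate prediction model.

```python
def detect_deviations(observed, predicted, threshold):
    """Return {key: deviation} for entries whose |observed - predicted| exceeds threshold."""
    deviations = {}
    for key, value in observed.items():
        deviation = value - predicted[key]
        if abs(deviation) > threshold:
            deviations[key] = deviation
    return deviations


# Hypothetical observed profits per product, in dollars.
profits = {"laptop": 10_400, "tablet": 9_700, "cell phone": 15_000, "camera": 5_000}
# Assumed prediction model: every product is expected to earn $10,000.
predicted = {name: 10_000 for name in profits}

flagged = detect_deviations(profits, predicted, threshold=2_000)
print(flagged)
```

Only the products whose profit lies more than $2,000 away from the prediction are reported; the sign of the deviation shows whether the product is above or below the predicted value.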

2.4. Feature Selection

Feature selection [6] [14] [15], also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data is selected for application of a learning algorithm.

Feature selection is necessary either because it is computationally infeasible to use all available features, or because of problems of estimation when limited data samples (but a large number of features) are present. The latter problem is related to the so-called curse of dimensionality.

Simple feature selection algorithms are ad hoc, but there are also more methodical approaches. From a theoretical perspective, it can be shown that optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality. If large numbers of features are available, this is impractical. For practical supervised learning algorithms, the search is for a satisfactory set of features instead of an optimal set. Many popular approaches are greedy hill climbing approaches. Such an approach evaluates a possible subset of features and then modifies that subset to see if an improved subset can be found.
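A greedy hill climbing search of this kind can be sketched as forward selection in Python. The scoring function below is a toy stand-in: the feature names, the usefulness values, the redundancy penalty, and the per-feature cost are all hypothetical. In practice the score would be something like the cross-validated accuracy of a learner trained on the candidate subset.

```python
def greedy_forward_selection(features, score, max_features=None):
    """Add one feature at a time, keeping a change only if the score improves."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate every subset obtained by adding one more feature.
        candidates = [(score(selected + [f]), f) for f in remaining]
        cand_score, cand_feature = max(candidates, key=lambda c: c[0])
        if cand_score <= best_score:
            break  # no single addition improves the subset: local optimum
        selected.append(cand_feature)
        remaining.remove(cand_feature)
        best_score = cand_score
    return selected, best_score


# Toy scoring function (assumption for illustration only).
USEFULNESS = {"age": 0.30, "income": 0.25, "zip_code": 0.02, "income_copy": 0.25}
REDUNDANT_PAIRS = [{"income", "income_copy"}]


def toy_score(subset):
    total = sum(USEFULNESS[f] for f in subset)
    for pair in REDUNDANT_PAIRS:
        if pair <= set(subset):
            total -= 0.25  # a duplicated feature adds no new information
    return total - 0.05 * len(subset)  # small cost per selected feature


selected, best = greedy_forward_selection(list(USEFULNESS), toy_score)
print(selected, round(best, 2))
```

The search stops at a local optimum, so the result is a satisfactory subset rather than a provably optimal one, which matches the trade-off described above.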

Evaluation of subsets can be done in many ways: some metric is used to score the features, and possibly the combination of features. Since exhaustive search is generally impractical, at some stopping point, the subset of features with the highest scores by the metric will be selected. The stopping point varies by algorithm.

Two popular metrics for classification problems are correlation and mutual information. These metrics are computed between a candidate feature (or set of features) and the desired output category.
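Both metrics can be computed directly from their definitions. The sketch below, in Python, uses Pearson correlation and mutual information in bits for discrete values; the feature and label vectors are hypothetical and chosen so that one feature is perfectly informative and the other carries no information about the class.

```python
from collections import Counter
from math import log2, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )


# Hypothetical binary class labels and two candidate features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
feature_a = [0, 0, 0, 0, 1, 1, 1, 1]  # identical to the labels
feature_b = [0, 1, 0, 1, 0, 1, 0, 1]  # independent of the labels

for name, feature in [("feature_a", feature_a), ("feature_b", feature_b)]:
    r = pearson(feature, labels)
    mi = mutual_information(feature, labels)
    print(f"{name}: correlation = {r:.2f}, mutual information = {mi:.2f} bits")
```

A feature selection procedure would rank `feature_a` above `feature_b` under either metric.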

In statistics, the most popular form of feature selection is called stepwise regression. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm.

In machine learning, the decision of when to stop is typically made by cross-validation. In statistics, some criterion is optimized instead.
