LITERATURE REVIEW - Application of Improved Regression Trees to Semiconductor Yield Improvement

2.1 Using CPD to detect the mean-shift problem

CPD (statistical Change-Point Detection) was proposed by Dr. Wayne A. Taylor (2000a) as a method for detecting change points in a data series; see http://www.variation.com/cpa/tech/changepoint.html. In that paper, the method is introduced through an example using US trade deficit data.

Taylor (2000a) uses this procedure to find the location of each change when performing change analysis. It relies mainly on two tools: the cumulative sum chart (CUSUM) and the permutation test.

In this paper, the confidence level for the permutation tests is 95%. If the test result is significant, Taylor uses CUSUM to find the location of the mean-shift.

Suppose the data are $Y_i$, $i = 1, \ldots, N$, and the significance level is $\alpha$. First, Taylor calculates the cumulative sums

$$S_i = S_{i-1} + (Y_i - \bar{Y}), \quad S_0 = 0 \qquad (2.1)$$

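Then he calculates the value $S_{\mathrm{diff}}^{0}$, where

$$S_{\mathrm{diff}}^{0} = S_{\max} - S_{\min}, \quad S_{\max} = \max_{i} S_i, \quad S_{\min} = \min_{i} S_i$$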

Taylor repeats the above calculation B times on permuted data and obtains the values $S_{\mathrm{diff},1}, S_{\mathrm{diff},2}, \ldots, S_{\mathrm{diff},B}$. By counting the cases with $S_{\mathrm{diff},i} > S_{\mathrm{diff}}^{0}$, he obtains the significance of the test and decides whether the data have changed at some position. If so, he finds the location of the mean-shift by CUSUM and separates the data into two sections.

$$S_m = \max_{i} S_i \qquad (2.2)$$
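As a minimal sketch of the CUSUM calculation in (2.1) and (2.2), the Python below computes the cumulative sums, their range, and a candidate shift location for a small made-up series. The function names and the data are illustrative assumptions. The location is taken at the largest absolute CUSUM value, an assumption made here so that upward and downward shifts are both handled.

```python
from statistics import mean


def cusum(values):
    """Return [S_0, ..., S_N] with S_0 = 0 and S_i = S_{i-1} + (Y_i - mean)."""
    y_bar = mean(values)
    sums = [0.0]
    for y in values:
        sums.append(sums[-1] + (y - y_bar))
    return sums


def s_diff(values):
    """Range of the CUSUM path, S_max - S_min."""
    sums = cusum(values)
    return max(sums) - min(sums)


def shift_location(values):
    """Index m with the largest |S_m| (assumption: absolute value is used)."""
    sums = cusum(values)
    return max(range(len(sums)), key=lambda i: abs(sums[i]))


# Made-up series: the mean moves from 1 to 5 after the fourth point.
data = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
print(cusum(data))           # [0.0, -2.0, -4.0, -6.0, -8.0, -6.0, -4.0, -2.0, 0.0]
print(s_diff(data))          # 8.0
print(shift_location(data))  # 4
```

Here the CUSUM path falls while the observations are below the overall mean and rises afterwards, so its extreme point marks the last observation before the shift.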

Finally, he repeats the procedure on each section until all the changes in the trend are found. The algorithm is as follows.

1. Given the significance level $\alpha$, call all of the data the resource data.

2. Take the resource data as input data 0.

3. Calculate $S_{\mathrm{diff}}^{0} = S_{\max} - S_{\min}$, where $S_i = S_{i-1} + (Y_i - \bar{Y})$.

4. Permute the input data B times to get B new sequences of the input data.

5. Repeat step 3 on each permuted sequence to get $S_{\mathrm{diff},1}, S_{\mathrm{diff},2}, \ldots, S_{\mathrm{diff},B}$.

6. Count the cases with $S_{\mathrm{diff},i} > S_{\mathrm{diff}}^{0}$.

7. The significance level is (count / B) × 100%.

8. If the significance level exceeds $\alpha$, there is no mean-shift in this section; break.

9. Use the CUSUM chart, $S_m = \max_{i} S_i$, to find the location of the mean-shift.

10. Record the location of the mean-shift, and partition the data into input data 1 and input data 2.

11. Take input data 1 and input data 2 individually as input data 0, and repeat steps 3 to 10.

12. Collect all of the locations of the mean-shift.
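The steps above can be sketched in Python as follows. This is a hedged illustration rather than Taylor's implementation: the names `cusum_range_and_peak` and `detect_shifts`, the `min_size` guard, the fixed random seed, and the use of the largest absolute CUSUM value to locate the shift are all assumptions introduced here.

```python
import random
from statistics import mean


def cusum_range_and_peak(values):
    """Return (S_max - S_min, m), where m is the index with the largest |S_m|."""
    y_bar = mean(values)
    s = s_max = s_min = 0.0  # S_0 = 0
    peak_abs, peak_index = 0.0, 0
    for i, y in enumerate(values, start=1):
        s += y - y_bar
        s_max, s_min = max(s_max, s), min(s_min, s)
        if abs(s) > peak_abs:
            peak_abs, peak_index = abs(s), i
    return s_max - s_min, peak_index


def detect_shifts(values, alpha=0.05, n_perm=1000, min_size=4, rng=None, offset=0):
    """Recursively return the positions after which the mean appears to shift."""
    rng = rng or random.Random(0)        # fixed seed: an assumption for repeatability
    if len(values) < 2 * min_size:       # hypothetical guard against tiny sections
        return []
    observed, m = cusum_range_and_peak(values)
    if observed == 0.0 or m in (0, len(values)):
        return []                        # constant section: nothing to split
    shuffled = list(values)
    count = 0
    for _ in range(n_perm):              # steps 4 to 6: permute and count
        rng.shuffle(shuffled)
        if cusum_range_and_peak(shuffled)[0] > observed:
            count += 1
    if count / n_perm > alpha:           # steps 7 and 8: not significant, stop
        return []
    # Steps 9 to 11: split at m and repeat on both sections.
    left = detect_shifts(values[:m], alpha, n_perm, min_size, rng, offset)
    right = detect_shifts(values[m:], alpha, n_perm, min_size, rng, offset + m)
    return left + [offset + m] + right


# Made-up series: 20 points around 0 followed by 20 points around 5.
noise = [0.1, -0.2, 0.3, -0.1, 0.0, 0.2, -0.3, 0.1, -0.1, 0.0] * 2
series = noise + [5.0 + e for e in noise]
print(detect_shifts(series))
```

A returned position m means that the first section ends at the m-th observation. The recursion stops in a section as soon as the permutation test is not significant there.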

2.2 Using regression trees to detect the mean-shift problem

2.2.1 Introduction of regression trees

In 1963, Morgan and Sonquist proposed Automatic Interaction Detection (AID), which obtains the optimum model by minimizing the mean square error. The development of the regression tree can be traced through Morgan (1964), Sonquist (1970), Sonquist and Morgan (1973), Fielding (1977), and Van Eck (1980) to Leo Breiman, Jerry Friedman, Charles J. Stone, and Richard Olshen, who proposed CART in 1984.

CART is an algorithm that separates data by using a binary decision tree. It divides each parent node into two child nodes and applies this recursive partitioning from the top down to build a complete tree. Figure 5 gives a graphical representation of how the tree is constructed.

Figure 5 Construction of a tree

Suppose we have data $(x_i, y_i)$, $i = 1, \ldots, N$, with $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$. The algorithm needs to decide the split points automatically. Suppose we have a partition into K regions $R_1, R_2, \ldots, R_K$, and we model the response as a constant $c_m$ in each region:

$$f(x) = \sum_{m=1}^{K} c_m I(x \in R_m) \qquad (2.3)$$

If we adopt minimization of the sum of squares as the criterion for the split rule, we use $\hat{c}_m$ to estimate $c_m$, where $\hat{c}_m$ is the average of the $y_i$ in the region $R_m$:

$$\hat{c}_m = \mathrm{average}(y_i \mid x_i \in R_m) \qquad (2.4)$$

The following sections illustrate the regression tree. Section 2.2.2 explains how to find the best point at which to split the data, and Section 2.2.3 explains how to select the tree size.

2.2.2 Partition

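We find the best binary partition by minimizing the sum of squares. The goal of the partition is to decrease the error within each group. We seek the splitting variable $j$ and split point $s$ that define the pair of regions

$$R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}$$

and that minimize the total sum of squares of the two regions:

$$\min_{j, s} \left[ \sum_{x_i \in R_1(j, s)} (y_i - \hat{c}_1)^2 + \sum_{x_i \in R_2(j, s)} (y_i - \hat{c}_2)^2 \right]$$

where $\hat{c}_1$ and $\hat{c}_2$ are the averages of the $y_i$ in $R_1(j, s)$ and $R_2(j, s)$. We partition the data into the two resulting regions and repeat the splitting process on each of them. The process continues until the data are split into individual sections.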
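The search for the splitting variable j and split point s can be sketched as an exhaustive scan. The Python below is an illustrative assumption, not the CART implementation: the names `sum_of_squares` and `best_split` are made up, and candidate split points are taken as midpoints between consecutive distinct values of each variable.

```python
def sum_of_squares(ys):
    """Sum of squared deviations of ys from their average (0 for an empty group)."""
    if not ys:
        return 0.0
    c_hat = sum(ys) / len(ys)
    return sum((y - c_hat) ** 2 for y in ys)


def best_split(rows, ys):
    """Return (j, s, cost) minimizing the sum of squares of R1 and R2."""
    best = None
    for j in range(len(rows[0])):
        levels = sorted({row[j] for row in rows})
        # Assumption: candidate split points are midpoints between distinct values.
        for low, high in zip(levels, levels[1:]):
            s = (low + high) / 2
            left = [y for row, y in zip(rows, ys) if row[j] <= s]   # R1(j, s)
            right = [y for row, y in zip(rows, ys) if row[j] > s]   # R2(j, s)
            cost = sum_of_squares(left) + sum_of_squares(right)
            if best is None or cost < best[2]:
                best = (j, s, cost)
    return best


# Made-up data: variable 0 is time, variable 1 is an uninformative flag.
rows = [(1, 1), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0), (7, 1), (8, 0)]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
print(best_split(rows, ys))  # (0, 4.5, 0.0)
```

Applying `best_split` again to the rows on each side of the chosen split gives the recursive partitioning described above.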

2.2.3 Pruning

How large should we grow the tree? A large tree might over-fit the data; on the contrary, a small tree might not describe the important structure. Tree size will be a parameter to control the model’s complexity, so how do we choose the tree size?

Traditionally, there are two ways to choose a tree size: pre-pruning and post-pruning. Pre-pruning sets criteria that stop the tree from growing, whereas post-pruning grows a complete tree first and then prunes it by some criteria. The criteria of pre-pruning are too short-sighted, however, since a seemingly worthless split might lead to a good split below it. We therefore use post-pruning to choose the tree size in this paper.

Venables and Ripley proposed choosing the terminal nodes by these two criteria:

1. $\max_{s} \Delta R(s, t) \le 0.006\, R(t)$, i.e. the reduction in the sum of squares obtained by a split is no more than 0.006 times the sum of squares before the split.

2. The size of a terminal node is at least 5.

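Then this large tree is pruned by using cost-complexity pruning. Suppose $T$ is a subtree of $T_0$ with $|T|$ terminal nodes, where $|T|$ is the number of terminal nodes of $T$. With terminal node $m$ representing region $R_m$ and $\alpha \ge 0$ a tuning parameter, the cost-complexity criterion is represented by

$$C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2 + \alpha |T|$$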

2.2.4 The challenge of using regression trees to detect mean-shift

Figure 6 shows the relation between the mean-shift trend chart and the regression tree. We can detect the mean-shift in this way, but the approach faces several challenges.

For a given data set, we can plot its trend chart (Figure 7). If we apply different cost-complexity values $\alpha$ to these data, the resulting regression trees differ, as in Figure 8. Given the same $\alpha$, rescaling the data also gives different regression trees, as in Figure 9. If there is more than one mean-shift, a large major mean-shift dominates the decision about a minor mean-shift (Figure 10). The challenge for a regression tree is therefore how to choose an adequate cost-complexity value for the data.

Figure 6 The relation between the mean-shift trend chart and regression tree

Figure 7 Trend chart of raw data (x-axis: time; y-axis: y-value)

Figure 8 The results of tree with different cost-complexity values are different.

Figure 9 The results of tree with different scales are different.

Figure 10 A major mean-shift would dominate the decision of a minor mean-shift (panel: alpha=5; x-axis: time; y-axis: y-value).
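The sensitivity to the cost-complexity value and to the data scale described in Section 2.2.4 can be illustrated with a small calculation. The sketch below assumes the usual cost-complexity form, the residual sum of squares plus $\alpha$ times the number of terminal nodes, and compares three hand-picked nested candidate trees on a made-up series. It does not implement full cost-complexity pruning, and the names, the candidate cuts, and the data are assumptions.

```python
def rss(ys):
    """Residual sum of squares of ys around their average."""
    c_hat = sum(ys) / len(ys)
    return sum((y - c_hat) ** 2 for y in ys)


def cost_complexity(ys, cuts, alpha):
    """Assumed criterion: total RSS of the segments plus alpha per terminal node."""
    bounds = [0, *cuts, len(ys)]
    segments = [ys[a:b] for a, b in zip(bounds, bounds[1:])]
    return sum(rss(seg) for seg in segments) + alpha * len(segments)


def chosen_size(ys, alpha, candidates=((), (20,), (10, 20))):
    """Number of terminal nodes of the candidate with the smallest criterion."""
    best = min(candidates, key=lambda cuts: cost_complexity(ys, cuts, alpha))
    return len(best) + 1


# Made-up series: a minor shift (0 to 1) at t = 10 and a major shift (1 to 10) at t = 20.
series = [0.0] * 10 + [1.0] * 10 + [10.0] * 10
print(chosen_size(series, alpha=1.0))                        # 3
print(chosen_size(series, alpha=10.0))                       # 2
print(chosen_size([10 * y for y in series], alpha=10.0))     # 3
```

With $\alpha = 10$ the minor shift is pruned away, while multiplying the data by 10 multiplies every sum of squares by 100 and brings the minor shift back under the same $\alpha$. This matches the behaviour reported for Figures 8 and 9.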

2.2.5 Cross-Validation

Cross-validation (Stone, 1974; Stone, 1977; Allen, 1977) is the most widely used method for estimating prediction error in machine learning. It is also used in regression trees to choose the optimal model.

Suppose $Y$ is a target variable, $X$ is a vector of inputs, and a prediction model $\hat{f}(X)$ has been estimated from a training sample. Then this method estimates the expected prediction error

$$E\left[ L\left(Y, \hat{f}(X)\right) \right]$$

where $L$ is a loss function. The data are split into K parts; we fit the model with the other $K-1$ parts of the data and then calculate the prediction error on the $k$-th part of the data, where $k = 1, \ldots, K$. Let $\kappa: \{1, \ldots, N\} \to \{1, \ldots, K\}$ be an indexing function, and let $\hat{f}^{-k}(x)$ denote the function fitted with the $k$-th part of the data removed. Then the cross-validation estimate of prediction error is

$$CV = \frac{1}{N} \sum_{i=1}^{N} L\left(y_i, \hat{f}^{-\kappa(i)}(x_i)\right)$$

If $K = N$, it equals leave-one-out cross-validation.
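A minimal sketch of the K-fold estimate, under stated assumptions: the indexing function assigns observation i to fold i mod K, the loss is squared error, and the model is a constant equal to the training average. The names `k_fold_cv`, `fit_mean`, and `squared_error` are illustrative, not taken from the source.

```python
def k_fold_cv(xs, ys, fit, loss, k):
    """Average held-out loss, with each model fitted on the other k - 1 parts."""
    n = len(xs)
    fold_of = [i % k for i in range(n)]  # assumed indexing function kappa(i)
    total = 0.0
    for fold in range(k):
        train_x = [x for x, f in zip(xs, fold_of) if f != fold]
        train_y = [y for y, f in zip(ys, fold_of) if f != fold]
        model = fit(train_x, train_y)    # fitted with this part removed
        for x, y, f in zip(xs, ys, fold_of):
            if f == fold:
                total += loss(y, model(x))
    return total / n


def fit_mean(train_x, train_y):
    """Hypothetical model: predict the training average everywhere."""
    c_hat = sum(train_y) / len(train_y)
    return lambda x: c_hat


def squared_error(y, y_hat):
    return (y - y_hat) ** 2


xs = [0, 1, 2, 3]
ys = [1.0, 2.0, 3.0, 4.0]
print(k_fold_cv(xs, ys, fit_mean, squared_error, k=2))  # 2.0
print(k_fold_cv(xs, ys, fit_mean, squared_error, k=4))  # leave-one-out, about 2.22
```

Passing a different `fit` function, for example one that grows and prunes a tree for a given tuning value, lets the same routine compare candidate models.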

Given a set of models $f(x, \alpha)$ indexed by a tuning parameter $\alpha$, we denote the function fitted with the $k$-th part of the data removed as $\hat{f}^{-k}(x, \alpha)$. Then the cross-validation estimate of prediction error is

$$CV(\alpha) = \frac{1}{N} \sum_{i=1}^{N} L\left(y_i, \hat{f}^{-\kappa(i)}(x_i, \alpha)\right)$$

We find the tuning parameter $\hat{\alpha}$ that minimizes $CV(\alpha)$ and use the model $f(x, \hat{\alpha})$ to fit the data. Traditionally, five-fold or ten-fold cross-validation is widely used to estimate the error. The algorithm is as follows.

Algorithm of Cross-Validation Tree

In general, an outlier is an observed value that is numerically distant from the rest of the data.

However, the appearance of an outlier raises several questions. First, one must ask whether the outlier is present because of some kind of mistake, such as an external factor. Because an outlier changes a model fitted by the least squares principle, we may consider discarding it.

On the other hand, the outlier may carry important information, such as a key finding. Therefore we suggest deleting it only when we are certain that the outlier is due to other causes.

In regression analysis, points that have enough influence to change the model are called high leverage points [12]. Figure 11 shows that the model changes considerably when point A is present.

Figure 11 Example of an influential point
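The effect of an influential point on a least squares fit can be checked numerically. The sketch below fits a simple linear regression with and without a made-up point A and reports the leverage of each observation, using the standard leverage formula for one predictor. The data and function names are illustrative assumptions and are not the values plotted in Figure 11.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def leverages(xs):
    """Leverage h_ii = 1/n + (x_i - x_bar)^2 / Sxx for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Made-up data lying on y = x, plus a hypothetical point A far from the trend.
xs, ys = [0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]
point_a = (5.0, 0.0)

slope_without, _ = fit_line(xs, ys)
slope_with, _ = fit_line(xs + [point_a[0]], ys + [point_a[1]])
print(slope_without, slope_with)
print(leverages(xs + [point_a[0]]))
```

In this example the single added point pulls the slope from 1 to nearly 0, and it also has the largest leverage of the five observations.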
