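Appendix - Using Statistical Methods to Automatically Identify Tool Differences According to Engineering Tolerance and Its Application to Semiconductor Process Improvement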

Appendix A. Introduction to regression trees

Classification and regression trees (CART) [22] is a recursive partitioning algorithm that divides data into several homogeneous groups.

Classification trees are applied to categorical responses and regression trees to continuous responses. Figure A.1 gives a brief graphical introduction to how a tree is constructed. In our research the response is continuous, so we focus only on regression trees below.

Suppose our data consist of $p$ input variables and a response for each of $N$ observations, that is, $(x_i, y_i)$, $i = 1, \ldots, N$, with $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$. The algorithm automatically decides the splitting variables from $(x_{i1}, x_{i2}, \ldots, x_{ip})$ and the split points.

Suppose first that we have a partition into $\kappa$ regions $R_1, R_2, \ldots, R_\kappa$. Then our response model is denoted by

$$f(x) = \sum_{m=1}^{\kappa} c_m I(x \in R_m) \tag{A.1}$$

where $c_m$ is a constant in each region.

Figure A.1 Construction of a tree.

If we adopt minimization of the sum of squares $\sum_i (y_i - f(x_i))^2$ as the criterion for the split rule, we use $\hat{c}_m$ to estimate $c_m$, where $\hat{c}_m$ is the average of $y_i$ in the region $R_m$:

$$\hat{c}_m = \mathrm{average}(y_i \mid x_i \in R_m) \tag{A.2}$$

We illustrate regression trees in three sections: how to find the best splitting variable and split point to partition the data at each node (A.1), how to decide the tree size (A.2), and how to view the tree result (A.3).
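The piecewise-constant model (A.1) and the estimate (A.2) can be sketched in a few lines of Python. This is only an illustration: the data, the fixed partition at x = 2.5, and the helper names `fit_region_means`, `predict` and `region_of` are assumptions made for the example and do not come from the research.

```python
from statistics import mean

def fit_region_means(xs, ys, region_of):
    """Estimate c_m as the average of y_i over observations with x_i in R_m (A.2)."""
    grouped = {}
    for x, y in zip(xs, ys):
        grouped.setdefault(region_of(x), []).append(y)
    return {m: mean(vals) for m, vals in grouped.items()}

def predict(x, c_hat, region_of):
    """Evaluate f(x) = sum_m c_m * I(x in R_m); only the region containing x contributes."""
    return c_hat[region_of(x)]

# Hypothetical one-dimensional data and a hand-picked partition into two regions.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 12.0, 20.0, 22.0]

def region_of(x):
    return 1 if x <= 2.5 else 2

c_hat = fit_region_means(xs, ys, region_of)
print(c_hat)                            # {1: 11.0, 2: 21.0}
print(predict(3.7, c_hat, region_of))   # 21.0
```

Here the partition is given in advance; in the regression tree algorithm it is the result of the splitting process.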

A.1. How to find the best splitting variable and split point to partition the data at each node

Start from all of the data and choose a splitting variable $X_j$. If the split variable $X_j$ is a continuous variable, then a split point $s$ defines the pair of half-planes

$$R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}. \tag{A.3}$$

If the split variable $X_j$ is a categorical variable, then we find a split set $s$ and split the data into a pair of regions according to whether $X_j$ belongs to $s$. We partition the data into the two resulting regions and repeat the splitting process on each of them. The process thus grows a tree step by step and splits the data into several terminal nodes.
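As a rough sketch of one splitting step for continuous variables, the search below tries every variable and every candidate split point and keeps the pair (j, s) with the smallest total within-region sum of squares. It assumes the sum-of-squares criterion stated earlier; the choice of midpoints as candidate split points, the toy data, and the names `sse` and `best_split` are assumptions of this example.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(X, y):
    """Return (j, s, total) minimizing SSE(R1(j, s)) + SSE(R2(j, s))."""
    best = None
    for j in range(len(X[0])):
        # Candidate split points: midpoints between consecutive distinct values of X_j.
        levels = sorted({row[j] for row in X})
        for lo, hi in zip(levels, levels[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= s]   # R1(j, s)
            right = [yi for row, yi in zip(X, y) if row[j] > s]   # R2(j, s)
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best

# Hypothetical data with two input variables.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0]]
y = [10.0, 12.0, 20.0, 22.0]
j, s, total = best_split(X, y)
print(j, s, total)   # 0 2.5 4.0
```

Growing a tree amounts to applying `best_split` again to the observations in each of the two resulting regions.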

A.2. How to decide the tree size

How large should we grow the tree? A large tree might overfit the data, whereas a small tree might fail to capture important structure. Tree size is therefore a parameter that controls the model's complexity, and choosing a reasonable tree size is very important in the CART algorithm.

Traditionally, there are two ways to choose a tree size: pre-pruning and post-pruning. Pre-pruning sets criteria that determine when to stop growing the tree, whereas post-pruning grows a complete tree first and then prunes it according to some criterion. Since suitable values for the pre-pruning criteria are difficult to determine, we use post-pruning to choose the tree size in our research.

Cost-complexity pruning is one of the popular post-pruning methods. Suppose $T$ is a tree obtained from the splitting method in A.1. The cost-complexity criterion is represented by

$$C_\alpha(T) = \sum_{k=1}^{|T|} N_k Q_k(T) + \alpha \times |T| \tag{A.8}$$

where

$N_k$ is the number of observations falling in the region $R_k$,

$k$ is the index of the terminal nodes of the binary tree $T$,

$|T|$ is the number of terminal nodes in $T$,

$Q_k(T)$ is the impurity of terminal node $k$ (under the sum-of-squares criterion, the mean squared error of $y_i$ about $\hat{c}_k$ within $R_k$), and

$\alpha$ is the cost-complexity ($\alpha \ge 0$).

For a given cost-complexity $\alpha$, we can find the subtree $T_\alpha$ of $T$ that minimizes $C_\alpha(T)$. From this formula, the larger the value of $\alpha$, the smaller the subtree $T_\alpha$ that we obtain. For each given value of $\alpha$, there is a unique smallest subtree $T_\alpha$. If $\alpha = 0$, we get the full tree.

Because the meaning of the cost-complexity $\alpha$ is difficult to connect with the concept of engineering tolerance control, it is hard for engineers to choose an appropriate $\alpha$ that yields a reasonable tree size.
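The trade-off in (A.8) can be illustrated with a small Python sketch that evaluates the criterion for a few nested candidate partitions of five tools and picks the minimizer for several values of α. Everything here is assumed for illustration: the measurements are made up (they are not the simulated data of Case I), only three candidate partitions are compared, and $N_k Q_k(T)$ is taken to be the within-node sum of squares.

```python
def within_ss(values):
    """Within-node sum of squares, i.e. N_k * Q_k with Q_k the node mean squared error."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def cost_complexity(labels, data, alpha):
    """C_alpha(T) = sum_k N_k Q_k(T) + alpha * |T| for the partition given by `labels`."""
    nodes = {}
    for label, tool_values in zip(labels, data):
        nodes.setdefault(label, []).extend(tool_values)
    fit = sum(within_ss(v) for v in nodes.values())
    return fit + alpha * len(nodes)

# Hypothetical measurements for tools T1..T5 (two observations per tool).
data = [[10.0, 10.2], [10.4, 10.6], [12.0, 12.2], [12.4, 12.6], [30.0, 30.2]]

# Candidate partitions, written as group labels for (T1, T2, T3, T4, T5).
candidates = [(1, 2, 3, 4, 5), (1, 1, 2, 2, 3), (1, 1, 1, 1, 2)]

chosen = {
    alpha: min(candidates, key=lambda g: cost_complexity(g, data, alpha))
    for alpha in (0, 1, 90)
}
for alpha, g in chosen.items():
    print(alpha, g)
# 0 (1, 2, 3, 4, 5)
# 1 (1, 1, 2, 2, 3)
# 90 (1, 1, 1, 1, 2)
```

A larger α penalizes each terminal node more heavily, so the minimizer moves toward coarser partitions. The value of α at which the choice changes depends on the scale of the data, which is one reason it is hard to relate to an engineering tolerance.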

A.3. How to view the tree result

The structure of the tree is important information provided by the regression tree algorithm, because it reveals the similarity within our data. Data that fall into the same terminal node are the most similar according to the algorithm. Among data that end up in different terminal nodes, the more similar they are, the later they are split apart. So we can read the similarities of tool performance from the tree structure in our research. We use the result of Case I in Section 3.1 as an example of how to read the tree structure.

Table A.1. Partitioning results with respect to different values of cost complexity in the CART model for the simulated data in Case I.

Partitioning result    Cost-complexity
(1,2,3,4,5)            0
(1,1,2,2,3)            1
(1,1,1,1,2)            90

Figure A.2. A tree obtained by the CART model for Section 3.1.

From Table A.1, we get the partition result (1,2,3,4,5) when the cost-complexity is 0; that is, each tool is placed in its own group. If we set the cost-complexity to 1, we get the partition result (1,1,2,2,3); that is, tools T1 and T2 belong to one group, T3 and T4 belong to another group, and T5 forms a group of its own.

We can get the same information from the structure of the tree in Figure A.2. Because T1 is more similar to T2 than to T4, T1 and T2 are split into different groups later than T1 and T4 are. In this way, we can understand the similarity among T1, T2, T3, T4, and T5 from the tree structure.
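One way to make this reading concrete is to look up, for any pair of tools, the smallest tabulated cost-complexity at which they share a group. The sketch below uses only the three rows of Table A.1; the function name `merge_level` is an assumption of this example.

```python
# Partitioning results from Table A.1, keyed by cost-complexity.
table_a1 = {
    0: (1, 2, 3, 4, 5),
    1: (1, 1, 2, 2, 3),
    90: (1, 1, 1, 1, 2),
}
tools = ["T1", "T2", "T3", "T4", "T5"]

def merge_level(a, b):
    """Smallest tabulated cost-complexity at which tools a and b share a group, else None."""
    i, j = tools.index(a), tools.index(b)
    for alpha in sorted(table_a1):
        labels = table_a1[alpha]
        if labels[i] == labels[j]:
            return alpha
    return None

print(merge_level("T1", "T2"))   # 1
print(merge_level("T1", "T4"))   # 90
print(merge_level("T1", "T5"))   # None
```

A smaller merge level means the two tools stay together deeper into the tree, which corresponds to higher similarity: T1 is closer to T2 than to T4, and T5 is never grouped with T1 in the tabulated results.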

Appendix B. Introduction to Gibbs sampling

Gibbs sampling, also called alternating conditional sampling, is a particular Markov chain algorithm that is useful in many multidimensional problems. It was named by Geman and Geman [49], who used it to analyze Gibbs distributions on lattices.

The work of Gelfand and Smith [50] and Gelfand et al. [51] later introduced Gibbs sampling into mainstream statistics. To date, most statistical applications of MCMC have used Gibbs sampling.

Suppose $P(y \mid \theta)$ is the data distribution with $d$-dimensional parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$ and $P(\theta)$ is the related prior distribution. Then $P(\theta, y)$ is the joint density of $\theta$ and $y$, with $P(\theta, y) = P(y \mid \theta) P(\theta)$, and $P(\theta \mid y)$ is the posterior density, with

$$P(\theta \mid y) = \frac{P(\theta, y)}{P(y)} = \frac{P(y \mid \theta) P(\theta)}{P(y)} \propto P(y \mid \theta) P(\theta).$$

For Bayesian inference, our target density is the posterior density $P(\theta \mid y)$, and we can use Gibbs sampling to construct a Markov chain that converges to it.

Suppose $P(\theta_j \mid \theta_{-j}, y)$ is the conditional distribution of $\theta_j$ given all the other components of $\theta$, where $\theta_{-j}$ represents all components of $\theta$ except $\theta_j$. A Markov chain is constructed by Gibbs sampling as follows.

At each iteration $t$, we choose one component $\theta_j$ of $\theta$ to update. When we update the $j$th component, $\theta_j^t$ is sampled from the conditional distribution $P(\theta_j \mid \theta_{-j}^{t-1}, y)$, where $\theta_j^t$ represents the $j$th component of $\theta$ at iteration $t$ and $\theta_{-j}^{t-1}$ represents all the components of $\theta$ except $\theta_j$ at their current values at iteration $t-1$. By repeating such iterations, we construct a Markov chain. If the Markov chain is irreducible and aperiodic, it converges to our target density $P(\theta \mid y)$.

Therefore, if we can obtain the conditional distributions $P(\theta_j \mid \theta_{-j}, y)$ for $j = 1, \ldots, d$, we can construct the Markov chain by Gibbs sampling.
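The scheme above can be sketched for a toy target whose full conditionals are known in closed form. The example assumes a standard bivariate normal with correlation ρ as a stand-in for the posterior, for which $\theta_1 \mid \theta_2 \sim N(\rho\theta_2, 1-\rho^2)$ and symmetrically for $\theta_2$. The target, the function names and the burn-in length are assumptions of this illustration, not the TCP model. Each pass of the loop updates both components in turn.

```python
import random

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho (toy target)."""
    rng = random.Random(seed)
    sd = (1.0 - rho ** 2) ** 0.5          # conditional standard deviation
    theta1, theta2 = 0.0, 0.0             # arbitrary starting values
    draws = []
    for _ in range(n_iter):
        # Update each component from its full conditional, given the current other one.
        theta1 = rng.gauss(rho * theta2, sd)
        theta2 = rng.gauss(rho * theta1, sd)
        draws.append((theta1, theta2))
    return draws

def correlation(draws):
    """Sample correlation between the two components of the chain."""
    n = len(draws)
    m1 = sum(a for a, _ in draws) / n
    m2 = sum(b for _, b in draws) / n
    cov = sum((a - m1) * (b - m2) for a, b in draws) / n
    v1 = sum((a - m1) ** 2 for a, _ in draws) / n
    v2 = sum((b - m2) ** 2 for _, b in draws) / n
    return cov / (v1 * v2) ** 0.5

# Discard the first 1000 draws as burn-in, then check the chain against the target.
draws = gibbs_bivariate_normal(rho=0.8, n_iter=20000)[1000:]
mean1 = sum(a for a, _ in draws) / len(draws)
print(round(mean1, 2), round(correlation(draws), 2))
```

After burn-in, the sample mean should be close to 0 and the sample correlation close to ρ, which is how one would informally check that the chain has reached the target density.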

Appendix C. The derivation of conditional distributions in TCP method

In Appendix C, we present the derivations of the conditional distributions of all parameters in $\Theta = \{\theta_1, \theta_2, \ldots, \theta_J, \sigma^2, \mu_1, \mu_2, \ldots, \mu_\kappa, \tau^2\}$ based on a given partition $g$. In the following, we use the notation $\Theta_{-\theta_j}$ to indicate the set of all parameters except the parameter $\theta_j$.

C.1 $\theta_j \mid \Theta_{-\theta_j}$

$= P(g, \mu, \tau^2, \theta, \sigma^2, y) \,/\, P(g, \mu, \tau^2, \sigma^2, y)$

Appendix D. The derivation of acceptance probability in TCP method

Following the introduction of RJMCMC in Section 2.1, we consider the acceptance probability $R$ of jumping from the current model $M_k$ (that is, $(k, \theta_k)$) to a new model. It involves the proposal jump probability for a jump from the current model to the new model and the proposal jump probability ratio, PBR.

We first consider the birth move type: starting from the current partition $g^{(1)}$ with $\kappa^{(1)}$ groups, we randomly choose a group that includes at least two tools and split it. Since the length of $\mu^{(1)} = (\mu_1, \mu_2, \ldots, \mu_{\kappa^{(1)}})$ also increases by one, we add a new random variable $z$, independently distributed as Normal($\mu_z$, $\sigma_z^2$), for dimension matching. Suppose that we choose the group $S_k$ to split into two new groups $S_{k1}$ and $S_{k2}$. Substituting these densities in (A.1) according to our data distribution, prior distributions, proposal jump probability and bijection function in TCP, we obtain the ratio $A$ that enters the acceptance probability.

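Here $P(g^{(i)})$ is the prior probability of the partition $g^{(i)}$, which is defined through the number of partitions of a given degree, and $P_{\mathrm{death}}$ and $P_{\mathrm{birth}}$ are the proposal probabilities for the death move type and the birth move type, respectively. The proposal jump probability ratio is

$$\mathrm{PBR} = \frac{P_{\mathrm{death}}}{P_{\mathrm{birth}}} \cdot \{\#\text{ of } S_k \text{ whose } w_k \ge 2\} \cdot \frac{2}{\kappa^{(1)}(\kappa^{(1)}+1)} \cdot \{2^{w_k - 1} - 1\} \cdot \frac{1}{f(z)},$$

and the acceptance probability is $\min\{1, A\}$ for the birth move type.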

Without loss of generality, since the death move type is the reverse of the birth move type, the acceptance probability is $\min\{1, 1/A\}$ for the death move type.
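Two ingredients of the birth and death moves can be sketched in Python: enumerating the ways to split a group of w tools into two non-empty groups (there are $2^{w-1} - 1$ of them), and turning a ratio A into an acceptance probability. The function names, the tool labels and the numeric values of A below are hypothetical; computing A itself requires the TCP densities and is not attempted here.

```python
from itertools import combinations

def two_way_splits(group):
    """All unordered splits of `group` into two non-empty groups (2**(w-1) - 1 of them)."""
    first, rest = group[0], tuple(group[1:])
    splits = []
    # Keep the first tool on the left side so that each unordered split appears once.
    for r in range(len(rest)):            # r = number of other tools joining `first`
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(t for t in rest if t not in extra)
            splits.append((left, right))
    return splits

def acceptance_probability(A, move):
    """min{1, A} for a birth move and min{1, 1/A} for the reverse death move."""
    if move == "birth":
        return min(1.0, A)
    if move == "death":
        return min(1.0, 1.0 / A)
    raise ValueError("move must be 'birth' or 'death'")

splits = two_way_splits(("T1", "T2", "T3"))
print(splits)
# [(('T1',), ('T2', 'T3')), (('T1', 'T2'), ('T3',)), (('T1', 'T3'), ('T2',))]
print(acceptance_probability(0.25, "birth"))   # 0.25
print(acceptance_probability(0.25, "death"))   # 1.0
```

In a sampler, a proposed move would be accepted when a uniform random number on [0, 1) falls below the returned probability.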
