
A Classification-Based Fault Detection and Isolation Scheme for the Ion Implanter

Shin-Yeu Lin and Shih-Cheng Horng

Abstract—We propose a classification-based fault detection and isolation scheme for the ion implanter. The proposed scheme consists of two parts: 1) the classification part and 2) the fault detection and isolation part. In the classification part, we propose a hybrid classification tree (HCT) with learning capability to classify the recipe of a working wafer in the ion implanter, and a k-fold cross-validation error is treated as the accuracy of the classification result. In the fault detection and isolation part, we propose warning signal generation criteria based on the classification accuracy to detect a fault and a fault isolation scheme based on the HCT to isolate the actual fault of an ion implanter. We have compared the proposed classifier with existing classification software and tested the validity of the proposed fault detection and isolation scheme on real cases, obtaining successful results.

Index Terms—Classification, classification and regression tree (CART), clustering algorithm, fault detection and isolation, ion implanter.

I. INTRODUCTION

AN ION implanter [1] is a bottleneck machine in the semiconductor manufacturing process because it is expensive; thus, ion implantation is a critical operation for throughput. A wafer damaged by a malfunction of the ion implanter is not reworkable; hence, such damage significantly affects the yield. Therefore, real-time fault detection to prevent further wafer damage and fault isolation to reduce the downtime of the ion implanter are crucial issues for the yield and throughput of the semiconductor manufacturing process. There are two categories of fault detection methods, the model-based methods and the model-free methods. The model-based methods, which utilize the mathematical model of the plant, originated from chemical process control, aerospace-related research, and other areas, and have been developed over the last three decades [2]–[9]. Model-free methods, which do not use the mathematical model of the plant, range from physical redundancy and limit value checking [10] to spectrum analysis [11], [12]. Among them, the limit value checking method is widely used in practice. There are also two types of fault isolation methods [13], [14], the classification methods and the inference methods. If a priori knowledge of the relationships between the measured data patterns and faults is not available, classification methods are used.

Manuscript received May 31, 2005; revised July 6, 2006. This work was supported in part by the National Science Council, Taiwan, R.O.C., under Grant NSC94-2213-E-009-044.

The authors are with the Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan, R.O.C. (e-mail: sylin@cc.nctu.edu.tw; schong.ece90g@nctu.edu.tw).

Color versions of Figs. 3 and 4 available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSM.2006.883594

For example, a neural network, trained using a large set of abnormal data patterns and known fault pairs, can be used to classify the corresponding fault of an abnormal data pattern. If a priori knowledge of the relationships between faults and measured data patterns is available, a rule-based expert system can be used to infer the corresponding fault of an abnormal data pattern.

Regarding fault detection, since there does not exist any proper model for the ion implanter, the model-based fault detection methods cannot be applied. Thus, the limit value checking method is currently employed in some semiconductor manufacturing companies. The structure of an ion implanter is shown in Fig. 1 [1]. In general, the equipment supplier provides digital equipment to monitor the proper operation of the scanning subsystem of the machine. Well-trained engineers employ the limit value checking method to investigate the SPC charts [15] of the measured parameters of the other major subsystems, such as the ion source (filament), extraction electrode, mass analysis, and acceleration subsystems, to monitor their operations.

The measured parameters can be, for example, the filament voltage, filament current, discharge voltage, etc. However, there are several tens to hundreds of recipes¹ for wafer fabrication in a semiconductor foundry each day. Although the setting of the scanning subsystem is independent of the recipes, the other four subsystems' parameters may vary widely over the various recipes. This induces the first drawback of the limit value checking method, that is, the difficulty of defining a threshold that distinguishes one recipe from the others. Since each recipe involves a combined setting of the four subsystems, this induces the second drawback of the limit value checking method, namely that it cannot provide combined information about the measured parameters of the four subsystems. In addition, the occurrence of electrical spikes in the ion implanter will make the measured parameters exceed the threshold and indicate a fault situation; however, electrical spikes are not actual machine faults. This is the third drawback of the limit value checking method. Regarding fault isolation, both classification methods and inference methods require a fairly large set of abnormal data patterns with known faults to train a classifier or construct a rule-based expert system, respectively. Collecting a large set of abnormal data patterns with known faults in an ion implanter is very difficult, because there are several hundreds of steps in fabricating a chip and a chip failure is most probably discovered only when the chip is under test.

¹A recipe controls how settings are initialized or changed during a process step. Examples include recipe numbers that index tables of set points in furnaces or written instructions to operators. A recipe is usually considered constant during any one process step. In this paper, a recipe corresponds to a specific product of integrated circuit.


Fig. 1. Structure of ion implanter.

Finding out which step in the complete manufacturing process causes the failure is already difficult, not to mention collecting a large set of abnormal data patterns with known faults due to ion implantation. Thus, the purpose of this paper is to propose an automatic (i.e., needing no well-trained engineers) and effective tool to monitor the above-mentioned four subsystems as a whole, generate a warning signal once a machine fault occurs, and isolate that fault.

To overcome the first two drawbacks of the limit value checking method, we should be able to identify the recipe of the working wafer from the measured parameters of all four subsystems. This makes data mining techniques [16] attractive. To overcome the third drawback of the limit value checking method, we need to distinguish electrical spikes from actual machine faults. Motivated by these considerations, we propose a classification-based fault detection and isolation scheme for the ion implanter. Viewing a recipe as a class, we can classify the recipe of the working wafer based on the corresponding measured parameters of the four subsystems. The overall structure of the proposed fault detection and isolation scheme is shown in Fig. 2. Our scheme starts by classifying the recipe of the working wafer based on the measured parameters. If the classified recipe of the working wafer matches its destined one, we assume there is no fault and proceed with the next wafer. This no-fault assumption may cause only a few damaged wafers in the worst case; a detailed analysis of this claim will be addressed in Section IV. On the other hand, if the classified recipe does not match its destined one, a double check of the recipe command should be carried out. If the command is wrong, the operator will be informed; otherwise, the warning signal generation criteria will be tested. If the criteria are satisfied, we conclude that there is a machine fault and a warning signal will be generated; otherwise, we will proceed with the next wafer. Once a warning signal is generated, we will perform the fault isolation scheme to isolate the fault. In short, the proposed fault detection and isolation scheme involves three major problems. The first is a classification problem, which is to classify the recipe of a working wafer. The second is a fault detection problem, which is to determine whether there is a machine fault and generate a warning signal if there is one. The third is a fault isolation problem, which is to determine which subsystem has a fault. In this paper, we propose a hybrid classification tree (HCT) with good learning capability to deal with the classification problem. The HCT combines a proposed clustering algorithm

Fig. 2. Proposed fault detection and isolation scheme for ion implanter.

with the classification and regression tree (CART) [17], [18] to take advantage of the specific setting of a recipe during a process step. Its good learning capability enables it to work online. Since the operator should interrupt wafer processing immediately when a fault is detected, a high standard of accuracy in fault detection is required so as not to degrade the throughput unnecessarily. Thus, to account for the possible inaccuracy caused by the HCT, we propose warning signal generation criteria to deal with the fault detection problem. These criteria aim to minimize the probability of a false alarm when there is no fault as well as the probability of no alarm while a fault exists; the former tries to eliminate indicated fault situations due to electrical spikes and classification errors, while the latter tries to find the hidden machine faults when a classified recipe matches the destined one. However, we need not worry about the latter because of the no-fault assumption mentioned previously. To cope with the fault isolation problem, we propose an HCT-based fault isolation scheme. The basic idea of this scheme is to find the parameter (or parameters) that causes the classification errors. Unlike the existing methods, which need to collect a fairly large set of measured data patterns with known faults, the proposed fault isolation scheme requires almost no extra effort, as will be seen in Section III-B.


From here on, we will use the classification terms attribute and data pattern to represent, respectively, a measured parameter and the measured data of the four subsystems of the ion implanter.

We organize our paper in the following manner. In Section II, we will present the HCT and its learning capability. In Section III, we will analyze the probability of no alarm while a machine fault exists, to verify the no-fault assumption, and present the criteria for generating the warning signal. We will also present the fault isolation scheme in this section. In Section IV, we will apply the HCT to real data sets to obtain the k-fold cross-validation classification errors, based on which we will demonstrate the validity of the proposed warning signal generation criteria and the fault isolation scheme. We will also investigate the learning capability of HCT by reporting the computation time needed to update the classification rules of HCT. In Section V, we conclude.

II. HCT

A. Introduction

There exist numerous classification techniques for classification problems with continuous attributes, such as the neural network approach [19], [20], the maximum-likelihood approach [21], [22], the fuzzy set theory-based approach [23], [24], decision trees [25], [26], CART [17], [18], kernel-based learning algorithms [27], and recent methods like random forests [28], multiple additive regression trees (MART) [29], [30], and boosting flexible learning ensembles with dynamic feature selection [31]. Among them, the neural network approach is superior in being free of assumptions on data distribution and data importance; however, it is computationally expensive and produces variable results due to the random initial weights. The maximum-likelihood approach was the most widely used method in classifying remotely measured data; however, its performance is degraded when the target classes cannot be adequately described by the statistical model. The fuzzy set theory-based approach has been successfully applied to pattern classification problems; however, its computational complexity rises when the number of classes as well as the number of attributes is large. A decision tree is mainly designed for the classification of discrete variables. However, CART can handle continuous attributes. Compared with random forests, MART, and boosting flexible learning ensembles with dynamic feature selection, the disadvantage of CART is inaccuracy due to its nature of piecewise constant approximation. However, the biggest advantage of CART is its interpretability, whereas the previously mentioned three methods and the kernel-based learning algorithms are thought to lack this feature. Interpretability is the key feature of our HCT-based fault isolation scheme, however, at the expense of some classification accuracy. Fortunately, the decrease in accuracy will be remedied by the warning signal generation criteria when applying the HCT to the fault detection of the ion implanter, as will be presented in Section III-A.

The tree size of CART is closely related to its interpretability and accuracy. A small tree can be easily interpreted, while the interpretability of a large tree is questionable. On the other hand, a larger tree is more accurate than a smaller


Fig. 3. (a) Separable recipes. (b) Nonseparable recipes.

one. Thus, to retain the interpretability of a small tree while keeping the accuracy of a large tree, we propose a preprocessing step that reduces the tree size of CART to improve its interpretability while keeping its classification accuracy. In general, a recipe may contain various steps, and a recipe step remains constant during the processing of one wafer; however, different attributes (parameters) may be ramped during the processing step. Nonetheless, the mean of some (not all, as can be observed from the experimental results shown in Fig. 10) attributes over each individual recipe step is still a key to distinguishing the recipes. Thus, we can exploit this property to fulfill the objective of preprocessing. To do this, we propose a separation matrix-based clustering algorithm as a preprocessing step for CART. This clustering algorithm will classify the whole data set into a clustering tree, and the classes in the leaf clusters will be classified by CART. Because both the size and the number of classes of a leaf cluster are much smaller than those of the original data set, the computational complexity of CART can be improved.

B. Separation Matrices-Based Clustering Algorithm

Due to the previously mentioned property of a recipe during a processing step, we can investigate the separability between two recipes through the degree of overlap of their attribute values. For example, suppose the probability density functions of an attribute for two recipes are as shown in Fig. 3(a). Then, these two recipes are separable based on that attribute, while in the case of Fig. 3(b), they are not. Throughout this section, we will use the classification term class to represent a recipe.

1) Chebyshev Inequality-Based Separation Matrices: We let $D_k(C_i, C_j)$ denote the separation index between classes $C_i$ and $C_j$ based on attribute $p_k$ and define

$$D_k(C_i, C_j) = \begin{cases} 1, & \text{if } C_i \text{ and } C_j \text{ are separable based on attribute } p_k\\ 0, & \text{otherwise.} \end{cases} \tag{1}$$


Fig. 4. Illustration of the separation between $C_i$ and $C_j$ based on attribute $p_k$.

Clearly, $D_k(C_i, C_j) = D_k(C_j, C_i)$ and $D_k(C_i, C_j) \in \{0, 1\}$ for any attribute $p_k$. The value of $D_k(C_i, C_j)$ is computed using the Chebyshev inequality [32]. We let the random variable $X_{ik}$ denote the $k$th attribute of class $C_i$ and let $\mu_{ik}$ and $\sigma_{ik}$ denote the mean and standard deviation of $X_{ik}$, respectively. Let $\alpha$ be a positive real number such that $P(|X_{ik} - \mu_{ik}| \ge \alpha\sigma_{ik}) \le \varepsilon$, where $P(\cdot)$ denotes the probability of the event and $\varepsilon$ is a small real number representing low probability, which is usually set to be 0.05. The value of $\alpha$ corresponding to a given $\varepsilon$ can be calculated by setting $1/\alpha^{2} = \varepsilon$ using the Chebyshev inequality. Without loss of generality, we can assume $\mu_{ik} \le \mu_{jk}$. We let $P_k(C_i, C_j)$ denote an upper bound, obtained from the Chebyshev inequality, on the probability that the attribute values of $C_i$ and $C_j$ overlap on attribute $p_k$; a very small positive real number is included to keep the denominator of the square term from being zero or negative. If the gap between $\mu_{ik}$ and $\mu_{jk}$ is sufficiently larger than the spreads $\sigma_{ik}$ and $\sigma_{jk}$, $P_k(C_i, C_j)$ will be very small, which implies the overlap of classes $C_i$ and $C_j$ on attribute $p_k$ will be very small; consequently, the classes $C_i$ and $C_j$ are more likely to be separable, as illustrated in Fig. 4. Therefore, we can define a threshold value $\hat{p}$ such that the separation index for classes $C_i$ and $C_j$ can be calculated by

$$D_k(C_i, C_j) = \begin{cases} 1, & \text{if } P_k(C_i, C_j) \le \hat{p}\\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

Now we can define $[D_k(C_i, C_j)]$ as the separation matrix for all classes based on attribute $p_k$, whose $(i, j)$th entry is $D_k(C_i, C_j)$.
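To make the construction concrete, the following Python sketch estimates a Chebyshev-style overlap bound for two classes on one attribute and thresholds it as in (2). The algebraic form of `overlap_bound`, the default threshold value, and all names are illustrative assumptions, not the paper's exact formula.

```python
def overlap_bound(mu_i, sigma_i, mu_j, sigma_j, delta=1e-6):
    """Chebyshev-style upper bound on the overlap of two classes on one
    attribute: small when the gap between the means is large relative to
    the spreads.  The exact algebraic form here is an assumption."""
    lo, hi = sorted([(mu_i, sigma_i), (mu_j, sigma_j)])
    gap = max(hi[0] - lo[0], delta)          # keep the denominator positive
    return ((lo[1] + hi[1]) / gap) ** 2

def separation_index(mu_i, sigma_i, mu_j, sigma_j, p_hat=0.05):
    """Eq. (2): the classes are declared separable (index 1) on this
    attribute when the overlap bound falls below the threshold p_hat."""
    return 1 if overlap_bound(mu_i, sigma_i, mu_j, sigma_j) <= p_hat else 0

# Example: well-separated vs. overlapping recipes on one attribute.
print(separation_index(10.0, 0.2, 14.0, 0.3))   # -> 1 (separable)
print(separation_index(10.0, 1.5, 10.5, 1.8))   # -> 0 (not separable)
```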

2) Splitting Clusters Using Separation Matrices: We let Cr0 denote the root cluster, which represents the whole data set. Treating each class in Cr0 as a node, we can view the separation matrix $[D_k(C_i, C_j)]$ as an incidence matrix for all nodes in Cr0 based on attribute $p_k$: nodes $C_i$ and $C_j$ will be connected by an arc if $D_k(C_i, C_j) = 0$. The graph constructed based on a separation matrix is called a separation graph, which may contain several separate connecting subgraphs. Each connecting subgraph represents a cluster of nonseparable classes based on attribute $p_k$, and the number of disjoint subgraphs represents the number of disjoint clusters that can be split from Cr0 using attribute $p_k$. For example, the separation graph constructed from the separation matrix given in Fig. 5(a) is shown in Fig. 5(b), which consists of two disjoint clusters, or two separate connecting subgraphs. The resulting clusters can be further split by other attributes. For example, one cluster in Fig. 5(b) can be further split by another attribute, whose separation matrix is shown in Fig. 6(a), in the following manner. Collecting the rows and columns of that separation matrix corresponding to the classes in the cluster forms

Fig. 5. (a) Example separation matrix $[D_k(C_i, C_j)]$. (b) Separation graph resulting from the separation matrix in (a).

Fig. 6. (a) Separation matrix $[D_{k'}(C_i, C_j)]$ of another attribute. (b) Submatrix of $[D_{k'}(C_i, C_j)]$ corresponding to one cluster in Fig. 5(b). (c) Clusters split from that cluster using the submatrix shown in (b).

the submatrix shown in Fig. 6(b). Repeating the same process used to split Cr0, that cluster can be split into two child clusters, as shown in Fig. 6(c), by using the submatrix shown in Fig. 6(b).
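The separation-graph splitting can be pictured as a connected-components computation on the graph whose arcs join non-separable classes. The following sketch is one illustrative implementation under that reading; the representation of the separation matrix as a dict of class pairs is an assumption, not the paper's data structure.

```python
def split_cluster(classes, sep_matrix):
    """Split a cluster into groups of mutually non-separable classes.

    classes    : list of class labels in the cluster
    sep_matrix : dict mapping (ci, cj) -> 0/1 separation index for one
                 attribute (symmetric)

    Two classes are joined by an arc when their separation index is 0; each
    connected component of that graph is one child cluster.
    """
    def sep(a, b):
        return sep_matrix.get((a, b), sep_matrix.get((b, a), 0))

    remaining, clusters = set(classes), []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            node = frontier.pop()
            linked = {c for c in remaining if sep(node, c) == 0}
            remaining -= linked
            frontier.extend(linked)
            component |= linked
        clusters.append(sorted(component))
    return clusters

# Toy example: classes 1 and 2 are not separable from each other but are
# separable from 3 and 4 on this attribute.
D = {(1, 2): 0, (1, 3): 1, (1, 4): 1, (2, 3): 1, (2, 4): 1, (3, 4): 0}
print(split_cluster([1, 2, 3, 4], D))   # two child clusters: [1, 2] and [3, 4]
```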

3) Choice of Attributes for Cluster Splitting and Construction of the Clustering Tree: Because the separation matrix already indicates a certain distribution of the attribute values of all classes, we can employ a coarser partition, such as fuzzy intervals, to classify the disjoint clusters instead of treating each continuous value as a discrete one as CART does. In general, for a given


Fig. 7. (a) Separation matrix $[D_k(C_i, C_j)]$ of one attribute. (b) Separation matrix $[D_{k'}(C_i, C_j)]$ of another attribute.

range of attribute values, a finer fuzzy partition is needed to classify a cluster with a larger number of classes. In other words, for a given fuzzy partition and range of attribute values, the classification will be more accurate for a cluster with a smaller number of classes. Considering that any inaccurate cluster splitting will influence the accuracy of the subsequent cluster splitting along the tree path, we set the criteria for choosing the attribute to split a cluster as minimizing the product of the average number of classes and the variance of the number of classes in the resulting child clusters. This criteria implies that an attribute which results in more child clusters and a smaller variation in the number of classes in the child clusters is preferred. For example, for the separation matrices of the two attributes shown in Fig. 7(a) and (b), suppose that we use the attribute of Fig. 7(a) to split the cluster first; we obtain three child clusters, one consisting of one class and the other two consisting of three and four classes. If we use the attribute of Fig. 7(b) first, we will obtain four child clusters, each containing two classes. Based on the criteria indicated above, we would choose the attribute of Fig. 7(b) to split the cluster. To put this criteria into mathematical form, we let $m_k$ and $n_i(\mathrm{Cr})$ denote the number of child clusters and the number of classes in the $i$th child cluster resulting from using attribute $p_k$ to split the cluster Cr, respectively. Then, the criteria for choosing the attribute for splitting Cr is

$$k^{*} = \arg\min_{k}\left\{\left[\frac{1}{m_k}\sum_{i=1}^{m_k} n_i(\mathrm{Cr})\right]\left[\frac{1}{m_k}\sum_{i=1}^{m_k}\big(n_i(\mathrm{Cr})-\bar{n}(\mathrm{Cr})\big)^{2}\right]\right\} \tag{3}$$

where the first term inside the big bracket represents the average number of classes in the resulting child clusters and the second term denotes the variance of the number of classes in the resulting child clusters, with $\bar{n}(\mathrm{Cr}) = (1/m_k)\sum_{i=1}^{m_k} n_i(\mathrm{Cr})$.
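Reading criteria (3) as the product of the average and the variance of the child-cluster class counts, as described in the text, a minimal Python sketch of the attribute choice follows; the scoring function and the attribute names are illustrative assumptions.

```python
def splitting_score(child_cluster_sizes):
    """Criteria (3) read as: product of the average number of classes per
    child cluster and the variance of those counts; smaller is better."""
    m = len(child_cluster_sizes)
    avg = sum(child_cluster_sizes) / m
    var = sum((n - avg) ** 2 for n in child_cluster_sizes) / m
    return avg * var

def choose_splitting_attribute(candidate_splits):
    """candidate_splits: dict attribute -> list of class counts of the child
    clusters obtained when that attribute is used to split the cluster."""
    return min(candidate_splits, key=lambda k: splitting_score(candidate_splits[k]))

# The worked example from the text: one attribute yields child clusters of
# sizes 1, 3, and 4; the other yields four child clusters of two classes
# each and is preferred.
print(choose_splitting_attribute({'p_a': [1, 3, 4], 'p_b': [2, 2, 2, 2]}))  # -> 'p_b'
```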

Now, our algorithm for choosing the splitting attribute to build the clustering tree can be stated as follows.

Algorithm I: Choose the splitting attributes and build the clustering tree

Step 0) Given the original data set Cr0 and the separation matrices of all attributes. Set Cr0 as the root cluster and define the set of yet-to-be-split clusters (YSC) as YSC = {Cr0}.

Step 1) For each cluster in YSC, obtain the corresponding splitting attribute based on criteria (3). Use the obtained attribute to split the cluster, and put the resulting child clusters into YSC. Discard the clusters that have been split and the clusters that cannot be split using any attribute.

Step 2) If YSC = ∅, stop; otherwise, return to Step 1).

Fig. 8 shows a clustering tree built by using the separation matrices of the two attributes shown in Figs. 5(a) and 6(a) to split the root cluster Cr0. Algorithm I

uses three iterations to build the tree. The splitting attribute for each cluster and the progression of YSC are also shown in this figure.

We define a leaf cluster of the clustering tree as a terminal cluster (TC). Each TC may contain only one class or several classes that cannot be split further using any attribute. For the purpose of classifying a new data pattern into a TC, we need to use the splitting attributes to construct the cluster splitting rules for each cluster in the clustering tree based on fuzzy rules [33], [34] for a single attribute, as presented in the following section. It should be noted that the fuzzy rules employed here are for a single attribute; thus, we can circumvent the computational complexity of the fuzzy set theory-based approach indicated in Section II-A.

4) Clustering Algorithm: The separation matrix-based clustering algorithm consists of two parts: the training part and the classification part. The training part, which prepares for the classification part, consists of three steps: 1) construction of the separation matrices for all attributes; 2) determination of the cluster-splitting attributes and building of the clustering tree; and 3) generation, throughout the clustering tree, of the fuzzy if–then rules needed to classify a data pattern into a proper child cluster, based on a given set of training data patterns with known TCs. Of the three steps, 1) and 2) have been presented in the previous sections. The details of 3) as well as the classification part are described as follows.

a) Fuzzy rule generation procedures of the clustering algorithm: The fuzzy rules for splitting a non-TC cluster using the corresponding splitting attribute are of the same type throughout our clustering algorithm. Thus, for the sake of explanation, we will focus on generating the fuzzy rules for one cluster in the clustering tree. We let Cr denote a non-TC cluster and $p_k$ denote the corresponding splitting attribute.


Fig. 8. Example of using Algorithm I to build clustering tree.

We let $x_{tk}$, $t = 1, \ldots, T$, denote the $k$th attribute of the training data patterns $x_t$, $t = 1, \ldots, T$, from the known child clusters $\mathrm{CCr}_1, \ldots, \mathrm{CCr}_M$. These data patterns form the training data set for splitting Cr. The fuzzy rules for splitting a cluster Cr are of the following type.

For $j = 1, \ldots, J$, where $J$ denotes the number of fuzzy partitioned intervals on the range of the $k$th attribute values,

Rule $R_j^{\mathrm{Cr}}$: If $x_k$ is $A_j$, then $x$ belongs to $\mathrm{CCr}_{q_j}$ with $CF_j$, where $A_j$ is the $j$th partitioned fuzzy interval, $\mathrm{CCr}_{q_j}$ is the consequent, i.e., one of the child clusters, and $CF_j$ is the grade of certainty of rule $R_j^{\mathrm{Cr}}$.   (4)

What needs to be determined in the previous rule are $\mathrm{CCr}_{q_j}$ and $CF_j$, and the procedures for determining them are called the fuzzy rule generation procedures for splitting one cluster, described as follows.

Let $A_j$ be characterized by the nonnegative fuzzy membership function $\mu_{A_j}(\cdot)$. The membership function can be triangular, Gaussian, or any other shape. In this paper, we consider the triangular membership function. Then, $\mu_{A_j}(x_{tk})$ can be considered as the grade of compatibility of $x_{tk}$ with respect to $A_j$. We define

$$\beta_{\mathrm{CCr}_m}(A_j) = \sum_{x_t \in \mathrm{CCr}_m} \mu_{A_j}(x_{tk}) \tag{5}$$

as the sum of the grades of compatibility of child cluster $\mathrm{CCr}_m$ with respect to $A_j$. Then, the algorithm for generating the fuzzy rules for splitting cluster Cr can be stated as follows (see also the sketch after Step 4)).

Algorithm II: Generation of the fuzzy rules for splitting cluster Cr

Step 0) Given the training data patterns $x_t$, $t = 1, \ldots, T$, with known child clusters $\mathrm{CCr}_1, \ldots, \mathrm{CCr}_M$ of the to-be-split cluster Cr and the corresponding splitting attribute $p_k$. Set $j = 1$.

Step 1) Calculate the sum of the grades of compatibility $\beta_{\mathrm{CCr}_m}(A_j)$ of child cluster $\mathrm{CCr}_m$, $m = 1, \ldots, M$, with respect to $A_j$ by (5).

Step 2) Find the child cluster $\mathrm{CCr}_{q_j}$ such that

$$\beta_{\mathrm{CCr}_{q_j}}(A_j) = \max_{m = 1, \ldots, M} \beta_{\mathrm{CCr}_m}(A_j) \tag{6}$$

then $\mathrm{CCr}_{q_j}$ is the consequent in rule $R_j^{\mathrm{Cr}}$.

Step 3) Determine $CF_j$, the grade of certainty of rule $R_j^{\mathrm{Cr}}$, by

$$CF_j = \frac{\beta_{\mathrm{CCr}_{q_j}}(A_j) - \bar{\beta}(A_j)}{\sum_{m=1}^{M} \beta_{\mathrm{CCr}_m}(A_j)} \tag{7}$$

where $\bar{\beta}(A_j) = \frac{1}{M-1}\sum_{m \ne q_j} \beta_{\mathrm{CCr}_m}(A_j)$ denotes the average of the sums of grades of compatibility of the rest of the child clusters with respect to $A_j$.

Step 4) If $j = J$, stop; else, set $j = j + 1$ and return to Step 1).
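A minimal sketch of the fuzzy rule generation follows, assuming triangular membership functions and the forms of (6) and (7) reconstructed above; the helper names, the toy data, and the fuzzy intervals are illustrative assumptions.

```python
def triangular(x, left, center, right):
    """Triangular membership function on [left, right] with peak at center."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)

def generate_fuzzy_rules(samples, intervals):
    """Algorithm II sketch.

    samples   : list of (attribute_value, child_cluster_label)
    intervals : list of (left, center, right) triangular fuzzy intervals

    Returns one rule per interval: (consequent child cluster, grade of
    certainty CF), following (5)-(7).
    """
    clusters = sorted({label for _, label in samples})
    rules = []
    for (l, c, r) in intervals:
        # Eq. (5): sum of the grades of compatibility per child cluster.
        beta = {m: sum(triangular(x, l, c, r) for x, lab in samples if lab == m)
                for m in clusters}
        consequent = max(beta, key=beta.get)           # Eq. (6)
        total = sum(beta.values())
        if total == 0:
            rules.append((None, 0.0))                  # interval covers no data
            continue
        others = [beta[m] for m in clusters if m != consequent]
        beta_bar = sum(others) / len(others) if others else 0.0
        cf = (beta[consequent] - beta_bar) / total     # Eq. (7)
        rules.append((consequent, cf))
    return rules

# Toy data: child cluster 'A' concentrates near 1.0, 'B' near 3.0.
data = [(0.9, 'A'), (1.1, 'A'), (1.0, 'A'), (2.9, 'B'), (3.1, 'B')]
fuzzy_intervals = [(0.0, 1.0, 2.0), (1.0, 2.0, 3.0), (2.0, 3.0, 4.0)]
print(generate_fuzzy_rules(data, fuzzy_intervals))
```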

b) Training part of the clustering algorithm: Combining the construction of the separation matrices, the determination of the splitting attributes, the building of the clustering tree, and the fuzzy rule generation procedures, we are ready to summarize the training procedures of the clustering algorithm using the training data set.

Algorithm III: Training procedures of the clustering algorithm

Step 0) Given a set of training data patterns with known classes; compute $\mu_{ik}$ and $\sigma_{ik}$ of each class $C_i$ and each attribute $p_k$; compute the separation matrices based on (2) for each attribute $p_k$.

Step 1) Apply Algorithm I to obtain the splitting attributes and build the clustering tree.

Step 2) Use Algorithm II to generate the fuzzy rules for each cluster in the clustering tree.


c) Classification part of clustering algorithm: Once the fuzzy rules for splitting the clusters in the clustering tree are generated, we can determine the child cluster to which the new data pattern belongs at each cluster based on a fuzzy reasoning method.

Let the new data pattern be $x$ and let $x_k$ be the $k$th attribute of $x$ corresponding to the splitting attribute at cluster Cr. We define $W_{\mathrm{CCr}_m}(x)$, the weighting grade of certainty of $x$ with respect to the child cluster $\mathrm{CCr}_m$, as the sum of the products of the grade of compatibility of $x_k$ with respect to $A_j$ and the grade of certainty of rule $R_j^{\mathrm{Cr}}$, taken over all trained rules whose consequent is $\mathrm{CCr}_m$. We can express $W_{\mathrm{CCr}_m}(x)$ mathematically as

$$W_{\mathrm{CCr}_m}(x) = \sum_{j:\, \mathrm{CCr}_{q_j} = \mathrm{CCr}_m} \mu_{A_j}(x_k)\, CF_j .$$

Then, the classification procedures for the new data pattern can be stated as follows.

Classification Procedures: The child cluster $\mathrm{CCr}_{m^*}$, with respect to which the weighting grade of certainty of $x$ is maximum, is the concluded cluster of $x$, that is, $\mathrm{CCr}_{m^*} = \arg\max_{m} W_{\mathrm{CCr}_m}(x)$.

Now, the procedures for classifying a new data pattern into a TC can be stated; a small sketch of the descent follows Algorithm IV.

Algorithm IV: Classification procedures of the clustering algorithm

Step 0) Given a new data pattern $x = (x_1, \ldots, x_K)$, where $K$ denotes the total number of attributes; set the present cluster PCr = Cr0.

Step 1) Use $x_k$, where $p_k$ corresponds to the attribute used for splitting PCr, and the classification procedures stated above to classify $x$ into a child cluster of PCr; we denote this child cluster by CPCr. If CPCr is not a TC, set PCr = CPCr and repeat this step; otherwise, stop.
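A minimal sketch of the descent in Algorithm IV follows, with the per-cluster fuzzy reasoning step abstracted as a callable; the hand-built tree and the threshold-style stand-in for the fuzzy step are assumptions for illustration only.

```python
def classify_to_terminal_cluster(pattern, tree):
    """Algorithm IV sketch: descend the clustering tree until a terminal
    cluster (TC) is reached.

    pattern : dict attribute name -> value
    tree    : each non-TC node is a dict with the splitting 'attribute', a
              'classify' function implementing the fuzzy reasoning step
              (value -> child name), and a 'children' dict; a TC node is a
              dict with 'terminal': True and a 'name'.
    """
    node = tree
    while not node.get('terminal', False):
        value = pattern[node['attribute']]
        node = node['children'][node['classify'](value)]
    return node['name']

# Minimal hand-built tree: attribute 'p3' separates TC1 (low values) from
# TC2 (high values) at the root; the threshold-style classify function
# stands in for the fuzzy reasoning step.
toy_tree = {
    'attribute': 'p3',
    'classify': lambda v: 'low' if v < 5.0 else 'high',
    'children': {'low':  {'terminal': True, 'name': 'TC1'},
                 'high': {'terminal': True, 'name': 'TC2'}},
}
print(classify_to_terminal_cluster({'p3': 2.7}, toy_tree))   # -> TC1
print(classify_to_terminal_cluster({'p3': 8.1}, toy_tree))   # -> TC2
```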

C. CART for TC

The TCs resulting from the training part of the separation matrix-based clustering algorithm may consist of one or more classes. Since the number of classes and the size of the corresponding data set in each TC should be much smaller than those of Cr0, it is computationally much easier to apply CART to classify the TCs, and the resulting tree size for each TC will be much smaller. Therefore, our clustering algorithm helps reduce the computational complexity and the tree size of CART compared with applying CART to Cr0 alone.

CART is a well-developed classification tool, and the details of this classification technique can be found in [17], [18]. Similar to the proposed clustering algorithm, CART also consists of training and classification parts. The training part of CART builds a classification tree and the splitting rules in each node. In brief, the construction of a CART classification tree and its splitting rules centers on three major elements: 1) the splitting rule; 2) the goodness-of-split criteria; and 3) the criteria for choosing an optimal or final tree for analysis. Regarding 1), there are three major splitting rules in CART. The one we employed here is the Gini criteria [17]. This criteria starts the tree-building process by partitioning the TC into binary nodes based upon a very simple question of the form “is $x_k \le c$?”, where $x_k$ is an attribute and $c$ is a real number. Regarding 2), CART uses a computation-intensive algorithm that searches all possible split points of each attribute for the split that decreases the Gini impurity measure [17] the most. CART recursively applies this splitting rule to split nonterminal child nodes at each successive stage. In order to reduce the complexity of the built tree, which is measured by the number of its terminal nodes, CART uses a pruning process to find an optimal tree, as pointed out in 3). The computational complexity of the training part of CART lies mainly in the exhaustive search for the best split required in 2). Once the classification tree and the splitting rules are obtained, the classification procedure of CART simply asks the question “is $x_k \le c$?” at each node to determine which of the binary child nodes the new data pattern belongs to throughout the classification tree.
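The exhaustive best-split search with the Gini impurity measure can be sketched as follows; this is a generic single-attribute illustration under the standard definition of Gini impurity, not the implementation used in the paper.

```python
def gini_impurity(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Exhaustive search for the split 'is x <= c?' that most decreases the
    weighted Gini impurity, as in the CART training part."""
    order = sorted(zip(values, labels))
    n, best = len(order), (None, float('inf'))
    for i in range(1, n):
        c = (order[i - 1][0] + order[i][0]) / 2.0       # candidate split point
        left = [y for _, y in order[:i]]
        right = [y for _, y in order[i:]]
        score = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
        if score < best[1]:
            best = (c, score)
    return best

# Toy example: one attribute, two recipes that separate cleanly near 5.
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = ['r6', 'r6', 'r6', 'r7', 'r7', 'r7']
print(best_split(xs, ys))   # -> (5.0, 0.0)
```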

D. Classification of New Data Pattern

Once the training part of the HCT, which combines the training parts of the clustering algorithm and CART, is completed, we are ready to use the classification procedures of both the clustering algorithm and CART to classify a new data pattern, as required in the first two blocks in Fig. 2.

E. Learning Capability

The learning capability of a classifier is very important in the current application, because every 14 min, 24 wafers (or one lot) of the same recipe are ion implanted. Thus, new data patterns arrive with a high frequency. For the sake of explanation, we can assume the recipe of the working wafer is one of the recipes already under work, because only a slight modification is needed for the case of a new recipe. The learning of HCT after a new data pattern joins in consists of two parts: the first part is for the clustering algorithm, and the second part is for CART. Learning of the clustering algorithm consists of three updating steps: 1) updating the separation matrices; 2) updating the attributes used to split clusters as well as the clustering tree; and 3) updating the fuzzy rules for splitting clusters. Learning of CART simply updates the best split for each node in the classification tree.

1) Learning of Clustering Algorithm: Since the new lot of wafers is of the same recipe, the new data patterns will be used to update the mean and variance of each attribute of the corresponding class. Denoting the class index of the new data patterns by $c$, we will update $\mu_{ck}$ and $\sigma_{ck}$ for all attributes $p_k$, which will be used to update the separation indexes $D_k(C_c, C_j)$ for all $j$ and all $k$. Suppose the updated $D_k(C_c, C_j)$ do not change for all $j$ and all $k$; then the separation matrices remain the same; consequently, the splitting attributes for the clusters and the clustering tree also remain the same, as can be observed from Algorithm I. This implies that if the separation matrices are unchanged after the new data pattern joins in, updating step 2) can be skipped. In fact, $\mu_{ck}$ and $\sigma_{ck}$ will change only slightly when the new data patterns join in because of the large amount of training data. This implies that the updated separation matrices may change only when the amount of accumulated new data patterns is large enough. On the other hand, suppose $D_k(C_c, C_j)$ changes for some $j$ and $k$ and causes the corresponding separation matrices to change in updating step 1); we then need to proceed with updating step 2) by performing Algorithm I (i.e., Step 1 of Algorithm III) to update the splitting attributes and the clustering tree.


To update the fuzzy rules indicated in updating step 3), we also consider two cases. In the case of unchanged separation matrices, which implies the clustering tree and splitting attributes remain the same, we only need to update the fuzzy rules for the clusters in the tree path of the clustering tree to which the new data pattern belongs. To do so, we let Cr be a non-TC cluster in this tree path and let CCr be the child cluster of Cr in this tree path. To update the rules $R_j^{\mathrm{Cr}}$ in (4) is to update the consequent and the grade of certainty after the new data patterns join in. To update the consequent, we need to update $\beta_{\mathrm{CCr}}(A_j)$ first. To do so, we need to add an extra term, the nonnegative membership function value of the new data pattern, on the right-hand side of (5). The updated $\beta_{\mathrm{CCr}}(A_j)$ will be larger than the original one. Thus, according to Step 2) of Algorithm II, the consequent will not be changed. Subsequently, we can use the updated $\beta_{\mathrm{CCr}}(A_j)$ to update the corresponding $CF_j$ by (7). Thus, in this case, updating the fuzzy rules is an easy task because the length of a tree path in the clustering tree is usually short. In the case where the clustering tree or any splitting attributes change due to the changed separation matrices, we need to perform Step 2) of Algorithm III, which is Algorithm II, to update the fuzzy rules. Of course, this is more complicated than the previous case. However, in either case, the update will not prevent HCT from working in real time and online, as will be demonstrated in Section IV.

2) Learning of CART: Following the previous discussion, there are also two cases for updating the splitting rules of CART. The first case is the consequence of unchanged separation matrices, so that the TCs of the clustering tree do not change. Since the number of training data patterns is very large, the best split point of each attribute in each node of the CART will alter, at most, slightly when new data patterns join in. Therefore, we need not exhaustively search for the split point of each attribute. Instead, we can search for the split point only within a window around the original best split point of each attribute. The window is set to be $W$ discrete points around the original best split point of each attribute. This, of course, saves a lot of computation time. In addition, we need only update the splitting rules for the one TC to which the new data pattern belongs. The other case is when the separation matrices change and cause the clustering tree to change. In this case, we will rerun CART for all TCs. As indicated at the end of the previous section, this will not prevent HCT from working in real time and online.
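A sketch of the windowed re-search around the previous best split point described above follows; the Gini helper, the window of five candidate positions, and the toy data are illustrative assumptions.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def update_best_split(values, labels, previous_split, window=5):
    """Windowed re-search of the best split point: only candidate split
    points within `window` positions of the previous best split are
    examined, instead of the exhaustive search used when the tree is
    first built."""
    order = sorted(zip(values, labels))
    n = len(order)
    # index of the boundary closest to the previous best split point
    centre = min(range(1, n),
                 key=lambda i: abs((order[i - 1][0] + order[i][0]) / 2 - previous_split))
    best_point, best_score = previous_split, float('inf')
    for i in range(max(1, centre - window), min(n, centre + window + 1)):
        point = (order[i - 1][0] + order[i][0]) / 2
        left, right = [y for _, y in order[:i]], [y for _, y in order[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_point, best_score = point, score
    return best_point

# After a few new data patterns join in, the split is only nudged locally.
xs = [1.0, 2.0, 3.0, 3.4, 7.0, 8.0, 9.0]
ys = ['r6', 'r6', 'r6', 'r6', 'r7', 'r7', 'r7']
print(update_best_split(xs, ys, previous_split=5.0))   # -> 5.2
```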

III. WARNING SIGNAL GENERATION AND FAULT ISOLATION

A. Warning Signal Generation

In general, the ion implanter will be stopped whenever there is a warning signal so as not to damage the subsequent wafers. However, this reaction is justified only when the warning signal is absolutely correct; otherwise, the throughput will be degraded. Thus, minimizing the probability of false alarms should be one objective. On the other hand, thousands of wafers may be damaged if a fault is not detected. Thus, minimizing the probability of overlooking a fault is another objective. In general, a matched classification result implies that 1) the machine is in normal condition or 2) the actual implantation has been wrong due to a machine fault but a misclassification makes the classified recipe match the destined one. Case 2) indicates a fault situation that cannot be observed from the matched result. We let $m_i$ denote the misclassification rate of recipe $i$, which can be calculated by

$$m_i = \sum_{j \ne i} P_j\, e_{ji} \tag{8}$$

where $P_j$ denotes the prior probability of recipe $j$ and $e_{ji}$ denotes the misclassification rate of classifying recipe $j$ to be recipe $i$. If Case 2) occurs to recipe $i$, then the probability of a series of $n$ such events occurring is $m_i^{\,n}$. This indicates that the probability of an undetected machine fault will be extremely small, provided that $m_i$ is small and $n$ is large. This also implies that the matched recipe will eventually mismatch, provided that the matched result is due to a misclassification. Real values of $m_i$ for all recipes based on the HCT will be given through the tests presented in Section IV. This addresses the comment made in Section I that we need not check the existence of a machine fault when the classified recipe matches the destined one, and the cost of such a reaction is at most $n$ damaged wafers, where $n$ is a positive integer that makes $m_i^{\,n}$ extremely small. This also indicates that when the classified recipe matches the destined one, we can continue to the next wafer, as shown in Fig. 2. Thus, using the classification accuracy of the HCT as the basis for generating a warning signal, our objective can be simplified to minimizing the probability of a false alarm.

There are two causes of false alarms. One is the electrical spike and the other is the classification error. Both will cause the classified recipe to mismatch the destined one and require checking of the warning signal generation criteria, as indicated in Fig. 2. To minimize the probability of a false alarm due to an electrical spike, we should distinguish an electrical spike from a machine fault. The electrical spike is only temporary, which may affect one or two wafers only, while the machine fault will last until it is fixed. Thus, an easy way to distinguish them is to check whether a series of classification errors occurs. In other words, if there are more than, say, four consecutive classification errors, the cause of the errors should not be electrical spikes. Similar reasoning applies to classification errors. We let $e_i$ denote the classification error rate of recipe $i$, defined as (number of misclassified wafers/number of test wafers) × 100% obtained using k-fold cross-validation. Then, the probability of the occurrence of $n$ consecutive classification errors is $e_i^{\,n}$, which decreases sharply as $n$ increases. Thus, an easy way to distinguish the classification error from the machine fault is also to check whether a series of classification errors occurs. To achieve this, we can predetermine a very small positive real number $\epsilon$, a probability indicating an event that is almost impossible to occur. Then, if $e_i^{\,n} \le \epsilon$, we can conclude that the cause of the mismatched recipe is not classification errors. Thus, we can state our warning signal generation criteria as follows.

Let the classification error rate of recipe $i$ obtained using k-fold cross-validation be denoted by $e_i$, and let $n$ denote the number of consecutive mismatched working wafers; then, the proposed criteria for generating a warning signal are as follows.


Assume the classified recipe of the $t$th wafer matches the destined one, while the $(t+1)$th through $(t+n)$th wafers do not; the warning signal will be generated at the $(t+n)$th wafer provided that the following two conditions hold:

$$\prod_{j=1}^{n} e_{r_{t+j}} \le \epsilon \tag{9}$$

and

$$n > n_s \tag{10}$$

where $r_{t+j}$ denotes the destined recipe of the $(t+j)$th wafer, $e_{r_{t+j}}$ denotes the classification error rate of that destined recipe, $\epsilon$ is a very small positive real number, and $n_s$ denotes the maximum number of consecutive wafers that can be affected by electrical spikes. If (9) holds, we can exclude the possibility of a false alarm due to classification errors. If (10) holds, we can exclude the possibility of a false alarm due to electrical spikes.
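A sketch of checking criteria (9) and (10) over the trailing run of mismatched wafers follows; the parameter values `eps` and `n_s`, and the product form of (9) used here, are illustrative assumptions.

```python
def should_raise_warning(error_rates, destined_recipes, classified_recipes,
                         eps=1e-6, n_s=2):
    """Check criteria (9) and (10) on the trailing run of mismatched wafers.

    error_rates        : dict recipe -> k-fold cross-validation error rate e_i
    destined_recipes   : destined recipe of each wafer, oldest first
    classified_recipes : recipe concluded by the classifier for each wafer
    eps                : probability below which consecutive classification
                         errors are considered practically impossible (9)
    n_s                : maximum number of consecutive wafers that electrical
                         spikes can affect (10)
    """
    # length n and error probability of the current trailing run of mismatches
    n, prob = 0, 1.0
    for destined, classified in zip(reversed(destined_recipes),
                                    reversed(classified_recipes)):
        if destined == classified:
            break
        n += 1
        prob *= error_rates[destined]
    return n > n_s and prob <= eps        # (10) and (9)

# Toy run: recipe 'r39' has a 0.3% error rate; five mismatches in a row are
# implausible as classification errors or a spike, so a warning fires.
rates = {'r39': 0.003}
destined = ['r39'] * 10
classified = ['r39'] * 5 + ['r24'] * 5
print(should_raise_warning(rates, destined, classified))   # -> True
```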

B. Fault Isolation

To eliminate the machine fault, we need to isolate the fault first. In general, when there is a fault in a subsystem, the attribute (or attributes) corresponding to that subsystem may become abnormal. Thus, the basic idea of our fault isolation scheme is to find the attribute(s) that causes the classification errors, and this can be easily done in a single-tree classifier like CART and HCT, whose biggest advantage is interpretability. In fact, the tree structure of the HCT is much simpler than that of CART, because it largely reduces the tree size of CART by using the clustering tree to separate the whole data set into several TCs. Thus, if the misclassified recipe and the destined recipe belong to different TCs, we can use the clustering tree to find the faulty attribute. If they belong to the same TC, we will use the corresponding CART to find the faulty attribute. Considering that a machine fault may occur abruptly or develop gradually, and that there may be single or multiple faulty attributes, we will find the faulty attribute(s) for each misclassified wafer with the aid of its tree path and the tree paths of several of the latest correctly classified wafers of the same destined recipe. Thus, once a warning signal is generated, our fault isolation scheme proceeds as follows.

Step 1) Collect the $n$ consecutive misclassified wafers that cause the warning signal, i.e., the wafers for which (9) and (10) hold.

Step 2) Collect the latest $n$ correctly classified wafers that have the same destined recipe as the wafers in Step 1).

Step 3) For each of the wafers in Step 1) and each of the wafers in Step 2), we will find the faulty attribute(s) that causes the misclassification as follows.

3.1) Suppose the two wafers belong to different TCs; we will use the clustering tree to find the faulty attribute by tracing the tree paths backward from the corresponding TCs. These two paths will meet at a node whose splitting attribute will be the

Fig. 9. Using clustering tree to find faulty attribute.

faulty attribute, as illustrated in Fig. 9; a small sketch of this back-tracing follows Step 4).

3.2) Suppose the two wafers belong to the same TC and they lie in two different terminal nodes of the corresponding CART; then we can find the faulty attribute in a manner similar to Step 3.1) by using the classification tree of CART.

Step 4) List all the different faulty attributes found in the searches of Step 3) and calculate the corresponding probabilities based on their frequencies of occurrence. Indicate the corresponding subsystem of the faulty attributes and calculate the corresponding probability by adding the probabilities of the faulty attributes in this subsystem.
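A sketch of the path back-tracing in Step 3.1) follows, assuming the clustering tree is stored as a parent map with the splitting attribute recorded at each non-TC node; the toy tree, the cluster names, and the attribute label are illustrative.

```python
def isolate_faulty_attribute(parent, split_attr, tc_misclassified, tc_correct):
    """Trace the clustering-tree paths of the misclassified TC and of the
    correctly classified TC back toward the root; the splitting attribute
    of the node where the two paths first meet is reported as faulty.

    parent     : dict child cluster -> parent cluster (root maps to None)
    split_attr : dict non-TC cluster -> attribute used to split that cluster
    """
    def path_to_root(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path

    ancestors = set(path_to_root(tc_correct))
    for node in path_to_root(tc_misclassified):
        if node in ancestors:                 # first common node of both paths
            return split_attr[node]
    return None

# Toy clustering tree: Cr0 splits on 'p8' into Cr1 and TC_b; Cr1 splits on
# 'p4' into TC_a and TC_c.  A wafer destined for TC_a but classified into
# TC_b implicates the attribute splitting their meeting node, 'p8'.
parent = {'Cr0': None, 'Cr1': 'Cr0', 'TC_b': 'Cr0', 'TC_a': 'Cr1', 'TC_c': 'Cr1'}
split_attr = {'Cr0': 'p8', 'Cr1': 'p4'}
print(isolate_faulty_attribute(parent, split_attr, 'TC_b', 'TC_a'))   # -> 'p8'
```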

IV. TEST RESULTS OF HCT, WARNING SIGNAL GENERATION, AND FAULT ISOLATION

A. Test Results of HCT

In general, there are quite a few attributes that can be measured from the ion implanter; however, not all attributes are helpful for classification. According to domain knowledge, the following 12 attributes, $p_1$ through $p_{12}$, are recommended: filament voltage, filament current, discharge voltage, discharge current, extraction electrode voltage, extraction electrode current, acceleration/deceleration voltage, magnetic field strength, high-voltage power supply current, beam current, beam-line pressure, and chamber pressure, respectively. These 12 attributes cover the four subsystems of the ion implanter. Table I shows the units and related subsystems of the above 12 attributes. We have made all the tests on a 26-recipe case and a 42-recipe case. Due to the page limitation, we will present only the more complicated 42-recipe case.

A data set of the 42-recipe case, in which each recipe consists of a thousand to 10 000 wafers, was supplied by a local world-renowned foundry. We use it to test the classification accuracy of the proposed HCT classifier and to demonstrate the validity of the warning signal generation criteria and the fault isolation scheme. It takes 1 s to measure a 12-attribute data pattern. The ion implantation time for a wafer is around 10 s. Thus, ten data patterns are taken while a wafer is under work. The wafer changeover time is 26 s on average. Each lot contains 24 wafers, and the setup time for a new lot is 13 min. We randomly divide all the measured data patterns in this case, on a wafer basis, into ten parts. We take nine parts as the training data set and


TABLE I

UNITS AND RELATED SUBSYSTEMS OF THE 12 ATTRIBUTES

Fig. 10. Cluster splitting tree of 42-recipe case.

one part as the test data set. We set the values of the threshold $\hat{p}$ in (2) and the number of fuzzy partitioned intervals, and use a triangular nonnegative membership function for $A_j$ in Algorithm II. Applying Algorithm III to the training data set, the resulting clustering tree and the splitting attributes are shown in Fig. 10, where each cluster is denoted by a block and the recipes contained in a cluster are shown inside the parentheses in each block. The attribute used for splitting each cluster is indicated at the outgoing branch in the clustering tree. The corresponding fuzzy rules for each splitting attribute are also obtained. There are five TCs, and each TC consists of more than one recipe except for the one consisting of recipe 23 only. Subsequently, we apply CART to the remaining four TCs and build the classification tree and splitting rules for each TC. We then use the test data part to test the trained HCT using Algorithm IV of the clustering algorithm and the classification tree and splitting rules of CART. Since each wafer corresponds to ten measured data patterns, and each test data pattern will be classified to a recipe, a majority voting scheme is used to conclude the classified recipe of the wafer corresponding to the ten test data patterns. Repeating this process ten times by circulating the training data set and test data set, Table II shows the resulting tenfold cross-validation classification error rate of all recipes in this test. We also indicate the tenfold cross-validation classification error rates obtained using the software See5 [35] and CART [17] in this table. From this table, we can calculate that the sum of classification error rates of the proposed HCT is around 0.2955%, while the sums of classification error rates using See5 and CART are 0.53% and 0.6427%, which are 80% and 117% more than that of HCT, respectively. Thus, HCT obtains a very successful classification result.
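The majority voting over the ten data patterns of one wafer can be sketched as follows; the recipe numbers in the example are illustrative only.

```python
from collections import Counter

def wafer_recipe_by_majority(pattern_level_predictions):
    """Conclude the classified recipe of one wafer from the recipes assigned
    to its (roughly ten) measured data patterns by majority voting."""
    return Counter(pattern_level_predictions).most_common(1)[0][0]

# Example: nine of the ten data patterns of a wafer are classified as recipe
# 14 and one as recipe 20; the wafer is concluded to be recipe 14.
print(wafer_recipe_by_majority([14] * 9 + [20]))   # -> 14
```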

Remark: From the test results shown in Table II, we see that the superiority of HCT over CART and See5 is mostly due to the zero classification errors of recipe 14. What causes the classification errors of recipe 14 in CART or See5 is the overlapping of the attribute data of recipes 14 and 20. Thus, some test data patterns of recipe 14 may be classified as recipe 20 by CART or See5. Fortunately, in HCT, recipes 14 and 20 have been classified into different TCs, as can be observed from Fig. 10. This drastically reduces the possibility of classifying recipe 14 as recipe 20. However, in HCT, recipe 20 may still be classified as recipe 14, which can also be observed from the misclassification rates in Table III. Excluding recipe 14 from the data set, we repeat the complete training and test process, and the results show that the sums of classification error rates of HCT, CART, and See5 are 0.173, 0.248, and 0.182, respectively. Indeed, the three sums of classification errors are closer; however, HCT is still the best among them. Furthermore, we also apply the three classifiers to the 26-recipe case mentioned at the beginning of this section, and the sums of classification error rates of HCT, CART, and See5 are 0.225, 0.577, and 0.405, respectively. For the 42-recipe case, we also obtain the misclassification rates defined in (8) for the three classifiers, as shown in Table III.

The largest misclassification rate of HCT is 0.0043%, and the sum of misclassification rates is around 0.00737%. Thus, even for a small number $n$ of consecutive wafers, $m_i^{\,n}$ is already extremely small. This demonstrates the analysis stated in Section III for the validity of the no-fault assumption, which states that if a machine fault exists, the classified recipe will eventually mismatch the destined one. Compared with CART and See5, the sum of misclassification rates of HCT is better, and this is consistent with the results of the classification error rates shown in Table II. To investigate the training efficiency and the capability of real-time classification of HCT, as well as the effects of different values of $\hat{p}$, we have applied HCT to the 42-recipe case with three other values of $\hat{p}$. The resulting tenfold cross-validation sums of classification error rates, the corresponding average training times, and the classification time for classifying the recipe of a new data pattern are shown in Table IV. From the fourth row of this table, we can observe that when $\hat{p}$ is properly chosen, the tenfold cross-validation sum of classification error rates of HCT is better than that of See5 and CART. From the second row of the table, we see that, for properly chosen $\hat{p}$, the training time required by HCT is much shorter than that required by CART and See5. The classification time needed for classifying a new data pattern is much shorter than that of See5 and CART for all the indicated values of $\hat{p}$; in addition, it is also much shorter than the data measurement time, which takes 1 s; thus, HCT can work in real


TABLE II

CLASSIFICATION ERROR RATE OF THE 42-RECIPE CASE

TABLE III

MISCLASSIFICATION RATE OF THE 42-RECIPE CASE

TABLE IV

TRAINING TIME, CLASSIFICATION TIME, AND TENFOLD CROSS-VALIDATION SUM OF CLASSIFICATION ERROR RATES FOR DIFFERENT VALUES OF $\hat{p}$

time. This shows that HCT not only performs better than See5 and CART in terms of the tenfold cross-validation sum of classification error rates but also consumes less training time and classification time when $\hat{p}$ is properly chosen. In the meantime, we found that as $\hat{p}$ increases, the HCT becomes less accurate and less computationally time-consuming, as expected. This also demonstrates why the clustering algorithm helps reduce the computational complexity of CART.

B. Test Results of Learning Capability of HCT

We also test the learning capability of the proposed HCT by adding the new data patterns to the training data set. We found that when the accumulated amount of new data patterns is less than 7%, on average, of the amount of training data of the same

TABLE V

UPDATING TIME OF HCT WHEN SEPARATION MATRICES CHANGE FOR DIFFERENT $\hat{p}$

recipe, the updated separation matrices remain the same. The length of the window in updating the splitting rules of CART, $W$, is set to be 5. In the case of unchanged separation matrices, the computation time for checking whether there is any change in the separation matrices, updating the fuzzy rules of the clustering


Fig. 11. Classification tree of CART for the TC containing recipes 6 and 7.

algorithm, and updating the splitting rules of CART take only 0.1637 s for each new data pattern for the chosen value of $\hat{p}$. This updating time is shorter than the time needed to measure a new data pattern; thus, we can perform the update online. In the case where the separation matrices change, updating the separation matrices and rerunning Steps 1) and 2) of Algorithm III and the training part of CART for the resulting TCs take only 21.741 s, which is even shorter than the wafer changeover time of 26 s. For different values of $\hat{p}$, the updating times of HCT when the separation matrices change are shown in Table V. From the results, we see that we can update the training part of HCT during the wafer changeover period. This indicates that the learning capability of HCT enables it to update in real time and online. It should be noted that HCT's updating time is shorter than its training time because updating the separation matrices is much easier than constructing them from scratch.

C. Test Results of Warning Signal Generation and Fault Isolation

To test the validity of the proposed warning signal generation criteria and fault isolation scheme, we use six small sets of measured data patterns, which are also collected from the 42-recipe case but not included in the previous data set used to construct the HCT. Among them, the first two sets consist of abnormal wafers caused by machine faults, and the other four sets consist of abnormal wafers caused by electrical spikes. There are 50 wafers with destined recipe 39 in the first set, and the ten abnormal wafers, located from the 21st to the 30th wafers, are caused by a fault in a single attribute. The second set consists of 40 wafers with destined recipe 6 and ten abnormal wafers located from the 31st to the 40th wafers; the first abnormal wafer is caused by one faulty attribute, and the remaining nine wafers are caused by two faulty attributes. The third set consists of 20 wafers, and the two abnormal wafers are located at the 16th and 17th wafers, caused by an attribute whose values are affected by electrical spikes. Abnormal wafers caused by electrical spikes also occur in the fourth, fifth, and sixth sets of wafers; these three sets consist of 30 wafers each, and the two abnormal wafers are located at the 19th and 20th, the 23rd and 24th, and the 27th and 28th wafers, respectively, each pair caused by a single spike-affected attribute. We randomly pick six out of the ten existing HCT test data sets and insert the previous six small sets of data patterns into the six test data sets, one for each.

Setting the values of $\epsilon$ and $n_s$, and taking the classification error rate $e_i$ of each recipe from the results shown in Table II, we apply the HCT with the majority voting scheme to classify the six test data sets. In the first test data set, the warning signal generation criteria, i.e., (9) and (10), are satisfied at the fifth abnormal wafer of the first set.


TABLE VI

MISCLASSIFIED ABNORMAL WAFERS CAUSED BY ELECTRICAL SPIKES

This demonstrates that our warning signal generation criteria have successfully detected the fault. Now, we have the five consecutive misclassified wafers according to Step 1) of the fault isolation scheme, and according to Step 2) we collect the latest five correctly classified wafers, i.e., the 16th to the 20th wafers in the first small set of test data. The five misclassified wafers are all classified to recipe 24, while the previous five wafers are correctly classified to recipe 39. Since recipes 24 and 39 belong to different TCs, we apply Step 3.1) of the fault isolation scheme and find that there are only two traced-back tree paths, one from the TC containing recipe 24 and one from the TC containing recipe 39, as can be observed from Fig. 10. Thus, the faulty attribute causing the classification errors is the splitting attribute at the node where the two paths meet, with probability 1.0, and the corresponding subsystem is the mass analysis, also with probability 1.0. In the second test data set, the warning signal generation criteria are likewise satisfied at the fifth abnormal wafer, i.e., conditions (9) and (10) both hold there. Therefore, we have the five consecutive misclassified wafers from Step 1) and collect the latest five correctly classified wafers from Step 2). The five misclassified wafers are all classified to recipe 7, while the previous five wafers are all correctly classified to recipe 6. Since recipes 6 and 7 belong to the same TC, as can be observed from Fig. 10, we need to apply Step 3.2) to perform fault isolation. The CART for this TC is shown in Fig. 11. The latest five correctly classified wafers lie in the same terminal node for recipe 6 in Fig. 11. However, there are two different terminal nodes for the five misclassified wafers in Fig. 11. This is because the first abnormal wafer has only one faulty attribute, while the remaining four have two faulty attributes. The number inside the parentheses beside a terminal node in Fig. 11 denotes the number of misclassified wafers lying in that node. Applying Step 3.2) of the fault isolation scheme, the traced-back tree path for recipe 6 is shown by the solid line and the paths for recipe 7 are shown by dashed lines in Fig. 11. The faulty attributes, which are the splitting attributes of the nodes where the traced-back paths meet, are found with probability 0.2 and probability 0.8, respectively. The corresponding subsystem of both faulty attributes is the mass analysis, which thus has probability 1.0. For the third to the sixth sets of test data, the details of the misclassified abnormal wafers are tabulated in Table VI. From this table and Table II, we can easily find that conditions (9) and (10) cannot hold simultaneously. Thus, no warning signal is generated in any of these four cases.

V. CONCLUSION

The proposed classification-based fault detection and isolation scheme is a general methodology. By modifying the warning signal generation criteria to meet an individual machine's needs, this fault detection scheme is not limited to the ion implanter. The simplicity of the HCT-based fault isolation scheme makes HCT worthwhile, especially since its loss of accuracy can be remedied by the warning signal generation criteria when applying it to the ion implanter. Due to the efficient learning capability of HCT and the 0.05-s classification time for classifying the recipe of a working wafer, the proposed fault detection and isolation scheme can work online and in real time.

REFERENCES

[1] C. M. McKenna, “A personal historical perspective of ion implantation equipment for semiconductor applications,” in Proc. Int. Conf. Ion Implantation Technol., Sep. 2000, pp. 1–19.
[2] A. S. Willsky, “A survey of design methods for failure detection systems,” Automatica, vol. 12, no. 6, pp. 601–611, Nov. 1976.
[3] D. M. Himmelblau, Fault Detection and Diagnosis in Chemical and Petrochemical Processes. New York: Elsevier, Oct. 1978.
[4] R. Isermann, “Process fault detection based on modeling and estimation methods—A survey,” Automatica, vol. 20, no. 4, pp. 387–404, Jul. 1984.
[5] D. Barschdorff, Gearbox failure diagnostics, VDI-Verlag, Dusseldorf, VDI-Berichte Rep. 644, 1987, pp. 241–248.
[6] S. D. Stearns and D. R. Hush, Digital Signal Analysis, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1990.
[7] P. M. Frank, “Fault diagnosis in dynamic systems using analytical and knowledge-based redundancy,” Automatica, vol. 26, no. 3, pp. 459–474, May 1990.
[8] R. Isermann, “Fault diagnosis of machines via parameter estimation and knowledge processing,” Automatica, vol. 29, no. 4, pp. 815–835, Jul. 1993.
[9] J. Gertler, Fault Detection and Diagnosis in Engineering Systems. New York: Marcel Dekker, 1998.
[10] M. Basseville and A. Benveniste, Eds., Detection of Abrupt Changes in Signals and Dynamical Systems, vol. 77, Lecture Notes in Control and Information Sciences. Berlin, Germany: Springer-Verlag, Dec. 1985.
[11] D. Neumann, “Fault diagnosis of machine-tools by estimation of signal spectra,” in Proc. IFAC SAFEPROCESS Symp., Sep. 1991, vol. 1, pp. 73–78.
[12] H. H. Yue, S. J. Qin, R. J. Markle, C. Nauert, and M. Gatto, “Fault detection of plasma etchers using optical emission spectra,” IEEE Trans. Semiconduct. Manuf., vol. 13, no. 3, pp. 374–385, Aug. 2000.
[13] R. Isermann, “Model-based fault detection and diagnosis—Status and applications,” presented at the 16th Symp. Automatic Control in Aerospace, St. Petersburg, Russia, Jun. 2004, unpublished.
[14] ——, “Supervision, fault-detection and fault-diagnosis methods—An introduction,” Contr. Eng. Practice, vol. 5, no. 5, pp. 639–652, May 1997.
[15] G. M. Smith, Statistical Process Control and Quality Improvement, 5th ed. Englewood Cliffs, NJ: Prentice-Hall, 2003.
[16] M. H. Dunham, Data Mining: Introductory and Advanced Topics. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[17] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. London, U.K.: Chapman and Hall, 1984.
[18] R. L. Lawrence and A. Wright, “Rule-based classification systems using classification and regression tree (CART) analysis,” Photogrammetric Eng. Remote Sensing, vol. 67, no. 10, pp. 1137–1142, Oct. 2001.
[19] T. Denoeux, “A neural network classifier based on Dempster-Shafer theory,” IEEE Trans. Syst., Man, Cybern., vol. 30, no. 2, pt. A, pp. 131–150, Mar. 2000.
[20] G. P. Zhang, “Neural networks for classification: A survey,” IEEE Trans. Syst., Man, Cybern., vol. 30, no. 4, pt. C, pp. 451–462, Nov. 2000.
[21] J. Ediriwickrema and S. Khorram, “Hierarchical maximum-likelihood classification for improved accuracies,” IEEE Trans. Geosci. Remote Sensing, vol. 35, no. 4, pp. 810–816, Jul. 1997.
[22] L. Bruzzone and D. F. Prieto, “Unsupervised retraining of a maximum likelihood classifier for the analysis of multitemporal remote sensing images,” IEEE Trans. Geosci. Remote Sensing, vol. 39, no. 2, pp. 456–460, Feb. 2001.
[23] J. G. Marin-Blazquez and S. Qiang, “From approximative to descriptive fuzzy classifiers,” IEEE Trans. Fuzzy Syst., vol. 10, no. 4, pp. 484–497, Aug. 2002.


[24] X. Chang and J. H. Lilly, “Evolutionary design of a fuzzy classifier from data,” IEEE Trans. Syst., Man, Cybern., vol. 34, no. 2, pp. 1031–1044, Apr. 2004.
[25] D. M. Hawkins and G. V. Kass, “Automatic interaction detection,” in Topics in Applied Multivariate Analysis, D. M. Hawkins, Ed. Cambridge, U.K.: Cambridge Univ. Press, Apr. 1982, pp. 267–302.
[26] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, Jan. 1993.
[27] K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction to kernel-based learning algorithms,” IEEE Trans. Neural Networks, vol. 12, no. 2, pp. 181–202, Mar. 2001.
[28] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, Oct. 2001.
[29] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Ann. Statistics, vol. 29, no. 5, pp. 1189–1232, Oct. 2001.
[30] ——, “Stochastic gradient boosting,” Computational Statistics & Data Analysis, vol. 34, no. 4, pp. 367–378, Feb. 2002.
[31] A. Borisov, V. Eruhimov, and E. Tuv, “Boosting flexible learning ensembles with dynamic feature selection,” presented at the NIPS 2003 Workshop on Feature Extraction, British Columbia, Canada, Dec. 2003, unpublished.
[32] G. Grimmett and D. Stirzaker, Probability and Random Processes, 3rd ed. Oxford, U.K.: Oxford Univ. Press, May 2001.
[33] O. Cordon, M. J. del Jesus, and F. Herrera, “A proposal on reasoning methods in fuzzy rule-based classification systems,” Int. J. Approximate Reasoning, vol. 20, no. 1, pp. 21–45, Jan. 1999.
[34] H. Ishibuchi and T. Nakashima, “Effect of rule weights in fuzzy rule-based classification systems,” IEEE Trans. Fuzzy Syst., vol. 9, no. 4, pp. 506–515, Aug. 2001.
[35] J. R. Quinlan, Data Mining Tools See5 and C5.0, version 1.20, 2003. [Online]. Available: http://www.rulequest.com/see5-info.html

Shin-Yeu Lin was born in Taiwan, R.O.C. He received the B.S. degree in electronics engineering from National Chiao Tung University, Taiwan, in 1975, the M.S. degree in electrical engineering from the University of Texas, El Paso, in 1979, and the D.Sc. degree in systems science and mathematics from Washington University, St. Louis, MO, in 1983.

From 1984 to 1985, he was with Washington University, working first as a Research Associate and then as a Visiting Assistant Professor. From 1985 to 1986, he was with GTE Laboratory, working as a Senior MTS. He joined the Department of Electrical and Control Engineering at National Chiao Tung University in 1987 and has been a Professor since 1992. His research interests include data mining, ordinal optimization theory and applications, and distributed computations.

Shih-Cheng Horng was born in Taiwan, R.O.C. He received the B.S. and M.S. degrees in electrical and control engineering from National Chiao Tung University, Taiwan, in 1993 and 1995, respectively. He is currently pursuing the Ph.D. degree at the same university.

He is a Lecturer in the Department of Electronic Engineering, Chinmin Institute of Technology, Miaoli, Taiwan. His research interests include optimization theory with applications to large semiconductor fab-related problems, data mining, and modeling of large complex systems.
