Summary - EXPERIMENT AND RESULTS - A Rule-Based Classification Algorithm: A Rough Set Approach

CHAPTER 5 EXPERIMENT AND RESULTS

5.4 Summary

Table XXI. The comparison of DiscPow_Chi and Information Gain feature selection.

Data sets                        DiscPow_Chi  InfoGain  DiscPow_Chi  InfoGain

Agaricus-lepiota         5/22    100.0        99.9      100.0        99.9
Audiology.standardized   13/69   69.0         66.5      76.0         70.5
Car                      6/6     88.3         88.3      92.7         92.7

contradictions which are originally in the data, since ROUSER simply ignores contradictions.

The embedded feature selection method is deterministic and is more likely to avoid accuracy loss. ROUSER has some good properties; how to keep them while avoiding its shortcomings will be the focus of future work.

National Chengchi University (國立政治大學)

CHAPTER 6 CASE STUDY

6.1 Introduction

The objective of this case study is to find the cause of machine faults in a roughing mill of a hot strip mill at the largest steelmaking company in Taiwan. First, we introduce how data mining techniques are applied in a cooperative data analysis. Second, we describe the background knowledge of the case study and briefly explain the data. Then we show how we build models and rules to analyze the cause of the machine faults. After that, we look into the data and inspect the rules we built. Finally, we conclude the case study.

6.1.1 Background Knowledge

We first introduce the background knowledge about the manufacturing process. The function of a hot strip mill is to turn a slab into a coil, which is convenient for further processing. There are two hot strip mills, and their internal structures differ; we focus on one of them. After a slab enters the hot strip mill, it must be heated in the furnace until it becomes soft. The prepared slab then enters the roughing mill. A roughing stand contains two parts, the edge mill and the rough mill; the former adjusts the slab to the proper width, while the latter thins the slab. After this, the slab becomes a transfer bar. The transfer bar is sliced into pieces by the crop shear, and finally it enters the finish mill and becomes a coil after cooling.


mill directly contact the slab when rolling, and the torque comes from the engines connected by the spindle. In recent years the spindles have broken frequently, and the experts suspect that the cause is a slip occurring during the rolling process. The rough mill rolls the slab back and forth for 5 passes, and each rolling pass makes the slab thinner. A slip may occur in any rolling pass; when it does, the spindles may suffer an unexpected impact, and a slip may lead to metal fatigue.

6.1.3 Data

The data collected from the mill can be roughly categorized into three types, namely the materials, the mill, and the rolls.

Material Data

Material data contains features of the material, such as the steel family; steels in the same family have similar properties. The material data also contains the thickness draft of each pass performed by the rough mill.

Mill Data

Mill data comes from the mill itself. Some attributes, such as the moving speed of a slab, are not easy to measure directly, so the experts measure the rolling speed of the mill to represent the slab moving speed; the speed draft of a slab is also a parameter setting of the mill. The slab has a threading speed, which is the initial speed of threading. The running speed is the speed right before the slab enters the mill. Only the default settings of these two speeds are recorded. Both the roll torque of the working rolls and the roll force pressed on the slab are generated by the motors of the mill.


There are two working rolls, top and bottom, and there are also plenty of measurement results about the working rolls. Here we introduce the rolling torque only. The difference between the roll torque and the rolling torque is that the roll torque is measured from the motors of the mill while the rolling torque is measured from the working rolls themselves.

6.2 Model Building

6.2.1 Data Preprocessing

Data Cleaning

There are 21,907 data records and 187 attributes (excluding the class attributes) in the original data set, which was collected from the hot strip mill over 2 months. After data cleaning, which removes erroneous data records, redundant attributes, duplicate attributes, serial numbers, and time stamps, 21,891 data records and 172 attributes (excluding the class attributes) remain.
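The cleaning steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used in the case study: the column names (`serial_no`, `timestamp`, `speed_copy`) are hypothetical, and an "error" record is assumed here to be one with a missing value.

```python
def clean_records(records, drop_columns):
    """Drop unwanted columns, duplicate-valued columns, and incomplete records."""
    # 1. Remove records with missing values (assumed definition of an error record).
    rows = [r for r in records if all(v is not None for v in r.values())]
    if not rows:
        return []
    # 2. Remove explicitly unwanted columns such as serial numbers and time stamps.
    columns = [c for c in rows[0] if c not in drop_columns]
    # 3. Remove duplicate attributes: columns whose values repeat an earlier column.
    kept, seen = [], set()
    for c in columns:
        signature = tuple(r[c] for r in rows)
        if signature not in seen:
            seen.add(signature)
            kept.append(c)
    return [{c: r[c] for c in kept} for r in rows]


# Hypothetical toy records, for illustration only.
raw = [
    {"serial_no": 1, "timestamp": "t1", "speed": 2.0, "speed_copy": 2.0, "force": 10.0},
    {"serial_no": 2, "timestamp": "t2", "speed": 2.5, "speed_copy": 2.5, "force": None},
    {"serial_no": 3, "timestamp": "t3", "speed": 3.0, "speed_copy": 3.0, "force": 12.0},
]
cleaned = clean_records(raw, drop_columns={"serial_no", "timestamp"})
```

In this toy example the second record is discarded because of its missing force value, and `speed_copy` is dropped because it duplicates `speed`.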

Data Reformatting

Each data record is bound to a particular piece of material, which starts as a slab and ends as a coil, and hence the data collected from all 5 passes is packed into one record. Static information, such as the material data introduced earlier, is also included, as is the torque ratio of each pass. Data in this format is not suitable for analysis by a classification algorithm, so we reformat the original data set into 5 data sets based on the pass number.

As mentioned before, materials in the same family share similar physical properties, and hence we also divide the original data by family. There are 27 families in the original data set and 5 passes for each record, and hence the original data set is reformatted into 27 × 5 = 135 data sets.
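The reformatting by family and pass can be sketched as below. The sketch assumes that pass-specific attributes carry a `_pass1` ... `_pass5` suffix (similar to the attribute names that appear in the rules of Section 6.3) and that each record has a `family` field; both the field names and the toy records are hypothetical.

```python
from collections import defaultdict

N_PASSES = 5


def split_by_family_and_pass(records, family_key="family"):
    """Return {(family, pass_no): rows}, keeping static and same-pass attributes."""
    datasets = defaultdict(list)
    for rec in records:
        # Static attributes are those not tied to any particular pass.
        static = {k: v for k, v in rec.items() if "_pass" not in k}
        for p in range(1, N_PASSES + 1):
            suffix = f"_pass{p}"
            per_pass = {k: v for k, v in rec.items() if k.endswith(suffix)}
            if per_pass:
                datasets[(rec[family_key], p)].append({**static, **per_pass})
    return dict(datasets)


# Hypothetical wide records: one row per slab, columns for several passes.
wide = [
    {"family": 27, "width": 1200, "torque_pass1": 900.0, "torque_pass5": 3100.0},
    {"family": 3, "width": 1000, "torque_pass1": 800.0, "torque_pass5": 2500.0},
]
per_pass_sets = split_by_family_and_pass(wide)
```

With 27 families and 5 passes, this kind of split yields at most 27 × 5 = 135 data sets, one per (family, pass) pair.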


implementation of Fayyad & Irani's MDL method [6].

Data Integration: Torque Ratio

The torque ratio is the class attribute, and it is derived from the rolling torque just mentioned. The value of the torque ratio can be used to determine whether a slip is happening. The torque ratio ranges over [0, 1]. The safe range is (0.45, 0.55) for passes 1 and 2, (0.4, 0.6) for pass 3, and (0.35, 0.65) for passes 4 and 5. Values outside the safe range fall in the slip range.
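The labeling rule can be written as a small function. This is a sketch rather than the code used in the study. The upper bound for passes 4 and 5 is taken as 0.65, consistent with the slip interval [0.65, 1] that appears in the rules of Section 6.3; treat that value, like the function and table names, as an assumption of the sketch.

```python
# Safe (open) intervals of the torque ratio per pass; anything else counts as slip.
# Assumption: the upper bound for passes 4 and 5 is 0.65 (see the note above).
SAFE_RANGE = {
    1: (0.45, 0.55),
    2: (0.45, 0.55),
    3: (0.40, 0.60),
    4: (0.35, 0.65),
    5: (0.35, 0.65),
}


def is_slip(torque_ratio, pass_no):
    """Return True when the torque ratio lies outside the safe range of the pass."""
    if not 0.0 <= torque_ratio <= 1.0:
        raise ValueError("torque ratio must lie in [0, 1]")
    low, high = SAFE_RANGE[pass_no]
    return not (low < torque_ratio < high)
```

For example, a torque ratio of 0.58 counts as a slip in pass 1 but is still safe in pass 3, because the safe range widens in later passes.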

Feature Selection

Some attributes are removed, for various reasons. Some are removed because they are already known to depend on the class attribute; such attributes would dominate the results without helping to explain the problem. Others are removed because they are measured only after a slip has happened.

Absolute time stamps and serial numbers are also removed.

Slip data records are rare in the final data sets.

6.2.2 Classification Algorithms

We use ROUSER to analyze the discretized data sets, and we use JRip and J48 provided by Weka to analyze the data sets with real numbers. We use the default setting of JRip (-F 3 -N 2.0 -O 2 -S 1) together with two more settings (-F 3 -N 1.0 -O 0 -S 1 -E and -F 3 -N 1.0 -O 0 -S 1 -E -P). We also use the default setting of J48 (-C 0.25 -M 2) together with two more settings (-S -C 0.5 -M 1 and -S -C 0.25 -M 2). Since slip records are rare in the data sets, the default settings of JRip and J48 may treat them as noise and ignore them in pursuit of overall accuracy; as a result, the models are too simple and explain nothing. We therefore try several settings that disable the mechanisms designed to prevent models from growing too complex and overfitting.
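For reference, the six configurations listed above can be collected in one place. The sketch below only assembles command lines and does not run Weka. It assumes Weka's command-line interface with the classes `weka.classifiers.rules.JRip` and `weka.classifiers.trees.J48` and the `-t` training-file option; the ARFF file name and the location of `weka.jar` are hypothetical.

```python
# Option strings copied from the text; the first entry of each list is the default.
SETTINGS = {
    "weka.classifiers.rules.JRip": [
        "-F 3 -N 2.0 -O 2 -S 1",
        "-F 3 -N 1.0 -O 0 -S 1 -E",
        "-F 3 -N 1.0 -O 0 -S 1 -E -P",
    ],
    "weka.classifiers.trees.J48": [
        "-C 0.25 -M 2",
        "-S -C 0.5 -M 1",
        "-S -C 0.25 -M 2",
    ],
}


def weka_commands(arff_path, weka_jar="weka.jar"):
    """Build one argument list per (classifier, option set) pair."""
    commands = []
    for classifier, option_sets in SETTINGS.items():
        for options in option_sets:
            commands.append(
                ["java", "-cp", weka_jar, classifier, "-t", arff_path] + options.split()
            )
    return commands


# Hypothetical file name for the family 27, pass 5 data set.
commands = weka_commands("family27_pass5.arff")
```

In Weka's documentation for JRip, `-O` sets the number of optimization runs, `-E` turns off the error-rate check in the stopping criterion, and `-P` turns off pruning; for J48, `-S` disables subtree raising and `-M` sets the minimum number of instances per leaf. These are the mechanisms the additional settings relax.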

6.3 Inspect the Result Rules

We generate many rule sets from the 135 data sets with 3 algorithms and different settings, and we present some of the rules here. The following rules were generated by J48 on the data set of family 27 at pass 5:

(R2 roll torque_pass5 [kNm]\[1] >= 3078.667) => Torque ratio p5=[0.65,1] (27.0/0.0).

(R2 roll torque_pass5 [kNm]\[1] >= 2585.333) and (R2 total roll force_pass5 [kN]\[1] <= 23046.67) => Torque ratio p5=[0.65,1] (3.0/0.0).

The first rule indicates that if the torque measured from the motors of the mill for pass 5 is greater than or equal to 3078.667, then the torque ratio of pass 5 will be in the range [0.65,1], which is a slip range. The second rule indicates that if the torque measured from the motors of the mill for pass 5 is greater than or equal to 2585.333 and the force measured from the motors of the mill for pass 5 is less than or equal to 23046.67, then the torque ratio of pass 5 will also be in the slip range [0.65,1]. From these two rules we may conclude that a slip may occur when the torque measured from the motors of the mill is too high, and sometimes when the force measured from the motors of the mill is too low.
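To make the reading of the two rules concrete, they can be expressed as an ordered check on a single record. The dictionary keys `roll_torque_p5` and `total_roll_force_p5` are shorthand introduced here for the two attributes in the rules; the function is an illustrative sketch, not output of any tool.

```python
SLIP_CLASS = "[0.65,1]"


def predict_torque_ratio_p5(record):
    """Apply the two rules in order; return the slip interval or None if none fires."""
    torque = record["roll_torque_p5"]       # roll torque at pass 5, in kNm
    force = record["total_roll_force_p5"]   # total roll force at pass 5, in kN
    # Rule 1: very high torque alone indicates the slip interval.
    if torque >= 3078.667:
        return SLIP_CLASS
    # Rule 2: moderately high torque combined with low force.
    if torque >= 2585.333 and force <= 23046.67:
        return SLIP_CLASS
    return None
```

A record with torque 2600 kNm and force 23000 kN is assigned to the slip interval by the second rule, while the same torque with a force of 25000 kN matches neither rule.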

Our job is to summarize the results and let the experts inspect them. We need feedback from the experts to improve the experiments.


Consider the rules above: the torque measured from the motors of the mill for pass 5 appears twice, and the force measured from the motors of the mill for pass 5 appears once. The more frequently an attribute appears in the rules, the more important we consider it to be, especially when plenty of rules have been built.
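Counting attribute occurrences over many rule sets is simple to automate. The sketch below assumes each rule is stored as a list of (attribute, operator, value) conditions plus a class label; this representation and the shortened attribute names are hypothetical.

```python
from collections import Counter


def attribute_frequency(rule_sets):
    """Count how often each attribute appears in the conditions of all rules."""
    counts = Counter()
    for rules in rule_sets:
        for conditions, _label in rules:
            counts.update(attr for attr, _op, _value in conditions)
    return counts


# The two rules for family 27 at pass 5, in the assumed representation.
rules_family27_pass5 = [
    ([("roll_torque_p5", ">=", 3078.667)], "[0.65,1]"),
    ([("roll_torque_p5", ">=", 2585.333),
      ("total_roll_force_p5", "<=", 23046.67)], "[0.65,1]"),
]
freq = attribute_frequency([rules_family27_pass5])
```

Here the torque attribute is counted twice and the force attribute once, matching the manual count; attributes that never appear simply have a count of zero.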

Across all rule sets we observe the same phenomenon: the torque measured from the motors of the mill when biting a slab is the most frequent attribute for passes 3, 4, and 5. The rolling speed measured from the motors of the mill is the second most frequent attribute for passes 3, 4, and 5.

We also discover that some attributes never appear in any rule. This discovery may help the experts to reduce dimensions when building a predictor.

The third most frequent attribute for passes 3 and 4 is the rolling speed of the top working roll, which is preferred by JRip and J48, and the third most frequent attribute for pass 5 is the bottom working roll number, which is preferred by ROUSER and never chosen by JRip and J48. The experts consider both results reasonable. We find that JRip and J48 prefer real-number attributes and may overlook some important nominal attributes.

From the results we find that the default settings of running speed, threading speed, force, and torque are involved, while the thickness draft of each pass is not. We look into the data for evidence to support this finding. First, we find that the thickness draft of each pass differs only slightly among the data records, which may be why the thickness is not involved. Second, we find that different records with exactly the same slab properties (such as family) and the same size of finished product may have different settings on the mill; some setting combinations are rare relative to the other records with the same slab properties, and these rare settings are usually accompanied by a slip. We consider this phenomenon to be one of the causes of slip.
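The check on rare setting combinations can be sketched as follows: group records by slab properties, find the setting combinations that are rare within each group, and compare slip rates. The field names, the boolean `slip` field, and the 20% rarity threshold (`max_share`) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def rare_setting_slip_rate(records, property_keys, setting_keys, max_share=0.2):
    """Compare slip rates of rare vs. common setting combinations per slab group."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[k] for k in property_keys)].append(r)

    tally = {"rare": [0, 0], "common": [0, 0]}  # [slip count, record count]
    for rows in groups.values():
        combo_counts = Counter(tuple(r[k] for k in setting_keys) for r in rows)
        for r in rows:
            share = combo_counts[tuple(r[k] for k in setting_keys)] / len(rows)
            kind = "rare" if share <= max_share else "common"
            tally[kind][0] += int(r["slip"])
            tally[kind][1] += 1
    return {kind: (s / n if n else 0.0) for kind, (s, n) in tally.items()}


# Hypothetical toy data: five records share one setting, one record deviates.
records = [{"family": 27, "threading_speed": 2.0, "slip": False} for _ in range(5)]
records.append({"family": 27, "threading_speed": 3.5, "slip": True})
rates = rare_setting_slip_rate(records, ["family"], ["threading_speed"])
```

A markedly higher slip rate among the rare combinations than among the common ones would be in line with the observation above, although such a comparison alone does not establish the cause.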

6.4 Summary

Through data mining techniques we narrow the range to explore for the problem occurring in a rough mill. The attributes chosen in our experiments are considered reasonable, and we find that JRip and J48 are good at capturing important real-number attributes, while ROUSER is good at capturing important nominal attributes. The results also respond to the experts' doubts about the default settings. Following the narrowed clues, we look into the data and find some evidence that the default settings may be one cause of slip.


CHAPTER 7 CONCLUSIONS AND FUTURE WORK

A rule-based classification algorithm named ROUSER is proposed. It is designed to process nominal data and generate human understandable decision rules. ROUSER uses a rough set approach as its search heuristic, and the rule generation method of ROUSER is based on the separate-and-conquer strategy.
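As a reminder of what the separate-and-conquer strategy looks like, a generic sketch for nominal data follows. It is not ROUSER: the search heuristic used here is plain rule purity with coverage as a tie-breaker, chosen only for brevity, whereas ROUSER uses its rough-set-based heuristic. The function names and the toy data are hypothetical.

```python
def purity(rows, attr, value, target, class_key):
    """Return (share of target-class rows, coverage) for the test attr == value."""
    subset = [r for r in rows if r[attr] == value]
    hits = sum(r[class_key] == target for r in subset)
    return hits / len(subset), len(subset)


def learn_rules(rows, target, class_key="label"):
    """Learn a list of conjunctive rules (dicts of attr -> value) for one class."""
    rules, remaining = [], list(rows)
    while any(r[class_key] == target for r in remaining):
        covered, conditions = remaining, {}
        # Conquer: specialise the rule until it covers only the target class.
        while any(r[class_key] != target for r in covered):
            candidates = sorted({(a, r[a]) for r in covered for a in r
                                 if a != class_key and a not in conditions})
            if not candidates:
                break  # contradictory records: no attribute left to separate them
            attr, value = max(
                candidates,
                key=lambda c: purity(covered, c[0], c[1], target, class_key))
            conditions[attr] = value
            covered = [r for r in covered if r[attr] == value]
        rules.append(conditions)
        # Separate: remove the records covered by the new rule.
        remaining = [r for r in remaining if r not in covered]
    return rules


# Hypothetical toy data set.
data = [
    {"outlook": "sunny", "windy": "no", "label": "play"},
    {"outlook": "sunny", "windy": "yes", "label": "play"},
    {"outlook": "rainy", "windy": "yes", "label": "stay"},
    {"outlook": "rainy", "windy": "no", "label": "play"},
]
rules = learn_rules(data, target="play")
```

On the toy data the sketch first learns `outlook = sunny`, removes the two records it covers, and then learns `windy = no` for the remaining "play" record.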

As a prototype without the optimization stage or the pruning stage to reduce errors, ROUSER still provides classification performance comparable to or even better than that given by the rule-based or tree-based classification algorithms considered in experiments.

Since the search heuristic of ROUSER is totally different from the search heuristics (Entropy and Information Gain) used by the other three algorithms, the results imply that the proposed PotBound and DiscPow are useful. This also shows the potential of ROUSER and gives an example of future work.

For future work, we plan to conduct more experiments, develop better strategies to select attributes and handle contradictions, and apply ROUSER to data sets obtained from a real-world case study.


REFERENCES

[1] J. G. Bazan, H. S. Nguyen, S. H. Nguyen, P. Synak, and J. Wróblewski, "Rough set algorithms in classification problem," in Rough set methods and applications, ed: Physica-Verlag GmbH, 2000, pp. 49-88.

[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming: Athena Scientific, 1996.

[3] W. W. Cohen, "Fast Effective Rule Induction," Proc. 12th Int'l Conf. Machine Learning (ICML), pp. 115-123, 1995.

[4] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.

[5] J. Dai, Q. Xu, and W. Wang, "A comparative study on strategies of rule induction for incomplete data based on rough set approach," International Journal of Advancements in Computing Technology, vol. 3, pp. 176-183, 2011.

[6] U. M. Fayyad and K. B. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," presented at the Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022-1027, 1993.

[7] J. Fürnkranz, "Separate-and-Conquer Rule Learning," Artif. Intell. Rev., vol. 13, pp. 3-54, 1999.

[8] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explor. Newsl., vol. 11, pp. 10-18, 2009.


[9] L. Huan and R. Setiono, "Chi2: feature selection and discretization of numeric attributes," in Tools with Artificial Intelligence, 1995. Proceedings of IEEE Seventh International Conference on, 1995, pp. 388-391.

[10] J. C. Huhn and E. Hullermeier, "FR3: A Fuzzy Rule Learner for Inducing Reliable Classifiers," Fuzzy Systems, IEEE Transactions on, vol. 17, pp. 138-149, 2009.

[11] W. Jiabing, Z. Pei, W. Guihua, and W. Jia, "Classifying Categorical Data by Rule-Based Neighbors," in Data Mining (ICDM), 2011 IEEE 11th International Conference on, 2011, pp. 1248-1253.

[12] X. Jin, A. Xu, R. Bie, and P. Guo, "Machine Learning Techniques and Chi-Square Feature Selection for Cancer Classification Using SAGE Gene Expression Profiles," in Data Mining for Biomedical Applications, vol. 3916, J. Li, Q. Yang, and A.-H. Tan, Eds., ed: Springer Berlin / Heidelberg, 2006, pp. 106-115.

[13] T. M. Mitchell, Machine Learning: McGraw-Hill, 1997.

[14] G. Pagallo and D. Haussler, "Boolean Feature Discovery in Empirical Learning," Machine Learning, vol. 5, pp. 71-99, 1990.

[15] Z. Pawlak, "Some Issues on Rough Sets," in Transactions on Rough Sets I, vol. 3100, J. Peters, A. Skowron, J. Grzymala-Busse, B. Kostek, R. Swiniarski, and M. Szczuka, Eds., ed: Springer Berlin / Heidelberg, 2004, pp. 1-58.

[16] Z. Pawlak and A. Skowron, "Rudiments of rough sets," Information Sciences, vol. 177, no. 1, pp. 3-27, 2007.

[17] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81-106, 1986.


[18] J. R. Quinlan, C4.5: programs for machine learning: Morgan Kaufmann Publishers Inc., 1993.

[19] J. Stefanowski and K. Slowiński, "Rough Set Theory and Rule Induction Techniques For Discovery of Attribute Dependencies in Medical Information Systems," in Principles of Data Mining and Knowledge Discovery, vol. 1263, J. Komorowski and J. Zytkow, Eds., ed: Springer Berlin / Heidelberg, 1997, pp. 36-46.

[20] S. M. Weiss and N. Indurkhya, "Reduced complexity rule induction," presented at the Proceedings of the 12th international joint conference on Artificial intelligence - Volume 2, Sydney, New South Wales, Australia, 1991.

[21] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. McLachlan, A. Ng, B. Liu, P. Yu, Z.-H. Zhou, M. Steinbach, D. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, pp. 1-37, 2008.

[22] L. A. Zadeh, "Fuzzy Sets," Information and Control, vol. 8, pp. 338–353, 1965.

[23] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[24] "Data Mining Curriculum". ACM SIGKDD. 2006-04-30. Retrieved 2011-10-28.
