Data Pre-processing

3 Methods Design


National Chengchi University

Finally, duplicate data arise when data are integrated or collected from multiple sources. While integrating data from multiple sources, the amount of data increases and records become duplicated (Tamilselvi et al., 2011). In my case, the 'process-name' in the ITEM data table duplicates the 'route-name' in the WIP data table; hence duplicate data exist, and one of the two should be removed.

After finishing the above procedures, since the raw data come from two different systems, a proper foreign key needs to be found to merge the two data sources into a single dataset for data pre-processing.
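The merge and de-duplication step described above could be sketched as follows. This is only an illustration under stated assumptions: the key column `lot_id`, the other column names, and the sample values are hypothetical; only 'process-name' and 'route-name' come from the text.

```python
import pandas as pd

# Hypothetical miniature tables. Only 'process-name' and 'route-name'
# are named in the text; 'lot_id' is an assumed foreign key.
item = pd.DataFrame({
    "lot_id": ["A1", "A2", "A3"],
    "process-name": ["drill", "drill", "plate"],
    "layer_count": [4, 6, 4],
})
wip = pd.DataFrame({
    "lot_id": ["A1", "A2", "A3"],
    "route-name": ["drill", "drill", "plate"],
    "cycle_time_hr": [5.2, 6.1, 3.9],
})

# Merge the two sources on the shared key.
merged = item.merge(wip, on="lot_id", how="inner")

# Drop 'route-name' only if it really repeats 'process-name' row by row.
if (merged["process-name"] == merged["route-name"]).all():
    merged = merged.drop(columns=["route-name"])
```

Checking equality before dropping guards against removing a column that only looks redundant by name.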

3.2 Data Pre-processing

Dirty data may result from sensor failure, data transmission errors, or improper data entry, and many of these causes may be unknown at the time of data collection, so data pre-processing is an important and time-consuming stage in data mining. Data pre-processing consists of "data filtering", "data cleaning", "data transformation" and "data reduction". In the data-cleaning step, I should consider noisy data, missing-value data, and outlier data in order to improve data quality for model building. In the following sections, I describe the routine handling of each situation considered in this research work.

3.2.1 Data Filtering

Whenever data is collected and used for a mining project, the miner needs to have some underlying idea, rationale, or theory as to why that particular data set can address the problem area. This idea, rationale, or theory forms the explanatory structure for the data set.

It explains how the variables are expected to relate to each other, and how the data set as a whole relates to the problem. It establishes a reason for why the selected data set is appropriate to use (Pyle, 1999). Accordingly, I select mass-production WIP lots, excluding experimental and engineering lots, and one specific IC substrate operation, drilling, as the empirical object of this research work.
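The selection rule above could be expressed as a simple row filter. The sketch below assumes hypothetical column names (`lot_type`, `operation`) and category labels; the text does not state how lot types are encoded in the WIP table.

```python
import pandas as pd

# Hypothetical WIP records; column names and labels are assumptions.
wip = pd.DataFrame({
    "lot_id": ["A1", "A2", "A3", "A4"],
    "lot_type": ["production", "experimental", "engineering", "production"],
    "operation": ["drilling", "drilling", "drilling", "plating"],
})

# Keep production lots only, and only the drilling operation.
excluded_types = {"experimental", "engineering"}
mask = ~wip["lot_type"].isin(excluded_types) & (wip["operation"] == "drilling")
filtered = wip[mask].reset_index(drop=True)
```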

3.2.2 Data Cleaning

In the "data cleaning" step, real-world data present many particular and complex cases, each of which calls for a different technical approach.

(a) Missing data

If many tuples have no recorded value for several attributes, the missing values can be filled in by the various methods described below (Arora et al., 2012); methods a, b, and f are applied in this work.

a. Ignore the tuple.

b. Fill in the missing value manually.

c. Use a global constant to fill in the missing value.

d. Use the attribute mean to fill in the missing value.

e. Use the attribute mean for all samples belonging to the same class as the given tuple.

f. Use the most probable value to fill in the missing value.
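Most of the options listed above can be illustrated with pandas. The following is a minimal sketch on made-up data; the column names are hypothetical, and method f is approximated here by the most frequent value (the mode), which is only one simple stand-in for a "most probable value" (inference-based estimates such as regression are another). Method b, manual filling, has no code equivalent and is omitted.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps; column names are assumptions.
df = pd.DataFrame({
    "machine": ["M1", "M1", "M2", "M2", "M2"],
    "hole_count": [100.0, np.nan, 300.0, 320.0, np.nan],
    "drill_type": ["laser", None, "mech", "mech", None],
})

# a. Ignore the tuple: drop rows that contain any missing value.
dropped = df.dropna()

# c. Fill with a global constant.
const_filled = df.fillna({"hole_count": -1, "drill_type": "unknown"})

# d. Fill with the attribute mean.
mean_filled = df["hole_count"].fillna(df["hole_count"].mean())

# e. Fill with the mean of the same class (here the class is the machine).
class_mean_filled = df["hole_count"].fillna(
    df.groupby("machine")["hole_count"].transform("mean")
)

# f. Fill with the most probable value, approximated by the mode.
mode_filled = df["drill_type"].fillna(df["drill_type"].mode().iloc[0])
```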


(b) Noise data

Noise is a random error or variance in a measured variable. Noise in the data can be attributed to several causes: one is data measurement or transmission errors, and another is inherent factors such as the characteristics of the processes or systems from which the data are collected (Arora et al., 2012). Noise can be handled by several approaches, such as those listed in items a to d below. After considering causal relations and information retention, the binning method is applied here to all data columns whose values are on a numerical scale.
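As an illustration of the binning method, the sketch below smooths a numeric column by bin means using equal-frequency bins. The number of bins (3), the choice of equal-frequency partitioning, and smoothing by means are assumptions; the text does not state which variant was used.

```python
import pandas as pd

# Made-up numeric values.
values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Partition into 3 equal-frequency bins (bin labels 0, 1, 2).
bins = pd.qcut(values, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = values.groupby(bins).transform("mean")
```

Each value is pulled toward its neighbours in the same bin, which dampens random fluctuation while keeping the column numeric.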

a. Binning method.

b. Clustering method.

c. Regression method.

d. Combined computer and human inspection.

(c) Outlier data

An outlier is a single, or very low frequency, occurrence of the value of a variable that is far away from the bulk of the values of the variable. It is a problem because, for some modeling methods in particular (some types of neural network, for instance), outliers may distort the remaining data to the point of uselessness. Although I considered the approaches below (Chien et al., 2014) for handling outlier data, the interquartile range (IQR) measurement was finally adopted to identify and remove outliers.

a. To eliminate directly.

b. To use clustering method.

c. To use interquartile range (IQR) measurement.

d. To replace it with another value using a standardized transformation.
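The IQR approach (item c) can be sketched as below. The fence multiplier k = 1.5 is the common Tukey convention and is an assumption here, as are the column name and sample values; the text does not give the threshold actually used.

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

# Made-up cycle times with one extreme value.
data = pd.DataFrame({"cycle_time_hr": [5.0, 5.2, 5.1, 4.9, 5.3, 12.0]})
cleaned = remove_iqr_outliers(data, "cycle_time_hr")
```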

3.2.3 Data Transformation

In the data transformation step, data are converted and consolidated into forms appropriate for data mining. There are many specialized approaches to data transformation for particular cases. In this research, the approaches applied are conversion to dummy variables, aggregation, normalization, and standardization (Arora et al., 2012).

(a) Conversion to dummy variables, where a dummy variable, also known as an indicator variable, is an artificial variable created to represent an attribute with two or more distinct categories/levels.

(b) Aggregation, where summary or aggregation operations are applied to the data.

(c) Normalization, where the attribute data are scaled so as to fall within a small specified range, usually a primary target scale.

(d) Standardization, also well known in statistics as the Z-score transformation.
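The four transformations (a) to (d) can be sketched in pandas as follows. The data and column names are made up, min-max scaling to [0, 1] is assumed as the normalization range, and the population standard deviation (ddof=0) is assumed for the Z-score; none of these choices is stated in the text.

```python
import pandas as pd

# Made-up records; column names are assumptions.
df = pd.DataFrame({
    "lot_id": ["A1", "A1", "A2", "A2"],
    "drill_type": ["laser", "mech", "laser", "laser"],
    "hole_count": [100.0, 200.0, 300.0, 400.0],
})

# (a) Dummy variables: one indicator column per category.
dummies = pd.get_dummies(df["drill_type"], prefix="drill_type")

# (b) Aggregation: total hole count per lot.
per_lot = df.groupby("lot_id")["hole_count"].sum()

# (c) Normalization: min-max scaling into [0, 1].
x = df["hole_count"]
normalized = (x - x.min()) / (x.max() - x.min())

# (d) Standardization: Z-score with population standard deviation.
standardized = (x - x.mean()) / x.std(ddof=0)
```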

3.2.4 Data Reduction

Complex data analysis and mining on huge amounts of data may take a very long time, making such analysis impractical or infeasible.

Data reduction techniques help by allowing analysis of a reduced representation of the dataset without compromising the integrity of the original data, while still producing quality knowledge. The concept of data reduction is commonly understood as either reducing the volume or reducing the dimensions (Arora et al., 2012). In this work, the following approaches are used to carry out this step.

(a) Generalization, where low level or primitive (raw) data are replaced by higher level concepts through the use of concept hierarchies. It is also known as “concept hierarchy generation”.

(b) Discretization, which is the process of quantizing continuous attributes. In other words, it is the process of putting values into buckets so that there are a limited number of possible states. Successful discretization can significantly extend the applicability of many learning algorithms.
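Both reduction steps can be illustrated briefly. In the sketch below, the machine-to-line mapping stands in for a concept hierarchy and the diameter cut points define the buckets; the column names, the mapping, and the cut points are all hypothetical.

```python
import pandas as pd

# Made-up records; column names are assumptions.
df = pd.DataFrame({
    "machine": ["M1", "M2", "M3", "M4"],
    "hole_diameter_mm": [0.08, 0.15, 0.30, 0.60],
})

# (a) Generalization: replace a low-level value (machine) with a
# higher-level concept (production line) via a hypothetical hierarchy.
machine_to_line = {"M1": "line_A", "M2": "line_A", "M3": "line_B", "M4": "line_B"}
df["line"] = df["machine"].map(machine_to_line)

# (b) Discretization: put continuous diameters into three buckets.
# Intervals are right-closed: (0, 0.1], (0.1, 0.25], (0.25, 1.0].
df["diameter_class"] = pd.cut(
    df["hole_diameter_mm"],
    bins=[0.0, 0.1, 0.25, 1.0],
    labels=["small", "medium", "large"],
)
```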

Source: A Study on Feature Selection for IC Substrate Process Time: The Case of the Drilling Operation, NCCU Academic Hub (pp. 21-25)
