Research Background - A Study of a Genetic-Algorithm-Based Fuzzy ID3 Method and the Pruning of Its Decision Trees

Chapter 1. Introduction

1.1. Research Background

Learning is very important for human beings. An infant learns how to eat, how to speak, and how to walk. Without learning, people are unable to profit from their experience or to adapt to changing conditions. Learning is an essential component of any intelligent system, whether human, animal, or machine. There are two significant kinds of learning: the acquisition of new knowledge and the acquisition of new skills. Through learning, a system can gain experience or extract knowledge from processing.

In modern life, we observe exponential growth in the amount of data and information available on the Internet and in database systems. However, this data is often disorganized and difficult to understand. Researchers often use machine learning (ML) algorithms to automate the processing of data and the extraction of knowledge from it.

Inductive ML algorithms are used to generate classification rules from class-labeled examples that are described by a set of numerical (e.g., 1, 2, 4), symbolic (e.g., black, white), or continuous attributes. By analyzing the data, we can obtain information from it or discover the rules that govern it.

Machine learning [1] is programming computers to optimize a performance criterion using example data or past experience. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience. The model may be predictive, to make predictions in the future, or descriptive, to gain knowledge from data, or both.

Machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample. The role of computer science is twofold. First, in training, we need efficient algorithms to solve the optimization problem, as well as to store and process the massive amount of data we generally have. Second, once a model is learned, its representation and algorithmic solution for inference need to be efficient as well. In certain applications, the efficiency of the learning or inference algorithm, namely its space and time complexity, may be as important as its predictive accuracy.

The aim of machine learning is to develop techniques that allow computers to “learn.” Briefly, machine learning is a method in which computer programs analyze data sets, and it can be more effective than relying on the intuition of engineers. The system often generates knowledge from the data set, and this knowledge is often represented in the form of decision trees [2], which are among the most popular choices for learning and reasoning from feature-based examples.

Machine learning finds the common properties among the set of examples in the database and classifies them into different classes. It proceeds in two stages, according to the model shown in Fig. 1.1. In the first stage, we analyze the training data set with the algorithm and obtain knowledge in the form of decision rules or mathematical formulae. In the second stage, we use testing data to estimate the accuracy of the decision rules generated previously. If the testing accuracy is considered acceptable, the decision rules or mathematical formulae can be built into a rule-base, which can then be used to classify testing data or new data examples whose categories are not known.

[Figure: flowchart with the labels Data, Training Data, Testing Data, Training, Training algorithm, Rule-base, Classification & Accuracy, and New Data.]

Fig. 1.1. Machine learning process.
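The two stages described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy data, the single symbolic attribute, and the helper names `train_rule_base`, `classify`, and `accuracy` are hypothetical and do not come from the thesis. The "training algorithm" here is a simple majority-class rule per attribute value, not ID3.

```python
from collections import Counter, defaultdict

def train_rule_base(training_data):
    """Stage 1: derive one decision rule per attribute value (majority class)."""
    counts = defaultdict(Counter)
    for value, label in training_data:
        counts[value][label] += 1
    # Each rule maps an attribute value to the class seen most often with it.
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def classify(rule_base, value, default=None):
    """Apply the rule-base to one example; unseen values get the default."""
    return rule_base.get(value, default)

def accuracy(rule_base, testing_data):
    """Stage 2: fraction of testing examples the rule-base classifies correctly."""
    correct = sum(1 for value, label in testing_data
                  if classify(rule_base, value) == label)
    return correct / len(testing_data)

# Hypothetical toy data: (colour, class) pairs.
training_data = [("black", "yes"), ("black", "yes"), ("white", "no"),
                 ("white", "no"), ("black", "no")]
testing_data = [("black", "yes"), ("white", "no"),
                ("white", "yes"), ("black", "yes")]

rule_base = train_rule_base(training_data)
test_accuracy = accuracy(rule_base, testing_data)
print(rule_base, test_accuracy)
```

If the estimated accuracy is judged acceptable, the same `classify` call can be applied to new examples whose classes are unknown.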

Machine learning algorithms can be categorized in several ways. In general, they are divided into supervised and unsupervised algorithms [3]. A supervised learning algorithm is told to which class each training example belongs. In cases where there is no a priori knowledge of the classes, supervised learning can still be applied if the data has a natural cluster structure; a clustering algorithm [3] then has to be run first to reveal these natural groupings. In unsupervised learning, the system learns the classes on its own. This type of learning performs the classification by searching for common properties existing among the data.

There are many ways to acquire knowledge automatically. Decision tree induction [2] has been widely used in extracting knowledge from feature-based examples for classification. A decision-tree-based classification method is a supervised learning method that constructs decision trees from a set of examples. The quality of a tree depends on both its classification accuracy and its size. One of the most significant developments in this domain is the ID3 algorithm, which is a popular and efficient method of building a decision tree for classification from symbolic data without much computation.
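To illustrate how a tree can be grown from symbolic data with little computation, the sketch below shows the attribute-selection step as ID3 is commonly described: the attribute with the largest information gain (reduction in class entropy) is tested first. The example data and the function names `entropy`, `information_gain`, and `best_attribute` are assumptions made for illustration; the sketch covers only the choice of one attribute, not the full recursive tree construction.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Entropy reduction obtained by splitting the examples on one attribute."""
    labels = [label for _, label in examples]
    subsets = {}
    for features, label in examples:
        subsets.setdefault(features[attribute], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(s) / len(examples) * entropy(s)
                    for s in subsets.values())
    return entropy(labels) - remainder

def best_attribute(examples, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, a))

# Hypothetical symbolic examples: (features, class label).
examples = [
    ({"colour": "black", "size": "small"}, "yes"),
    ({"colour": "black", "size": "large"}, "yes"),
    ({"colour": "white", "size": "small"}, "no"),
    ({"colour": "white", "size": "large"}, "no"),
]

chosen = best_attribute(examples, ["colour", "size"])
print(chosen)
```

In this toy set, `colour` separates the classes perfectly while `size` carries no information, so `colour` would be placed at the root.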

ID3 stands for “Iterative Dichotomizer (version) 3” and is a decision tree induction algorithm developed by Quinlan [4]; later versions include C4.5 [5] and C5.0 [6]. In the ID3 approach, we make use of the labeled examples and determine how features might be examined in sequence until all the labeled examples have been classified correctly. However, the ID3 algorithm does not directly deal with continuous data. If the attributes of the training set have continuous values, the algorithm must be integrated with a discretization algorithm, as in CART [7] and C4.5, which transforms the values into several intervals. The resulting decision trees are not easy to understand, however, because we cannot know how the range of an attribute is divided into intervals; moreover, most knowledge associated with human thinking and perception involves imprecision and uncertainty. Accordingly, a fuzzy version of ID3 based on minimum fuzzy entropy was proposed. Investigations of fuzzy ID3 can be found in [8]–[13].
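The discretization of a continuous attribute can be sketched as follows. This is one common approach, shown under stated assumptions, and not necessarily the exact procedure used by CART or C4.5: candidate cut points are taken as midpoints between adjacent distinct values, and the cut with the highest information gain splits the range into two intervals. The data and the function name `best_threshold` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split of a continuous attribute."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    base = entropy(labels)
    best = (None, -1.0)
    # Candidate thresholds: midpoints between adjacent distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [label for v, label in pairs if v <= t]
        right = [label for v, label in pairs if v > t]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical continuous attribute values with their class labels.
values = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
labels = ["no", "no", "no", "yes", "yes", "yes"]

threshold, gain = best_threshold(values, labels)
print(threshold, gain)
```

The boundary produced this way is crisp: two values lying just below and just above the threshold are assigned to different intervals, however close they are to each other.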
