The proposed decision tree method - Mining Semantically Interpretable Knowledge and Predicting the Binding of Protein Residues to DNA

A decision tree [7] is a popular machine learning method for classifying a discrete dependent variable that takes its values from a finite set. A basic decision tree example and the basic algorithm [23] are given in Figures 3.1 and 3.2, respectively. Decision tree learning is a method for approximating discrete-valued target functions that is robust to noisy data and capable of learning disjunctive expressions [5]. Learned trees can also be re-represented as sets of if-then rules to improve human readability.

A decision tree is constructed by looking for regularities in data. At each node, the entropy resulting from each candidate feature is calculated, and the feature that yields the minimum entropy is selected.

Given a collection S, if the target attribute can take on c different values, then the entropy of S relative to this c-wise classification is defined as

$$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i \tag{3.1}$$

where p_i is the proportion of S belonging to class i, as decided by the rule. The distribution of features across the levels of the tree is important and readable information, because it lets us analyze which features are more significant than others [5, 7]. In our work, C5.0 [24], an updated version of the C4.5 algorithm [25], is applied in the proposed decision tree method.
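The entropy of a collection, as defined earlier in this section, can be sketched in a few lines of Python. This is only an illustration: the function name `entropy` and the toy labels "Y" and "N" are assumptions made for the example and are not taken from the C5.0 implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum of -p_i * log2(p_i) over the classes that actually occur.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical toy collection: 9 examples of class "Y" and 5 of class "N".
sample = ["Y"] * 9 + ["N"] * 5
print(round(entropy(sample), 3))
```

A collection whose members all share one class has entropy 0, and a two-class collection split evenly has entropy 1 bit.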

Figure 3.1. An example of the decision tree method: a sample is input to the system. If its Feature 2 value belongs to Group 2-1, it proceeds to the Feature 1 node. If its Feature 1 value then falls in Group 1-2, the decision tree classifies it as a member of Class Y. The other classification pathways follow the same pattern.

Figure 3.2. The basic decision tree algorithm

[Figure 3.1 diagram labels: nodes Feature 1, Feature 2, and Feature 3; branches Group 1-1, Group 1-2, Group 2-1, Group 2-2, Group 2-3, Group 3-1, and Group 3-2; leaf classes Y and N.]

3.1.1 The Parameters Setting

Over-fitting is a significant practical difficulty for most machine learning methods. Figure 3.3 [5] illustrates the impact of over-fitting in a typical application of decision tree learning. There are two major approaches to avoiding over-fitting in decision trees (DT).

These approaches are to stop growing the tree earlier and to post-prune it [5]. Pruning a decision node consists of removing the sub-tree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node [5]. Methods for determining the correct final tree size have been reported in many studies [26-29]. In this research, we use the post-pruning method [25]. The effect of the pruning parameter, the confidence factor (cf), on the error rate is evaluated in our experiments later.

Figure 3.3. Over-fitting in decision tree learning: as the DT system adds new nodes to grow the decision tree, the accuracy of the tree measured over the training examples increases monotonically. However, when measured over a set of test examples independent of the training examples, accuracy first increases and then decreases.

Because the samples are unevenly distributed between the classes, a penalty is applied so that accuracy on the binding examples is not sacrificed. This parameter setting amounts to increasing the influence of the binding class on the classification results, and it would enhance NP performance.

Furthermore, our method uses the idea of the adaptive boosting algorithm [24, 30] to create several decision trees. Boosting is a technique for generating and combining multiple classifiers to improve predictive accuracy [24]. When a new case is to be classified, each decision tree votes for its predicted class, and the votes are counted to determine the final class. In our research, predicting unknown data with several decision trees generally gives better accuracy than using only one.
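The voting step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function `boosted_vote`, the three stand-in trees, and the feature keys are hypothetical, and every vote is counted equally as in the description, whereas adaptive boosting typically weights each classifier's vote.

```python
from collections import Counter

def boosted_vote(trees, case):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(case) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained decision trees: each maps a case to "Y" or "N".
trees = [
    lambda case: "Y" if case["feature_1"] == "Group 1-2" else "N",
    lambda case: "Y" if case["feature_2"] == "Group 2-1" else "N",
    lambda case: "N",
]

case = {"feature_1": "Group 1-2", "feature_2": "Group 2-1"}
print(boosted_vote(trees, case))
```

Here two of the three stand-in trees predict "Y", so the combined prediction is "Y" even though one tree disagrees.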

3.1.2 The judgment for the attribute of the features

Selecting the features that contribute most to classifying examples is critical. What is a good quantitative measure of the worth of an attribute? We define a statistical property that measures how well a given attribute separates the training examples according to their target classification [5]. Choosing the best feature when building trees generally leads to simple decisions at the nodes [6]. A variety of attribute selection measures have been proposed in past research [31-33]. In this thesis, we refer to three kinds of measures for judging an attribute. The material on measuring attributes is drawn from [5].

The first judgment function used is Information Gain. This measure is simply the expected reduction in entropy caused by partitioning the examples according to the attribute.

It is defined as

$$\mathrm{Information\ Gain}(S,A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \tag{3.2}$$

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v. The first term in Equation (3.2) is just the entropy of the original collection S, and the second term is the expected value of the entropy after S is partitioned using attribute A. The expected entropy described by this second term is the sum of the entropies of each subset S_v, weighted by the fraction of examples |S_v|/|S| that belong to S_v.
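As a rough illustration of Information Gain, the sketch below partitions a toy collection by one attribute and subtracts the weighted entropy of the subsets from the entropy of the whole. The function names, the group labels, and the toy data are assumptions made for this example only.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(values, labels):
    """Expected reduction in entropy from partitioning labels by attribute values."""
    subsets = defaultdict(list)
    for v, y in zip(values, labels):
        subsets[v].append(y)  # S_v: labels of the examples whose attribute value is v
    total = len(labels)
    expected = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - expected

# Hypothetical toy data: 14 examples, one attribute with two groups.
values = ["Group 1-1"] * 8 + ["Group 1-2"] * 6
labels = ["Y"] * 6 + ["N"] * 2 + ["Y"] * 3 + ["N"] * 3
print(round(information_gain(values, labels), 3))
```

In this toy case the attribute separates the classes only weakly, so the gain is small compared with the entropy of the whole collection (about 0.94 bits).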

However, the Information Gain measure in Equation (3.2) favors attributes with many values. An attribute with very many possible values is bound to separate the training examples into very small subsets. Because of this, it will have a very high information gain relative to the training examples, despite being a very poor predictor of the target function over unseen instances. Therefore, one alternative measure that has been used successfully is the gain ratio [7]. The gain ratio measure penalizes such attributes by incorporating a term called Potential Information:

$$\mathrm{Potential\ Information}(S,A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \tag{3.3}$$

where S₁ through S_c are the c subsets of examples resulting from partitioning S by the c-valued attribute A. The Gain Ratio measure is defined in terms of the earlier Information Gain measure and this Potential Information as follows:

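$$\mathrm{Gain\ Ratio}(S,A) = \frac{\mathrm{Information\ Gain}(S,A)}{\mathrm{Potential\ Information}(S,A)} \tag{3.4}$$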
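The effect of the Potential Information term can be seen in a small sketch. The code below is an illustration under assumptions: the function names and the toy attributes (`group` and `sample_id`) are hypothetical, and returning 0 for a single-valued attribute is a choice made here to avoid division by zero.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """Entropy (in bits) of the distribution of values in items."""
    total = len(items)
    counts = Counter(items)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(values, labels):
    """Expected reduction in entropy from partitioning labels by attribute values."""
    subsets = defaultdict(list)
    for v, y in zip(values, labels):
        subsets[v].append(y)
    expected = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - expected

def gain_ratio(values, labels):
    """Information Gain divided by Potential Information."""
    # Potential Information is the entropy of S with respect to the values of A.
    potential = entropy(values)
    if potential == 0:  # single-valued attribute: no split, so report 0 (assumption)
        return 0.0
    return information_gain(values, labels) / potential

# Hypothetical toy data: both attributes separate the classes perfectly,
# but sample_id does so only by giving every example its own subset.
labels = ["Y", "Y", "N", "N"]
group = ["a", "a", "b", "b"]
sample_id = [1, 2, 3, 4]
print(gain_ratio(group, labels), gain_ratio(sample_id, labels))
```

Both attributes have the same Information Gain, but the many-valued `sample_id` attribute has a larger Potential Information and therefore a lower Gain Ratio, which is the penalty the measure is designed to apply.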

In this thesis, the features adopted from the dataset are ranked chiefly by Gain Ratio for prediction. The attribute chosen at each split is the one with the greatest ability to discriminate between the classes.

3.1.3 Training and Test

The results reported in this thesis are mainly from three-fold cross validation (3-CV). That is, the data are divided into three approximately equal parts; in turn, two parts serve as training data and the remaining part as test data. During training, the nodes at each level of the decision tree are established step by step. The test data are then input to the system to obtain the first result. The process is repeated so that each of the three parts takes a turn as the test data. Because three-fold cross validation is used, the final results in this thesis are the averages of the three test results.
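The three-fold procedure can be sketched as below. This is an illustration, not the procedure used with C5.0: the interleaved split is one assumed way to obtain three approximately equal parts, and `majority_baseline` is a hypothetical stand-in for training and testing a decision tree.

```python
from collections import Counter

def split_folds(data, k=3):
    """Split data into k approximately equal parts (interleaved assignment)."""
    return [data[i::k] for i in range(k)]

def cross_validate(data, train_and_test, k=3):
    """Average the test result over k turns, each part serving once as test data."""
    folds = split_folds(data, k)
    results = []
    for i, test_part in enumerate(folds):
        # The remaining k - 1 parts form the training data for this turn.
        train_part = [x for j, fold in enumerate(folds) if j != i for x in fold]
        results.append(train_and_test(train_part, test_part))
    return sum(results) / k

def majority_baseline(train, test):
    """Hypothetical stand-in for a tree: predict the most common training label."""
    guess = Counter(y for _, y in train).most_common(1)[0][0]
    return sum(1 for _, y in test if y == guess) / len(test)

# Hypothetical toy data: (example id, class label) pairs, six "N" and three "Y".
data = [(i, "N") for i in range(6)] + [(i, "Y") for i in range(6, 9)]
print(cross_validate(data, majority_baseline))
```

Replacing `majority_baseline` with a function that trains a tree on `train` and returns its accuracy on `test` would give the averaged 3-CV accuracy described above.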
