Chapter 2 Literature Review

2.2 Decision Trees

Tree-based classification algorithms are used in many different ways. Wu, M. C., et al. (2006) used C4.5 to enhance the filter rule, and performance improved substantially with the help of the decision tree. Lin, S., et al. (2003) point out that in many application domains, experts can perform a task but cannot explain how they do it.

Sometimes machine-derived classifiers work better than expert-derived classifiers.

Among its many uses, the most notable feature of the decision tree is its expressive power. In areas such as text categorization, although SVM and k-nearest neighbor (kNN) outperform neural networks and Naive Bayes classifiers, rule-based algorithms such as decision trees are still used because of this property (Yang, Y., & Liu, X. 1999).

In this study, we focus mainly on SVM and decision trees, both because they suit our needs and because of the algorithms' characteristics.

2.2 Decision Trees

The decision tree is one of the most popular classification algorithms. A tree consists of leaves and decision nodes: a leaf indicates a class, while a decision node specifies a test on a single attribute value that determines which branch a case follows (Quinlan, J. R., 2014). A decision tree generates its initial tree with a divide-and-conquer algorithm (Wu, X., et al., 2008), which follows two rules:

• If all the cases in set A belong to the same class, or A cannot be partitioned further by any attribute value, make A a leaf and label it with the most frequent class in A.

• Otherwise, select an attribute that divides set A into two or more subsets, make this attribute a decision node, and apply the rules recursively to the subsets until no further partition is possible.
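The two rules above can be sketched in Python as follows. This is a minimal illustration rather than C4.5 itself: the names `build_tree` and `classify`, the dictionary-based tree layout, and the choice of simply taking the first attribute that splits the set are all assumptions made for the example. How to choose the attribute well is a separate question.

```python
from collections import Counter


def build_tree(cases, attributes):
    """Build a tree from cases given as (features_dict, label) pairs.

    `attributes` lists the attribute names that may still be tested.
    """
    labels = [label for _, label in cases]
    majority = Counter(labels).most_common(1)[0][0]

    # Rule 1 (first half): every case belongs to the same class.
    if len(set(labels)) == 1:
        return {"leaf": majority}

    # Rule 2: take the first attribute that yields two or more subsets.
    # (Illustrative choice only; a real algorithm scores the candidates.)
    for attr in attributes:
        groups = {}
        for features, label in cases:
            groups.setdefault(features[attr], []).append((features, label))
        if len(groups) >= 2:
            remaining = [a for a in attributes if a != attr]
            return {
                "attribute": attr,
                "branches": {
                    value: build_tree(subset, remaining)
                    for value, subset in groups.items()
                },
            }

    # Rule 1 (second half): no attribute partitions the set any further,
    # so label the leaf with the most frequent class.
    return {"leaf": majority}


def classify(tree, features):
    """Follow decision nodes until a leaf is reached.

    Raises KeyError for an attribute value never seen during building.
    """
    while "leaf" not in tree:
        tree = tree["branches"][features[tree["attribute"]]]
    return tree["leaf"]
```

For instance, calling `build_tree` on a handful of labeled cases and then `classify` on one of them returns that case's label whenever the attributes are sufficient to separate the classes.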

However, this procedure can produce many different trees, because each node offers many possible choices when there are many attribute values. We therefore need to consider how to find the best tree.

The general approach is to choose, at each node, the attribute that best divides the samples into classes. Several selection measures exist for this step (Mingers, J., 1989):

• Gini impurity, used by the CART algorithm.

• Information gain, used by the ID3, C4.5 and C5.0 algorithms (C4.5 normalizes it into the gain ratio).

• Chi-square, used by CHAID.

• Others.
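As a brief illustration of one of these measures, Gini impurity can be sketched as below. The formula used, one minus the sum of squared class probabilities, is the standard definition and is not stated in this chapter; the function name `gini_impurity` is an assumption made for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class probabilities."""
    total = len(labels)
    probabilities = [count / total for count in Counter(labels).values()]
    return 1.0 - sum(p * p for p in probabilities)


# A pure node has impurity 0; an evenly mixed two-class node has impurity 0.5.
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5
```

A lower value means the node is closer to containing a single class, so a split is preferred when it lowers the weighted impurity of the resulting subsets.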

Decision trees have been developed over many years, and many variants of the basic algorithm exist. For example, the random forest algorithm combines several trees to improve performance, and some decision tree algorithms apply a voting mechanism to the results of different trees. In this research, we use C4.5, which is based on information gain, because it is a long-established decision tree algorithm and is listed among the top 10 algorithms in data mining by Wu, X., et al. (2008).

Information gain is a selection measure proposed by Quinlan in 1978. It is based on a classic formula from Shannon's information theory. The following passages explain what information gain is and how it works during node selection (Quinlan, J. R., 2014).

According to Shannon, all information can be measured. The information conveyed by a message depends on its probability P and is measured in bits by the following formula:

$$\mathrm{Info} = -\log_2 P \ \text{bits}$$

For example, if there are four equally probable messages, the probability of each one is $1/4$, and the information conveyed by any one of them is $-\log_2(1/4) = 2$ bits.
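The same calculation can be checked with a short Python sketch. The function name `information_bits` is an assumption introduced for illustration only.

```python
from math import log2


def information_bits(probability):
    """Information, in bits, conveyed by a message with the given probability."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -log2(probability)


# Four equally probable messages: each has probability 1/4.
print(information_bits(1 / 4))  # 2.0
```

Less probable messages convey more information: halving the probability adds exactly one bit.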

Figure 2.2 Information gain: set S with n classes

Given a set $S$ that can be partitioned into $n$ classes (see Figure 2.2), let $|S|$ denote the total number of cases in $S$ and let $\mathrm{freq}(C_j, S)$ be the number of cases in $S$ that belong to class $C_j$. The probability $P_j$ that a random case belongs to class $C_j$ is:

$$P_j = \frac{\mathrm{freq}(C_j, S)}{|S|}$$

Then the information conveyed in set $S$ (also called the entropy of $S$) with $n$ classes is the sum, over all classes, of each class probability multiplied by its corresponding information in bits:

$$\mathrm{Info}(S) = \sum_{j=1}^{n} P_j \times \left(-\log_2 P_j\right)$$
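A minimal sketch of this entropy computation, assuming the set $S$ is represented simply as a list of class labels (the name `info` and this representation are choices made for the example):

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy Info(S), in bits, of a set S given as a list of class labels."""
    total = len(labels)
    entropy = 0.0
    for count in Counter(labels).values():  # count plays the role of freq(C_j, S)
        p_j = count / total                 # P_j = freq(C_j, S) / |S|
        entropy += p_j * -log2(p_j)
    return entropy


# Four classes with one case each: every P_j is 1/4, so Info(S) is 2 bits.
print(info(["a", "b", "c", "d"]))  # 2.0
```

The entropy is zero when all cases share one class and grows as the classes become more evenly mixed.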

Now let us introduce a criterion $X$, i.e. a decision node. $X$ divides set $S$ into $m$ subsets $S_1, \dots, S_m$ (see Figure 2.3).

Figure 2.3 Information gain: set S divided by criterion X

We define the information conveyed in set $S$ with criterion $X$ as:

$$\mathrm{Info}_X(S) = \sum_{i=1}^{m} \frac{|S_i|}{|S|} \times \mathrm{Info}(S_i)$$

For each subset $S_i$, we use the same definition to measure its entropy. The information gained by partitioning $S$ according to the test $X$ is then:

$$\mathrm{gain}(X) = \mathrm{Info}(S) - \mathrm{Info}_X(S)$$
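The gain of a candidate test can be sketched as follows, assuming each case is described by one attribute value and one class label held in parallel lists. The function names `info`, `info_x` and `gain`, and the toy data, are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def info(labels):
    """Entropy Info(S) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * -log2(c / total) for c in Counter(labels).values())


def info_x(values, labels):
    """Info_X(S): entropy of the subsets S_i produced by the attribute values,
    each weighted by |S_i| / |S|."""
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    total = len(labels)
    return sum(len(s) / total * info(s) for s in subsets.values())


def gain(values, labels):
    """gain(X) = Info(S) - Info_X(S)."""
    return info(labels) - info_x(values, labels)


# Toy data: one attribute separates the classes perfectly, the other not at all.
labels = ["yes", "yes", "no", "no"]
candidates = {
    "separating": ["a", "a", "b", "b"],
    "uninformative": ["a", "b", "a", "b"],
}
best = max(candidates, key=lambda name: gain(candidates[name], labels))
print(best)  # separating
```

Here the separating attribute has a gain of 1 bit and the uninformative one a gain of 0, so the former would be chosen for the decision node.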

The goal is to maximize $\mathrm{gain}(X)$, which is equivalent to minimizing $\mathrm{Info}_X(S)$ at each node, since $\mathrm{Info}(S)$ is fixed for a given node. In this way, we generate a decision tree that divides the samples into their proper classes as well as possible, so a new input case has a high probability of being assigned to its correct class. This is the essence of a decision tree based on information gain.

However, if there are many attribute values, the resulting tree may be very large. Moreover, since the tree is built from the sample data, it may well not generalize when new cases are introduced. Pruning was introduced to address this: some branches are cut according to certain rules during the tree-building process. This makes the decision tree comparatively more efficient and more general.

To sum up, a decision tree classifier generates rules that are easy to understand and explain, and it also achieves high accuracy. Its drawbacks are speed and memory consumption, especially when the data set is very large. Because of the way decision trees are constructed, the algorithm has to scan and sort the data each time a node is built, and it also needs to keep all the data in memory. If the data set is very large, an out-of-memory error may occur before the classification result is obtained.
