Learning of rules and trees aims at inducing classifying expressions in the form of decision rules and trees from example cases. These are among the oldest approaches to machine learning and were also part of one of the oldest applications of machine learning in information extraction. Each decision rule is associated with a particular class, and a rule that is satisfied, i.e., evaluated as true, is an indication of its class. Thus, classifying new cases involves applying the learned classifying expressions and assigning the case to the corresponding class upon positive evaluation.
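As a minimal sketch of how learned rules are applied to a new case, the following Python fragment represents each rule as a conjunction of feature tests paired with a class. The representation, the feature names (`capitalized`, `prev_word`, `next_word`) and the two rules are hypothetical and chosen only for illustration; returning the class of the first satisfied rule is one possible conflict-resolution policy, not the only one.

```python
from typing import Dict, List, Optional, Tuple

# A rule is a conjunction of (feature -> required value) tests plus the class it indicates.
Rule = Tuple[Dict[str, str], str]


def rule_fires(conditions: Dict[str, str], case: Dict[str, str]) -> bool:
    """A rule is satisfied when every one of its feature tests holds for the case."""
    return all(case.get(feature) == value for feature, value in conditions.items())


def classify(rules: List[Rule], case: Dict[str, str]) -> Optional[str]:
    """Return the class of the first rule that evaluates to true, or None if none does."""
    for conditions, label in rules:
        if rule_fires(conditions, case):
            return label
    return None


# Hypothetical rules for labeling a token (illustration only).
RULES: List[Rule] = [
    ({"capitalized": "yes", "prev_word": "Mr."}, "PERSON"),
    ({"capitalized": "yes", "next_word": "Inc."}, "ORGANIZATION"),
]

token = {"capitalized": "yes", "prev_word": "Mr.", "next_word": "said"}
print(classify(RULES, token))  # PERSON
```

A case that satisfies no rule is left unclassified (`None`) in this sketch; a practical system would typically fall back to a default class.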
The rules are found by searching for those combinations of features or of feature relations that are discriminative for each class. Given a set of positive examples and a set of negative examples (if available) of a class, the training algorithms generate a rule that covers all (or most) of the positive examples and none (or the fewest) of the negative examples. Once found, this rule is added to the rule set, and the cases that satisfy it are removed from further consideration. The process is repeated until no more example cases remain to be covered.
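The covering loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not a specific published algorithm: the helper `learn_one_rule` grows a rule greedily by adding one feature test at a time, scored by the fraction of covered examples that are positive, and the toy examples `POS` and `NEG` are hypothetical.

```python
from typing import Dict, List

Case = Dict[str, str]


def covers(rule: Dict[str, str], case: Case) -> bool:
    """True when the case satisfies every feature test of the rule."""
    return all(case.get(f) == v for f, v in rule.items())


def learn_one_rule(pos: List[Case], neg: List[Case]) -> Dict[str, str]:
    """Grow one rule greedily: add feature tests until no negative example is covered."""
    rule: Dict[str, str] = {}
    pos_cov, neg_cov = list(pos), list(neg)
    while neg_cov:
        # Candidate tests are taken from the positive examples the rule still covers.
        candidates = {(f, v) for c in pos_cov for f, v in c.items() if f not in rule}
        if not candidates:
            break  # cannot specialize further; accept a rule that covers some negatives

        def score(test):
            f, v = test
            p = sum(1 for c in pos_cov if c.get(f) == v)
            n = sum(1 for c in neg_cov if c.get(f) == v)
            return (p / (p + n), p)  # precision first, then positive coverage

        f, v = max(sorted(candidates), key=score)
        rule[f] = v
        pos_cov = [c for c in pos_cov if c.get(f) == v]
        neg_cov = [c for c in neg_cov if c.get(f) == v]
    return rule


def sequential_covering(pos: List[Case], neg: List[Case]) -> List[Dict[str, str]]:
    """Learn rules one at a time, removing the positive cases each new rule covers."""
    rules: List[Dict[str, str]] = []
    remaining = list(pos)
    while remaining:
        rule = learn_one_rule(remaining, neg)
        rules.append(rule)
        remaining = [c for c in remaining if not covers(rule, c)]
    return rules


# Hypothetical toy data: tokens that are (POS) or are not (NEG) person names.
POS = [{"capitalized": "yes", "prev": "Mr."}, {"capitalized": "yes", "prev": "Dr."}]
NEG = [{"capitalized": "yes", "prev": "the"}, {"capitalized": "no", "prev": "Mr."}]

for learned in sequential_covering(POS, NEG):
    print(learned)
```

Each learned rule always covers at least one remaining positive example, so the outer loop terminates. The scoring function is a design choice of this sketch; other measures of rule quality could be substituted.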
The paradigm of searching possible hypotheses also applies to tree and rule learning. There are two major ways of accessing this search space (Mitchell, 1977). General-to-specific methods search the space from the most general towards the most specific hypothesis. One starts from the most general rule possible (often an empty clause), which is specialized whenever it covers a negative example. The principle is to add features to the rule. Specific-to-general methods search the hypothesis space from the most specific towards the most general hypothesis and progressively generalize examples. One starts with a positive example, which forms the initial rule for the definition of the concept to be learned. This rule is generalized whenever another positive example is encountered that it does not cover. The principle is to drop features.
The combination of the general-to-specific and the specific-to-general methods is the so-called version spaces method, which starts from two hypotheses (Mitchell, 1977). Negative examples specialize the most general hypothesis. Positive examples generalize the most specific hypothesis. The version spaces model suffers from practical and computational limitations. Testing all possible hypotheses is usually impossible given the number of feature combinations.
The most widely used method is tree learning. The vectors of the training examples induce classification expressions in the form of a decision tree. A decision tree can be translated into if-then rules to improve the readability of the learned expressions. A decision tree consists of nodes and branches. Each node, except for terminal nodes or leaves, represents a test or decision and branches into subtrees for each possible outcome of the test. The tree can be used to classify an object by starting at the root of the tree and moving through it until a leaf (class of the object) is encountered.
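The use of a decision tree for classification, and its translation into if-then rules, can be illustrated with a small sketch. The nested-tuple encoding of the tree, the feature names and the class labels are assumptions made for this example only.

```python
from typing import Dict, List, Tuple, Union

# A leaf is a class label (str); an internal node is (feature, {outcome: subtree}).
Tree = Union[str, Tuple[str, Dict[str, "Tree"]]]


def classify(tree: Tree, case: Dict[str, str]) -> str:
    """Start at the root and follow the branch for each test until a leaf is reached."""
    while not isinstance(tree, str):
        feature, branches = tree
        tree = branches[case[feature]]  # raises KeyError for an outcome unseen in the tree
    return tree


def to_rules(tree: Tree, path: Tuple[Tuple[str, str], ...] = ()) -> List[Tuple[List[Tuple[str, str]], str]]:
    """Translate every root-to-leaf path into one (conditions, class) if-then rule."""
    if isinstance(tree, str):
        return [(list(path), tree)]
    feature, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(to_rules(subtree, path + ((feature, value),)))
    return rules


# Hypothetical hand-built tree (illustration only).
TREE: Tree = (
    "capitalized",
    {
        "no": "OTHER",
        "yes": ("prev_word", {"Mr.": "PERSON", "the": "OTHER"}),
    },
)

print(classify(TREE, {"capitalized": "yes", "prev_word": "Mr."}))  # PERSON
for conditions, label in to_rules(TREE):
    tests = " AND ".join(f"{f} = {v}" for f, v in conditions)
    print(f"IF {tests} THEN {label}")
```

Each leaf yields exactly one rule, so the rule set is an equivalent but more readable form of the tree.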
Basic algorithms (e.g., C4.5 of Quinlan, 1993) construct the trees in a top-down, greedy way by selecting the most discriminative feature and using it as a test at the root node of the tree. A descendant node is then created for each possible value of this feature, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example's value for this feature). The entire process is then repeated using the training examples associated with each descendant node to select the best feature to test at that point in the tree. This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices. In this way not all the hypotheses of the search space are tested. Additional mechanisms can be incorporated. For instance, by searching for a rule or tree that covers most of the positive examples and removing these examples from further training, the search space is divided into subspaces, for each of which a covering rule is sought. Other ways of reducing the search space are preferring simple rules over complex ones, and branching and bounding the search space, whereby the method does not consider a set of hypotheses if some criterion allows assuming that they are inferior to the current best hypothesis.
The selection of the most discriminative feature at each node, except for a leaf node, is often done by selecting the one with the largest information gain, i.e., the feature that causes the largest reduction in entropy when the training examples are classified according to the outcome of the test at the node. As seen above, entropy is a measure of uncertainty.
More specifically, given a collection S of training examples, if the classification can take on k different values, then the entropy of S relative to the k classifications is defined as:
\[
\mathrm{Entropy}(S) \equiv \sum_{i=1}^{k} -p_i \log_2 p_i \tag{5.59}
\]
where $p_i$ is the proportion of S belonging to class i. The information gain of a feature f is the expected reduction in entropy caused by partitioning the examples according to this feature.
\[
\mathrm{Gain}(S, f) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(f)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v) \tag{5.60}
\]
where Values(f) is the set of all possible values of feature f and $S_v$ is the subset of S for which feature f has value v.
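To make the entropy and information gain definitions and the greedy top-down construction concrete, here is a minimal Python sketch. It selects features by plain information gain and omits the refinements of C4.5 (such as the gain ratio criterion and pruning), so it should be read as an illustration of the principle rather than as that algorithm. The data set `DATA` and its feature names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, str], str]  # (feature values, class label)


def entropy(examples: List[Example]) -> float:
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    total = len(examples)
    counts = Counter(label for _, label in examples)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def gain(examples: List[Example], feature: str) -> float:
    """Gain(S, f) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
    total = len(examples)
    subsets: Dict[str, List[Example]] = {}
    for ex in examples:
        subsets.setdefault(ex[0][feature], []).append(ex)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(examples) - remainder


def build_tree(examples: List[Example], features: List[str]):
    """Greedy top-down induction: pick the best feature, split, recurse, never backtrack."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf: (majority) class
    best = max(features, key=lambda f: gain(examples, f))
    rest = [f for f in features if f != best]
    branches = {}
    for value in sorted({x[best] for x, _ in examples}):
        subset = [ex for ex in examples if ex[0][best] == value]
        branches[value] = build_tree(subset, rest)
    return (best, branches)


# Hypothetical toy training set (illustration only).
DATA: List[Example] = [
    ({"capitalized": "yes", "prev_word": "Mr."}, "PERSON"),
    ({"capitalized": "yes", "prev_word": "Dr."}, "PERSON"),
    ({"capitalized": "yes", "prev_word": "the"}, "OTHER"),
    ({"capitalized": "no", "prev_word": "the"}, "OTHER"),
]

print(entropy(DATA))
print({f: gain(DATA, f) for f in ["capitalized", "prev_word"]})
print(build_tree(DATA, ["capitalized", "prev_word"]))
```

On this toy set the feature `prev_word` separates the classes perfectly and is therefore chosen at the root. Note that plain information gain tends to favor features with many distinct values, which is one reason practical systems use corrected criteria.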
Rule and tree learning algorithms were the first algorithms used in information extraction, and they are still popular learning techniques for this task. The factors behind their popularity are their expressive power, which makes them compatible with human-engineered knowledge rules, and their easy interaction with other knowledge resources. Because of their greedy nature, the algorithms usually perform better when the feature set is limited. Information extraction tasks can sometimes naturally make use of a limited set of features that exhibit some dependencies between them (e.g., in coreference resolution).
Induction of rules and trees was a very popular information extraction technique in the 1990s. It has been applied among others to information be a popular and successful technique in coreference resolution (McCarthy