Decision Tree / Decision Tree Heuristics in C&RT
Regularization by Pruning

fully-grown tree: E_in(G) = 0 if all x_n are different, but overfit (large E_out)
because low-level trees are built with small D_c

• need a regularizer, say, Ω(G) = NumberOfLeaves(G)
• want regularized decision tree:
      argmin over all possible G of E_in(G) + λ Ω(G)
  called the pruned decision tree
• cannot enumerate all possible G computationally: often consider only
  • G^(0) = fully-grown tree
  • G^(i) = argmin_G E_in(G) such that G is one leaf removed from G^(i−1)

systematic choice of λ? (typically by validation)
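The pruning objective above can be sketched in a few lines. A minimal sketch, assuming the candidate sequence G^(0), G^(1), ... has already been generated and all we keep for each candidate is its in-sample error and leaf count (the function names and toy numbers are mine, for illustration only):

```python
# E_in(G) + lambda * Omega(G), with Omega(G) = NumberOfLeaves(G)
def regularized_score(e_in, num_leaves, lam):
    return e_in + lam * num_leaves

# candidates: list of (E_in, NumberOfLeaves) pairs for G^(0), G^(1), ...
# returns the index i of the tree minimizing the regularized objective
def pick_pruned_tree(candidates, lam):
    return min(range(len(candidates)),
               key=lambda i: regularized_score(*candidates[i], lam))

# G^(0) is the fully-grown tree (E_in = 0, most leaves); each later G^(i)
# removes one leaf, so E_in creeps up while Omega shrinks (toy numbers)
candidates = [(0.00, 8), (0.01, 7), (0.03, 6), (0.08, 5)]
pick_pruned_tree(candidates, lam=0.0)    # no regularization: fully-grown G^(0)
pick_pruned_tree(candidates, lam=0.03)   # larger lambda prefers fewer leaves
```

Sweeping λ from 0 upward walks the chosen tree from the fully-grown G^(0) toward heavily pruned candidates, which is exactly the trade-off the regularizer is meant to control.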
Decision Tree / Decision Tree Heuristics in C&RT
Branching on Categorical Features

numerical features
  blood pressure: 130, 98, 115, 147, 120
  branching for numerical: decision stump
      b(x) = ⟦x_i ≤ θ⟧ + 1 with θ ∈ ℝ

categorical features
  major symptom: fever, pain, tired, sweaty
  branching for categorical: decision subset
      b(x) = ⟦x_i ∈ S⟧ + 1 with S ⊂ {1, 2, . . . , K}

C&RT (& general decision trees): handles categorical features easily

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 15/22
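Both branching rules fit in one line each. A minimal sketch (the function names are mine, not from the slide), returning 1 or 2 as the sub-tree index, with the Iverson bracket ⟦·⟧ contributing 1 when its condition holds:

```python
# numerical: b(x) = [[ x_i <= theta ]] + 1, theta a real threshold
def stump_branch(x, i, theta):
    return (1 if x[i] <= theta else 0) + 1

# categorical: b(x) = [[ x_i in S ]] + 1, S a subset of the K categories
def subset_branch(x, i, S):
    return (1 if x[i] in S else 0) + 1

x = (115, "fever")                      # (blood pressure, major symptom)
stump_branch(x, 0, 120)                 # 115 <= 120: second sub-tree
subset_branch(x, 1, {"fever", "pain"})  # "fever" in S: second sub-tree
```

The only difference between the two is the test inside the bracket: a threshold comparison for numerical features versus a subset-membership test for categorical ones.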
Decision Tree / Decision Tree Heuristics in C&RT
Missing Features by Surrogate Branch

possible branch: b(x) = ⟦weight ≤ 50 kg⟧

if weight is missing during prediction, what would a human do?
• go get the weight
• or, use a threshold on height instead, because
  threshold on height ≈ threshold on weight

surrogate branch:
• maintain surrogate branches b_1(x), b_2(x), . . . ≈ best branch b(x) during training
• allow a missing feature for b(x) during prediction by using a surrogate instead

C&RT: handles missing features easily

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 16/22
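The prediction-time fallback can be sketched as below. The thresholds and feature names are hypothetical; real C&RT keeps several surrogates, ranked during training by how well each agrees with the best branch on the training data:

```python
def best_branch(x):
    # b(x) = [[ weight <= 50 kg ]] + 1
    return (1 if x["weight"] <= 50 else 0) + 1

def surrogate_branch(x):
    # a height threshold that mimics the weight threshold (hypothetical value)
    return (1 if x["height"] <= 160 else 0) + 1

def branch_with_surrogate(x):
    # use the best branch when its feature is present, else fall back
    if x.get("weight") is not None:
        return best_branch(x)
    return surrogate_branch(x)

branch_with_surrogate({"weight": 45, "height": 150})   # weight present
branch_with_surrogate({"weight": None, "height": 170}) # surrogate used
```

Because the surrogate was chosen to agree with the best branch on the training data, the example with a missing weight is usually routed to the same sub-tree it would have reached with the weight available.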
Decision Tree / Decision Tree Heuristics in C&RT
Fun Time

For the categorical branching criterion b(x) = ⟦x_i ∈ S⟧ + 1 with S = {1, 6},
which of the following explains the criterion?
1  if the i-th feature is of type 1 or type 6, branch to the first sub-tree; else branch to the second sub-tree
2  if the i-th feature is of type 1 or type 6, branch to the second sub-tree; else branch to the first sub-tree
3  if the i-th feature is of type 1 and type 6, branch to the second sub-tree; else branch to the first sub-tree
4  if the i-th feature is of type 1 and type 6, branch to the first sub-tree; else branch to the second sub-tree

Reference Answer: 2
Note that ‘∈ S’ is an ‘or’-style condition on the elements of S in human language.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 17/22
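The reference answer can also be checked mechanically with the subset-branch rule from the previous slide (a quick sketch):

```python
def subset_branch(x_i, S):
    # b(x) = [[ x_i in S ]] + 1: the bracket is 1 when x_i is in S,
    # so membership sends the point to sub-tree 2
    return (1 if x_i in S else 0) + 1

S = {1, 6}
[subset_branch(t, S) for t in (1, 6, 3)]  # types 1 and 6 go to sub-tree 2
```

Types 1 and 6 both land in the second sub-tree and every other type lands in the first, which is exactly choice 2.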
Decision Tree / Decision Tree in Action
A Simple Data Set

[figure: decision boundaries on a simple data set, two panels: C&RT vs. AdaBoost-Stump]

C&RT: ‘divide-and-conquer’

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/22
Decision Tree / Decision Tree in Action
A Complicated Data Set

[figure: decision boundaries on a complicated data set, two panels: C&RT vs. AdaBoost-Stump]

C&RT: even more efficient than AdaBoost-Stump

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 19/22
Decision Tree / Decision Tree in Action
Practical Specialties of C&RT

• human-explainable
• handles multiclass easily
• handles categorical features easily
• handles missing features easily
• efficient non-linear training (and testing)

almost no other learning model shares all such specialties, except for other decision trees

another popular decision tree algorithm: C4.5, with different choices of heuristics

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 20/22