Machine Learning Techniques (機器學習技巧)
Lecture 9: Decision Tree
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)
Agenda
Lecture 9: Decision Tree
Decision Tree Hypothesis
Decision Tree Algorithm
Decision Tree in Practice
Decision Tree in Action
What We Have Done

blending: aggregate after getting g_t; learning: aggregate as well as getting g_t

aggregation type    blending            learning
uniform             voting/averaging    Bagging
non-uniform         linear              AdaBoost
conditional         stacking            Decision Tree

decision tree: a traditional learning model that realizes conditional aggregation
Decision Tree for Playing Golf

G(x) = \sum_{t=1}^{T} q_t(x) \cdot g_t(x)

• base hypothesis g_t(x): leaf at the end of path t, a constant here
• condition q_t(x): \llbracket x on path t \rrbracket
• usually with simple internal nodes

decision tree: arguably one of the most human-mimicking models
Recursive View of Decision Tree

Path View: G(x) = \sum_{t=1}^{T} \llbracket x on path t \rrbracket \cdot leaf_t(x)

Recursive View: G(x) = \sum_{c=1}^{C} \llbracket b(x) = c \rrbracket \cdot G_c(x)

• G(x): full-tree hypothesis
• b(x): branching criteria
• G_c(x): sub-tree hypothesis at the c-th branch

tree = (root, subtrees), just like what your data structure instructor would say :-)
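As a minimal sketch of this recursive view in Python (the tuple representation and names below are my own illustration, not the lecture's):

```python
# A tree is either a constant leaf, or a pair (branching function b, subtrees),
# mirroring "tree = (root, subtrees)". Exactly one [[b(x) = c]] holds per x,
# so prediction follows a single path down.

def predict(tree, x):
    if not isinstance(tree, tuple):       # base case: constant leaf hypothesis
        return tree
    b, subtrees = tree                    # recursive case: (root, subtrees)
    return predict(subtrees[b(x)], x)     # recurse into G_c with c = b(x)

# hypothetical toy tree: branch on whether x[0] <= 0.5
toy = (lambda x: 0 if x[0] <= 0.5 else 1, (-1.0, +1.0))
print(predict(toy, [0.3]))   # -1.0
print(predict(toy, [0.9]))   # +1.0
```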
Disclaimers about Decision Tree

Usefulness:
• human-explainable: widely used in business/medical data analysis
• simple: even freshmen can implement one :-)
• efficient in prediction and training

However...
• heuristic: mostly little theoretical explanation
• heuristic selection: the many heuristics can be confusing to beginners
• arguably no single representative algorithm

decision tree: mostly heuristic but useful on its own
A Basic Decision Tree Algorithm

G(x) = \sum_{c=1}^{C} \llbracket b(x) = c \rrbracket G_c(x)

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
    if termination criteria met:
        return base hypothesis g_t(x)
    else:
        1. learn branching criteria b(x)
        2. split D into C parts D_c = {(x_n, y_n) : b(x_n) = c}
        3. build sub-tree G_c ← DecisionTree(D_c)
        4. return G(x) = \sum_{c=1}^{C} \llbracket b(x) = c \rrbracket G_c(x)

four choices: number of branches, branching criteria, termination criteria, & base hypothesis
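A Python sketch of this skeleton with the four choices passed in as functions (parameter names are mine; a sensible terminate must eventually hold for the recursion to stop):

```python
# terminate(D): termination criteria; learn_leaf(D): base hypothesis g_t;
# learn_branch(D): branching criteria b(x), returning a value in {0, ..., C-1}.

def decision_tree(D, C, terminate, learn_leaf, learn_branch):
    if terminate(D):
        return ('leaf', learn_leaf(D))                  # return base hypothesis g_t(x)
    b = learn_branch(D)                                 # 1. learn branching criteria b(x)
    parts = [[(x, y) for (x, y) in D if b(x) == c]      # 2. split D into C parts D_c
             for c in range(C)]
    subtrees = [decision_tree(Dc, C, terminate, learn_leaf, learn_branch)
                for Dc in parts]                        # 3. G_c <- DecisionTree(D_c)
    return ('node', b, subtrees)                        # 4. G(x) = sum_c [[b(x)=c]] G_c(x)
```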
Classification and Regression Tree (C&RT)

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
    if termination criteria met:
        return base hypothesis g_t(x)
    else:
        ...
        2. split D into C parts D_c = {(x_n, y_n) : b(x_n) = c}
        ...

two simple choices:
• C = 2 (binary tree)
• g_t(x) = E_in-optimal constant (see the sketch below):
    • binary/multiclass classification (0/1 error): majority of {y_n}
    • regression (squared error): average of {y_n}

disclaimer: C&RT here is based on selected components of CART™ of California Statistical Software
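A sketch of the E_in-optimal constant leaf, assuming ys collects the labels reaching the leaf:

```python
from collections import Counter

def optimal_constant(ys, regression=False):
    if regression:
        return sum(ys) / len(ys)                 # squared error: average of {y_n}
    return Counter(ys).most_common(1)[0][0]      # 0/1 error: majority of {y_n}
```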
Branching in C&RT: Purifying

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
    if termination criteria met:
        return base hypothesis g_t(x) = E_in-optimal constant
    else:
        1. learn branching criteria b(x)
        2. split D into 2 parts D_c = {(x_n, y_n) : b(x_n) = c}
        ...

more simple choices:
• simple internal node for C = 2: {1, 2}-output decision stump
• 'easier' subtree: branch by purifying

b(x) = \mathop{argmin}_{decision stumps h(x)} \sum_{c=1}^{2} |D_c with h| \cdot impurity(D_c with h)

C&RT: bi-branching by purifying
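A sketch of this argmin over decision stumps, assuming numerical features and some impurity function on label lists (as defined on the next slide); the names are mine:

```python
def best_stump(X, y, impurity):
    """Return (weighted impurity, feature index i, threshold theta) of the best
    stump b(x) = [[x_i <= theta]] + 1, minimizing sum_c |D_c| * impurity(D_c)."""
    best = (float('inf'), None, None)
    for i in range(len(X[0])):
        # thresholds at distinct values; dropping the max keeps both sides non-empty
        for theta in sorted({row[i] for row in X})[:-1]:
            left  = [y[n] for n in range(len(y)) if X[n][i] <= theta]
            right = [y[n] for n in range(len(y)) if X[n][i] >  theta]
            score = len(left) * impurity(left) + len(right) * impurity(right)
            if score < best[0]:
                best = (score, i, theta)
    return best
```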
Impurity Functions

by E_in of optimal constant:
• regression error: impurity(D) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \bar{y})^2, with \bar{y} = average of {y_n}
• classification error: impurity(D) = \frac{1}{N} \sum_{n=1}^{N} \llbracket y_n \neq y^* \rrbracket, with y^* = majority of {y_n}

for classification:
• Gini index: 1 - \sum_{k=1}^{K} \left( \frac{\sum_{n=1}^{N} \llbracket y_n = k \rrbracket}{N} \right)^2 (all k considered together)
• classification error: 1 - \max_{1 \le k \le K} \frac{\sum_{n=1}^{N} \llbracket y_n = k \rrbracket}{N} (optimal k = y^* only)

popular choices: Gini for classification, regression error for regression
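Direct Python transcriptions of these impurity functions (a sketch; ys is the list of labels in D):

```python
def regression_error(ys):                 # (1/N) * sum (y_n - ybar)^2
    ybar = sum(ys) / len(ys)
    return sum((y - ybar) ** 2 for y in ys) / len(ys)

def classification_error(ys):             # 1 - max_k (fraction of class k)
    return 1 - max(ys.count(k) for k in set(ys)) / len(ys)

def gini_index(ys):                       # 1 - sum_k (fraction of class k)^2
    return 1 - sum((ys.count(k) / len(ys)) ** 2 for k in set(ys))
```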
Termination in C&RT
function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
    if termination criteria met:
        return base hypothesis g_t(x) = E_in-optimal constant
    else:
        1. learn branching criteria
           b(x) = \mathop{argmin}_{decision stumps h(x)} \sum_{c=1}^{2} |D_c with h| \cdot impurity(D_c with h)
        ...

'forced' to terminate when:
• all y_n the same: impurity = 0 ⟹ g_t(x) = y_n
• all x_n the same: no decision stumps

C&RT: fully-grown tree with constant leaves that come from bi-branching by purifying
Basic C&RT Algorithm

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
    if cannot branch anymore:
        return g_t(x) = E_in-optimal constant
    else:
        1. learn branching criteria
           b(x) = \mathop{argmin}_{decision stumps h(x)} \sum_{c=1}^{2} |D_c with h| \cdot impurity(D_c with h)
        2. split D into 2 parts D_c = {(x_n, y_n) : b(x_n) = c}
        3. build sub-tree G_c ← DecisionTree(D_c)
        4. return G(x) = \sum_{c=1}^{2} \llbracket b(x) = c \rrbracket G_c(x)

easily handles binary classification, regression, & multiclass classification
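Putting the pieces together, a compact end-to-end sketch for classification with numerical features (all names are mine; this instantiates, rather than reproduces, the pseudocode above):

```python
from collections import Counter

def gini(ys):
    N = len(ys)
    return 1 - sum((ys.count(k) / N) ** 2 for k in set(ys))

def cart(X, y):
    # 'cannot branch anymore': all y_n the same (impurity 0), or all x_n the same
    if len(set(y)) == 1 or all(row == X[0] for row in X):
        return ('leaf', Counter(y).most_common(1)[0][0])   # E_in-optimal constant
    # 1. learn b(x): the decision stump minimizing total weighted impurity
    best = (float('inf'), None, None)
    for i in range(len(X[0])):
        for theta in sorted({row[i] for row in X})[:-1]:
            left  = [y[n] for n in range(len(y)) if X[n][i] <= theta]
            right = [y[n] for n in range(len(y)) if X[n][i] >  theta]
            score = len(left) * gini(left) + len(right) * gini(right)
            if score < best[0]:
                best = (score, i, theta)
    _, i, theta = best
    # 2. split D into 2 parts; 3. build sub-trees recursively
    L = [n for n in range(len(y)) if X[n][i] <= theta]
    R = [n for n in range(len(y)) if X[n][i] >  theta]
    subtrees = [cart([X[n] for n in s], [y[n] for n in s]) for s in (L, R)]
    # 4. G(x) = sum_c [[b(x) = c]] G_c(x)
    return ('node', i, theta, subtrees)

def predict(G, x):
    if G[0] == 'leaf':
        return G[1]
    _, i, theta, subtrees = G
    return predict(subtrees[0] if x[i] <= theta else subtrees[1], x)

# toy usage
X = [[1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 1, 1]
G = cart(X, y)
print(predict(G, [1.5]), predict(G, [3.5]))   # prints: 0 1
```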
Regularization by Pruning

fully-grown tree: E_in(G) = 0 if all x_n are different, but it overfits (large E_out) because the low-level trees are built with small D_c

• need a regularizer, say Ω(G) = NumberOfLeaves(G)
• want a regularized decision tree: argmin_{all possible G} E_in(G) + λ Ω(G), called a pruned decision tree
• cannot enumerate all possible G computationally; often consider only (see the enumeration sketch below)
    • G^(0) = fully-grown tree
    • G^(i) = argmin_G E_in(G) such that G is one-leaf-removed from G^(i−1)

systematic choice of λ? validation
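A sketch of enumerating the one-leaf-removed candidates for this greedy sequence. I assume each internal node also stores the E_in-optimal constant of the data that reached it (an extra field the training sketch above would have to record); this is my illustration, not the lecture's code:

```python
def collapse_candidates(G):
    """Yield every tree obtained from G by turning one internal node into a leaf,
    i.e. the candidates for G(i) given G(i-1)."""
    if G[0] == 'leaf':
        return                                     # nothing to collapse at a leaf
    _, i, theta, const, subtrees = G
    yield ('leaf', const)                          # collapse this node itself
    for c, sub in enumerate(subtrees):             # or collapse a node in one subtree
        for pruned in collapse_candidates(sub):
            rest = list(subtrees)
            rest[c] = pruned
            yield ('node', i, theta, const, rest)

# G(i) = argmin of E_in over collapse_candidates(G(i-1)); the final tree among
# G(0), G(1), ... is then picked by validation, which implicitly chooses lambda.
```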
Branching on Categorical Features

numerical features, e.g. blood pressure: 130, 98, 115, 147, 120
branching for numerical: decision stump b(x) = \llbracket x_i ≤ θ \rrbracket + 1, with θ ∈ ℝ

categorical features, e.g. major symptom: fever, pain, tired, sweaty
branching for categorical: decision subset b(x) = \llbracket x_i ∈ S \rrbracket + 1, with S ⊂ {1, 2, ..., K}

C&RT (& general decision trees): handle categorical features easily
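A sketch of learning a decision-subset branch by purifying, exhaustively searching subsets S for small K (function names are mine; impurity as before):

```python
from itertools import combinations

def best_subset_branch(values, ys, impurity):
    """values: the categorical feature x_i per example; returns (score, best S)."""
    cats = sorted(set(values))
    best = (float('inf'), None)
    for r in range(1, len(cats)):                  # proper, non-empty subsets S
        for S in map(set, combinations(cats, r)):
            inside  = [y for v, y in zip(values, ys) if v in S]
            outside = [y for v, y in zip(values, ys) if v not in S]
            score = len(inside) * impurity(inside) + len(outside) * impurity(outside)
            if score < best[0]:
                best = (score, S)
    return best
```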
Missing Features by Surrogate Branch

possible branch b(x) = \llbracket weight ≤ 50 kg \rrbracket; what if weight is missing during prediction?
• what would a human do? go get the weight
• or, use a threshold on height instead, because a threshold on height ≈ a threshold on weight
• surrogate branch:
    • maintain surrogate branches b_1(x), b_2(x), ... ≈ best branch b(x) during training
    • allows a missing feature for b(x) during prediction by using a surrogate instead

C&RT: handles missing features easily
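A sketch of the surrogate idea, assuming examples are dicts so that a missing feature raises KeyError (the representation and names are my assumptions):

```python
def rank_surrogates(b, candidate_branches, X, keep=3):
    """During training: keep the candidates that agree with b(x) most often on X."""
    def agreement(h):
        return sum(h(x) == b(x) for x in X)
    return sorted(candidate_branches, key=agreement, reverse=True)[:keep]

def branch_with_surrogates(x, b, surrogates):
    """During prediction: use b(x) if its feature is present, else the best usable surrogate."""
    for h in [b] + surrogates:
        try:
            return h(x)
        except KeyError:            # this branch's feature is missing in x
            continue
    raise ValueError('all branching features missing')
```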
A Simple Data Set
[figure sequence: C&RT built step by step on a simple data set]

A Complicated Data Set
[figure: C&RT on a complicated data set]
Practical Specialties of C&RT
• human-explainable
• handles multiclass classification easily
• handles categorical features easily
• handles missing features easily

almost no other learning model shares all such specialties, except for other decision trees

another popular decision tree algorithm: C4.5, with different choices of heuristics