Decision Tree: Decision Tree Algorithm

Classification and Regression Tree (C&RT)

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
  if termination criteria met
    return base hypothesis g_t(x)
  else
    ...
    2. split D to C parts D_c = {(x_n, y_n) : b(x_n) = c}

two simple choices
• C = 2 (binary tree)
• g_t(x) = E_in-optimal constant
  • binary/multiclass classification (0/1 error): majority of {y_n}
  • regression (squared error): average of {y_n}

disclaimer: C&RT here is based on selected components of CART™ of California Statistical Software
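To make the leaf choice concrete, here is a minimal sketch (not from the slides) of the E_in-optimal constant, assuming the labels or targets sit in a 1-D numpy array: the majority label minimizes 0/1 error, and the average minimizes squared error.

```python
import numpy as np
from collections import Counter

def optimal_constant_classification(y):
    """E_in-optimal constant for 0/1 error: the majority label of {y_n}."""
    return Counter(y).most_common(1)[0][0]

def optimal_constant_regression(y):
    """E_in-optimal constant for squared error: the average of {y_n}."""
    return float(np.mean(y))
```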
Branching in C&RT: Purifying

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
  if termination criteria met
    return base hypothesis g_t(x) = E_in-optimal constant
  else
    1. learn branching criteria b(x)
    2. split D to 2 parts D_c = {(x_n, y_n) : b(x_n) = c}

more simple choices
• simple internal node for C = 2: {1, 2}-output decision stump
• 'easier' sub-tree: branch by purifying

b(x) = argmin_{decision stumps h(x)} Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)

C&RT: bi-branching by purifying
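Below is a sketch of the purifying branch search, assuming numeric features in a numpy matrix X and an impurity function like the ones on the next slide; the helper name best_stump and the threshold convention x_i ≤ θ are illustrative assumptions, not part of the slides.

```python
import numpy as np

def best_stump(X, y, impurity):
    """Pick the decision stump (feature i, threshold theta) that minimizes
    sum_c |D_c with h| * impurity(D_c with h) over the two induced parts."""
    best, best_score = None, np.inf
    for i in range(X.shape[1]):                    # try every feature
        for theta in np.unique(X[:, i])[:-1]:      # candidate thresholds (keep both parts nonempty)
            left = X[:, i] <= theta                # part c = 1
            right = ~left                          # part c = 2
            score = left.sum() * impurity(y[left]) + right.sum() * impurity(y[right])
            if score < best_score:
                best_score, best = score, (i, theta)
    return best                                    # None if no stump exists
```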
Impurity Functions

by E_in of optimal constant
• regression error: impurity(D) = (1/N) Σ_{n=1}^{N} (y_n − ȳ)², with ȳ = average of {y_n}
• classification error: impurity(D) = (1/N) Σ_{n=1}^{N} ⟦y_n ≠ y*⟧, with y* = majority of {y_n}

for classification
• Gini index: 1 − Σ_{k=1}^{K} ( Σ_{n=1}^{N} ⟦y_n = k⟧ / N )² (all k considered together)
• classification error: 1 − max_{1≤k≤K} Σ_{n=1}^{N} ⟦y_n = k⟧ / N (optimal k = y* only)

popular choices: Gini for classification, regression error for regression
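A sketch of the two popular impurity functions, again assuming 1-D numpy arrays; these are direct transcriptions of the formulas above (regression error around the mean, and the Gini index over class fractions).

```python
import numpy as np

def regression_impurity(y):
    """Regression error: mean squared deviation from the average of {y_n}."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def gini_impurity(y):
    """Gini index: 1 - sum_k (fraction of examples with y_n = k)^2."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return float(1.0 - np.sum(p ** 2))
```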
Termination in C&RT

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
  if termination criteria met
    return base hypothesis g_t(x) = E_in-optimal constant
  else
    1. learn branching criteria
       b(x) = argmin_{decision stumps h(x)} Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)

'forced' to terminate when
• all y_n the same: impurity = 0 ⟹ g_t(x) = y_n
• all x_n the same: no decision stumps

C&RT: fully-grown tree with constant leaves that come from bi-branching by purifying
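The forced-termination test translates into a one-line check; a sketch assuming numpy arrays X (rows are x_n) and y:

```python
import numpy as np

def cannot_branch(X, y):
    """True when all y_n are identical (impurity 0) or all x_n are identical
    (no decision stump can split the data)."""
    return len(np.unique(y)) <= 1 or bool(np.all(X == X[0]))
```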
Fun Time

For the Gini index 1 − Σ_{k=1}^{K} ( Σ_{n=1}^{N} ⟦y_n = k⟧ / N )², consider K = 2 and let µ = N_1/N, where N_1 is the number of examples with y_n = 1. Which of the following formulas of µ equals the Gini index in this case?
1. 2µ(1 − µ)
2. 2µ²(1 − µ)
3. 2µ(1 − µ)²
4. 2µ²(1 − µ)²

Reference Answer: 1
Simplify 1 − (µ² + (1 − µ)²) and the answer should pop up.
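Writing out the simplification behind the reference answer:

```latex
1 - \mu^2 - (1-\mu)^2 = 1 - \mu^2 - 1 + 2\mu - \mu^2 = 2\mu - 2\mu^2 = 2\mu(1-\mu)
```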
Decision Tree: Decision Tree Heuristics in C&RT
Basic C&RT Algorithm

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
  if cannot branch anymore
    return g_t(x) = E_in-optimal constant
  else
    1. learn branching criteria
       b(x) = argmin_{decision stumps h(x)} Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)
    2. split D to 2 parts D_c = {(x_n, y_n) : b(x_n) = c}
    3. build sub-tree G_c ← DecisionTree(D_c)
    4. return G(x) = Σ_{c=1}^{2} ⟦b(x) = c⟧ G_c(x)

easily handles binary classification, regression, and multi-class classification
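Assembling the earlier sketches into the four numbered steps, a compact recursive version for classification; it assumes the hypothetical helpers sketched above (cannot_branch, best_stump, gini_impurity, optimal_constant_classification) and a nested-dict tree representation.

```python
def decision_tree(X, y):
    """Return a nested-dict tree: leaves hold an E_in-optimal constant,
    internal nodes hold a stump (feature, threshold) and two sub-trees."""
    if cannot_branch(X, y):
        return {"leaf": optimal_constant_classification(y)}
    feature, theta = best_stump(X, y, gini_impurity)   # 1. learn b(x)
    left = X[:, feature] <= theta                      # 2. split D into D_1, D_2
    return {
        "stump": (feature, theta),
        "left":  decision_tree(X[left],  y[left]),     # 3. build sub-trees G_c
        "right": decision_tree(X[~left], y[~left]),
    }

def predict(tree, x):
    """4. G(x) = sum_c [[b(x) = c]] G_c(x): follow the stumps to a leaf."""
    while "leaf" not in tree:
        feature, theta = tree["stump"]
        tree = tree["left"] if x[feature] <= theta else tree["right"]
    return tree["leaf"]
```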
Regularization by Pruning

fully-grown tree: E_in(G) = 0 if all x_n different
but overfit (large E_out) because low-level trees are built with small D_c
• need a regularizer, say Ω(G) = NumberOfLeaves(G)
• want a regularized decision tree:
    argmin_{all possible G} E_in(G) + λΩ(G)
  (called the pruned decision tree)
• cannot enumerate all possible G computationally: often consider only
  • G^(0) = fully-grown tree
  • G^(i) = argmin_G E_in(G) such that G is one-leaf removed from G^(i−1)

systematic choice of λ? validation
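A sketch of this pruning heuristic, under clearly illustrative assumptions: the nested-dict trees and predict from the earlier sketches, 0/1 error as E_in, and a held-out validation set to pick among the candidates G^(0), G^(1), ... (which implicitly picks λ). The helpers count_leaves and one_leaf_removed are hypothetical names introduced here, not part of CART itself.

```python
import numpy as np

def count_leaves(tree):
    """Omega(G) = NumberOfLeaves(G) for the nested-dict representation."""
    if "leaf" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])

def one_leaf_removed(tree, X, y):
    """All trees obtained by collapsing one internal node whose children are
    both leaves into a single E_in-optimal constant leaf (one leaf fewer)."""
    if "leaf" in tree:
        return []
    candidates = []
    if "leaf" in tree["left"] and "leaf" in tree["right"]:
        candidates.append({"leaf": optimal_constant_classification(y)})
    feature, theta = tree["stump"]
    mask = X[:, feature] <= theta
    candidates += [{**tree, "left": sub} for sub in one_leaf_removed(tree["left"], X[mask], y[mask])]
    candidates += [{**tree, "right": sub} for sub in one_leaf_removed(tree["right"], X[~mask], y[~mask])]
    return candidates

def prune_by_validation(tree, X, y, X_val, y_val):
    """Build G^(0), G^(1), ... (each step keeps the one-leaf-removed candidate
    with smallest E_in), then return the tree with the smallest validation
    error, a systematic stand-in for choosing lambda."""
    err = lambda t, A, b: float(np.mean([predict(t, a) != label for a, label in zip(A, b)]))
    candidates = [tree]
    while count_leaves(candidates[-1]) > 1:
        options = one_leaf_removed(candidates[-1], X, y)
        candidates.append(min(options, key=lambda t: err(t, X, y)))
    return min(candidates, key=lambda t: err(t, X_val, y_val))
```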