## Machine Learning Techniques (機器學習技法)

### Lecture 9: Decision Tree

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)


## Roadmap

### 1 Embedding Numerous Features: Kernel Models

### 2 Combining Predictive Features: Aggregation Models

**Lecture 8: Adaptive Boosting**: **optimal re-weighting** for diverse hypotheses and adaptive **linear aggregation** to **boost 'weak' algorithms**

**Lecture 9: Decision Tree**: Decision Tree Hypothesis, Decision Tree Algorithm, Decision Tree Heuristics in C&RT, Decision Tree in Action

### 3 Distilling Implicit Features: Extraction Models

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 1/22

## What We Have Done

blending: aggregate

### after getting g _{t}

;
learning: aggregate### as well as getting g t

aggregation type

### blending learning

uniform voting/averaging### Bagging

non-uniform linear

### AdaBoost

**conditional**

stacking **Decision Tree**

**decision tree: a traditional learning model that**

realizes**conditional aggregation**


## Decision Tree for Watching MOOC Lectures

G(x) = Σ_{t=1}^{T} q_t(x) · g_t(x)

- **base hypothesis g_t(x)**: leaf at the end of path t, a **constant** here
- **condition q_t(x)**: ⟦is x on path t?⟧
- usually with **simple internal nodes**

(figure: a tree whose internal nodes ask **quitting time?** (<18:30 / between / >21:30), **has a date?** (true / false), and **deadline?** (>2 days / between / < −2 days), with Y/N leaves)

decision tree: arguably one of the most **human-mimicking models**

## Recursive View of Decision Tree

Path View: G(x) = Σ_{t=1}^{T} ⟦x on path t⟧ · leaf_t(x)

Recursive View: G(x) = Σ_{c=1}^{C} ⟦b(x) = c⟧ · G_c(x)

- G(x): full-tree hypothesis
- b(x): branching criteria
- G_c(x): sub-tree hypothesis at the c-th branch

tree = (root, sub-trees), just like what **your data structure instructor would say :-)**
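The recursive view maps directly onto code. A minimal sketch (hypothetical Python, not from the lecture) of the data structure: a tree is either a constant leaf or an internal node holding a branching function b(x) and its sub-trees, so prediction follows exactly one path.

```python
# Hypothetical sketch of the recursive tree hypothesis; class and field
# names are illustrative, not from the lecture.

class Leaf:
    def __init__(self, value):
        self.value = value        # the constant hypothesis g_t(x)

class Node:
    def __init__(self, branch, subtrees):
        self.branch = branch      # b(x): returns a branch index c
        self.subtrees = subtrees  # list of sub-tree hypotheses G_c

def predict(tree, x):
    """G(x) = sum_c [b(x) = c] * G_c(x): follow exactly one branch."""
    while isinstance(tree, Node):
        tree = tree.subtrees[tree.branch(x)]
    return tree.value

# Illustrative root resembling the 'quitting time' question of the MOOC tree
mooc = Node(branch=lambda x: 0 if x["time"] < 18.5 else 1,
            subtrees=[Leaf("N"), Leaf("Y")])
print(predict(mooc, {"time": 17.0}))  # -> N
```

The `while` loop realizes the recursion iteratively: each step replaces G by the sub-tree G_c that the branching criteria select.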


## Disclaimers about Decision Tree

### Usefulness

- human-explainable: **widely used** in business/medical data analysis
- simple: **even freshmen can implement one :-)**
- efficient in prediction and **training**

### However...

- heuristic: mostly **little theoretical** explanation
- heuristics: **heuristic selection** confusing to beginners
- arguably no single **representative algorithm**

decision tree: mostly **heuristic but useful** on its own


## Fun Time

The following C-like code can be viewed as a decision tree of three leaves.

```c
if (income > 100000) return true;
else {
    if (debt > 50000) return false;
    else return true;
}
```

What is the output of the tree for (income, debt) = (98765, 56789)?

1. true
2. false
3. 98765
4. 56789

**Reference Answer: 2**

You can simply trace the code. The tree expresses a complicated boolean condition ⟦income > 100000 or debt ≤ 50000⟧.


## A Basic Decision Tree Algorithm

G(x) = Σ_{c=1}^{C} ⟦b(x) = c⟧ · G_c(x)

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
  if termination criteria met
    return base hypothesis g_t(x)
  else
    1. learn branching criteria b(x)
    2. split D to C parts D_c = {(x_n, y_n) : b(x_n) = c}
    3. build sub-tree G_c ← DecisionTree(D_c)
    4. return G(x) = Σ_{c=1}^{C} ⟦b(x) = c⟧ · G_c(x)

four choices: **number of branches, branching criteria, termination criteria, & base hypothesis**
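The four choices can be made explicit as parameters of the recursion. A hypothetical Python sketch of the pseudocode above (the names and the tiny instantiation are illustrative, and the sketch assumes every branch receives at least one example):

```python
from collections import Counter

def decision_tree(D, learn_branch, terminate, base_hypothesis, num_branches):
    """Generic recursion; the four choices enter as parameters.
    Assumes every branch receives at least one example (simplification)."""
    if terminate(D):
        return ("leaf", base_hypothesis(D))          # base hypothesis g_t(x)
    b = learn_branch(D)                              # 1. branching criteria b(x)
    parts = [[(x, y) for (x, y) in D if b(x) == c]   # 2. split D into C parts
             for c in range(num_branches)]
    subtrees = [decision_tree(Dc, learn_branch, terminate,
                              base_hypothesis, num_branches)
                for Dc in parts]                     # 3. build sub-trees G_c
    return ("node", b, subtrees)                     # 4. G(x) combines them

def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, b, subtrees = tree
    return predict(subtrees[b(x)], x)

# Tiny instantiation: binary labels, a fixed stump on the sign of x[0],
# termination when labels are pure, majority-vote leaves
D = [((-2.0,), -1), ((-1.0,), -1), ((1.0,), +1), ((2.0,), +1)]
tree = decision_tree(
    D,
    learn_branch=lambda D: (lambda x: 0 if x[0] < 0 else 1),
    terminate=lambda D: len({y for _, y in D}) == 1,
    base_hypothesis=lambda D: Counter(y for _, y in D).most_common(1)[0][0],
    num_branches=2)
print(predict(tree, (-1.5,)), predict(tree, (1.5,)))  # -> -1 1
```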


## Classification and Regression Tree (C&RT)

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
  if termination criteria met
    return base hypothesis g_t(x)
  else ...
    2. split D to C parts D_c = {(x_n, y_n) : b(x_n) = c}

two simple choices:

- C = 2 (binary tree)
- g_t(x) = E_in-optimal **constant**
  - binary/multiclass classification (0/1 error): majority of {y_n}
  - regression (squared error): average of {y_n}

disclaimer: **C&RT** here is based on **selected components** of **CART™ of California Statistical Software**

## Branching in C&RT: Purifying

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
  if termination criteria met
    return base hypothesis g_t(x) = E_in-optimal constant
  else ...
    1. learn branching criteria b(x)
    2. split D to 2 parts D_c = {(x_n, y_n) : b(x_n) = c}

more simple choices:

- simple internal node for C = 2: **{1, 2}-output decision stump**
- 'easier' sub-tree: branch by **purifying**

b(x) = argmin_{decision stumps h(x)} Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)

C&RT: **bi-branching** by **purifying**
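The argmin over decision stumps can be carried out by brute force. A hypothetical Python sketch (names are mine; classification-error impurity is used here for simplicity, assuming numerical features):

```python
# Try every (feature, threshold) decision stump and keep the one minimizing
# the weighted impurity sum_{c=1}^{2} |D_c with h| * impurity(D_c with h).

def impurity(ys):
    """Classification-error impurity: 1 - fraction of the majority class."""
    if not ys:
        return 0.0
    return 1.0 - max(ys.count(k) for k in set(ys)) / len(ys)

def best_stump(D):
    """Return the (feature index i, threshold theta) of the purifying stump."""
    best, best_score = None, float("inf")
    num_features = len(D[0][0])
    for i in range(num_features):
        values = sorted({x[i] for x, _ in D})
        # midpoints between consecutive feature values as candidate thresholds
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            left = [y for x, y in D if x[i] <= theta]
            right = [y for x, y in D if x[i] > theta]
            score = len(left) * impurity(left) + len(right) * impurity(right)
            if score < best_score:
                best, best_score = (i, theta), score
    return best

D = [((1.0, 5.0), +1), ((2.0, 6.0), +1), ((3.0, 1.0), -1), ((4.0, 2.0), -1)]
print(best_stump(D))  # -> (0, 2.5): splitting x_0 at 2.5 purifies both sides
```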


## Impurity Functions

by E_in of optimal constant:

- regression error: impurity(D) = (1/N) Σ_{n=1}^{N} (y_n − ȳ)², with ȳ = average of {y_n}
- classification error: impurity(D) = (1/N) Σ_{n=1}^{N} ⟦y_n ≠ y*⟧, with y* = majority of {y_n}

for classification:

- Gini index: 1 − Σ_{k=1}^{K} ( Σ_{n=1}^{N} ⟦y_n = k⟧ / N )², all k considered together
- classification error: 1 − max_{1≤k≤K} Σ_{n=1}^{N} ⟦y_n = k⟧ / N, optimal k = y* only

**popular** choices: **Gini** for classification, **regression error** for regression
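These impurity functions transcribe directly into code. A hypothetical Python sketch (function names are mine, not the lecture's):

```python
def regression_impurity(ys):
    """(1/N) * sum_n (y_n - ybar)^2, with ybar the average of {y_n}."""
    ybar = sum(ys) / len(ys)
    return sum((y - ybar) ** 2 for y in ys) / len(ys)

def gini_index(ys):
    """1 - sum_k (N_k / N)^2, with all classes considered together."""
    n = len(ys)
    return 1.0 - sum((ys.count(k) / n) ** 2 for k in set(ys))

def classification_error(ys):
    """1 - max_k N_k / N: the error of the optimal constant y*."""
    n = len(ys)
    return 1.0 - max(ys.count(k) for k in set(ys)) / n

print(gini_index([1, 1, 1, 2]))            # -> 0.375
print(classification_error([1, 1, 1, 2]))  # -> 0.25
print(regression_impurity([1.0, 3.0]))     # -> 1.0
```

All three are zero on a pure set and grow as the set gets more mixed, which is exactly what the purifying branch criterion needs.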

## Termination in C&RT

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
  if termination criteria met
    return base hypothesis g_t(x) = E_in-optimal constant
  else ...
    1. learn branching criteria b(x) = argmin_{decision stumps h(x)} Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)

'forced' to terminate when:

- all y_n the same: impurity = 0 ⟹ g_t(x) = y_n
- all **x_n** the same: **no decision stumps**

C&RT: **fully-grown tree** with **constant leaves** that come from **bi-branching** by **purifying**

## Fun Time

For the Gini index 1 − Σ_{k=1}^{K} ( Σ_{n=1}^{N} ⟦y_n = k⟧ / N )², consider K = 2, and let µ = N₁/N, where N₁ is the number of examples with y_n = 1. Which of the following formulas of µ equals the Gini index in this case?

1. 2µ(1 − µ)
2. 2µ²(1 − µ)
3. 2µ(1 − µ)²
4. 2µ²(1 − µ)²

**Reference Answer: 1**

Simplify 1 − (µ² + (1 − µ)²) and the answer should pop up.


## Basic C&RT Algorithm

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
  if cannot branch anymore
    return g_t(x) = E_in-optimal constant
  else
    1. learn branching criteria b(x) = argmin_{decision stumps h(x)} Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)
    2. split D to 2 parts D_c = {(x_n, y_n) : b(x_n) = c}
    3. build sub-tree G_c ← DecisionTree(D_c)
    4. return G(x) = Σ_{c=1}^{2} ⟦b(x) = c⟧ · G_c(x)

easily handles binary classification, regression, & **multi-class classification**

## Regularization by Pruning

fully-grown tree: E_in(G) = 0 if all **x_n** different, but **overfit** (large E_out) because **low-level trees built with small D_c**

- need a **regularizer**, say, Ω(G) = NumberOfLeaves(G)
- want **regularized** decision tree: argmin_{all possible G} E_in(G) + λΩ(G), called a **pruned** decision tree
- cannot enumerate all possible G computationally; often consider only
  - G^(0) = fully-grown tree
  - G^(i) = argmin_G E_in(G) such that G is **one-leaf removed** from G^(i−1)

systematic choice of λ? **validation**
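The pruning objective E_in(G) + λΩ(G) is easy to evaluate for any candidate tree. A hypothetical Python sketch (tree representation and names are mine) comparing a fully-grown tree against a one-leaf pruned version:

```python
# Trees as tuples: ("leaf", value) or ("node", b, [subtrees]); illustrative only.

def num_leaves(tree):
    """Omega(G) = NumberOfLeaves(G)."""
    if tree[0] == "leaf":
        return 1
    return sum(num_leaves(sub) for sub in tree[2])

def predict(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, b, subtrees = tree
    return predict(subtrees[b(x)], x)

def regularized_score(tree, D, lam):
    """E_in(G) + lambda * Omega(G): the pruning objective (0/1 error)."""
    e_in = sum(predict(tree, x) != y for x, y in D) / len(D)
    return e_in + lam * num_leaves(tree)

full = ("node", lambda x: 0 if x[0] < 0 else 1, [("leaf", -1), ("leaf", +1)])
pruned = ("leaf", +1)
D = [((-1.0,), -1), ((1.0,), +1), ((2.0,), +1)]

# small lambda keeps the fully-grown tree; large lambda favors pruning
print(regularized_score(full, D, 0.1), regularized_score(pruned, D, 0.1))
print(regularized_score(full, D, 0.5), regularized_score(pruned, D, 0.5))
```

With λ = 0.1 the fully-grown tree wins (zero E_in outweighs the extra leaf); with λ = 0.5 the one-leaf tree wins despite its nonzero E_in, which is why λ needs validation.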


## Branching on Categorical Features

numerical features, e.g. blood pressure: 130, 98, 115, 147, 120

- branching for numerical: decision stump b(x) = ⟦x_i ≤ θ⟧ + 1 with θ ∈ ℝ

categorical features, e.g. major symptom: fever, pain, tired, sweaty

- branching for categorical: decision subset b(x) = ⟦x_i ∈ S⟧ + 1 with S ⊂ {1, 2, . . . , K}

C&RT (& general decision trees): handles **categorical features easily**
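The two branching criteria differ only in the condition inside ⟦·⟧. A hypothetical Python sketch (the blood-pressure/symptom example values are illustrative):

```python
# Both criteria output a branch in {1, 2}: condition false -> 1, true -> 2.

def numeric_branch(i, theta):
    """Decision stump: b(x) = [x_i <= theta] + 1, theta a real threshold."""
    return lambda x: int(x[i] <= theta) + 1

def subset_branch(i, S):
    """Decision subset: b(x) = [x_i in S] + 1, S a subset of the categories."""
    return lambda x: int(x[i] in S) + 1

b_num = numeric_branch(0, 120)               # blood pressure <= 120?
b_cat = subset_branch(1, {"fever", "pain"})  # major symptom in S?

print(b_num((130, "fever")))  # -> 1 (130 > 120: condition false)
print(b_cat((130, "fever")))  # -> 2 (fever is in S: condition true)
```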

## Missing Features by Surrogate Branch

possible b(x) = ⟦weight ≤ 50kg⟧; if weight is missing during prediction:

- what would a human do?
  - go get the weight
  - or, use a threshold on height instead, because threshold on height ≈ threshold on weight
- surrogate branch:
  - maintain surrogate branches b₁(x), b₂(x), . . . ≈ best branch b(x) during training
  - allow a missing feature for b(x) during prediction by using a surrogate instead

C&RT: handles **missing features easily**
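A hypothetical Python sketch of the surrogate idea (all names and thresholds, including the height surrogate, are illustrative; `None` stands for a missing value):

```python
# During training, a surrogate is a stump that best agrees with the main
# branch; during prediction, fall back to it when the main feature is missing.

def agreement(b, b_surrogate, X):
    """Fraction of examples on which two branching functions agree."""
    return sum(b(x) == b_surrogate(x) for x in X) / len(X)

def branch_with_surrogate(b, b_surrogate, feature_index):
    def branch(x):
        if x[feature_index] is None:   # main feature missing
            return b_surrogate(x)
        return b(x)
    return branch

# weight (index 0) as the main feature, height (index 1) as the surrogate
b_main = lambda x: int(x[0] <= 50) + 1    # [weight <= 50kg] + 1
b_sur = lambda x: int(x[1] <= 160) + 1    # [height <= 160cm] + 1, hypothetical
b = branch_with_surrogate(b_main, b_sur, 0)

print(b((45, 155)))    # weight present: 45 <= 50 -> branch 2
print(b((None, 175)))  # weight missing: height 175 > 160 -> branch 1
```

In practice one would rank several candidate surrogates by their agreement with b(x) on the training data and keep the best few.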


## Fun Time

For a categorical branching criterion b(x) = ⟦x_i ∈ S⟧ + 1 with S = {1, 6}, which of the following explains the criterion?

1. if the i-th feature is of type 1 or type 6, branch to the first sub-tree; else branch to the second sub-tree
2. if the i-th feature is of type 1 or type 6, branch to the second sub-tree; else branch to the first sub-tree
3. if the i-th feature is of type 1 and type 6, branch to the second sub-tree; else branch to the first sub-tree
4. if the i-th feature is of type 1 and type 6, branch to the first sub-tree; else branch to the second sub-tree

**Reference Answer: 2**

Note that '∈ S' is an 'or'-style condition on the elements of S in human language.


## A Simple Data Set

(figure: C&RT vs AdaBoost-Stump decision boundaries on a simple data set, shown over several branching iterations)

**C&RT: 'divide-and-conquer'**

## A Complicated Data Set

(figure: C&RT vs AdaBoost-Stump decision boundaries on a complicated data set)

**C&RT: even more efficient than AdaBoost-Stump**


## Practical Specialties of C&RT

- human-explainable
- handles multiclass easily
- handles categorical features easily
- handles missing features easily
- efficient non-linear training (and testing)

almost no other learning model shares all such specialties, except for **other decision trees**

**another** popular decision tree algorithm: **C4.5, with different choices of heuristics**


## Fun Time

Which of the following is **not** a specialty of C&RT without pruning?

1. handles missing features easily
2. produces explainable hypotheses
3. achieves low E_in
4. achieves low E_out

**Reference Answer: 4**

The first two choices are easy; the third comes from the fact that a fully grown C&RT greedily minimizes E_in (almost always to 0). But as you may imagine, overfitting may happen and E_out may not always be low.
