## Machine Learning Techniques (機器學習技巧)

### Lecture 9: Decision Tree

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw

### Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Agenda

### Lecture 9: Decision Tree

1. Decision Tree Hypothesis
2. Decision Tree Algorithm
3. Decision Tree in Practice
4. Decision Tree in Action


## What We Have Done

- blending: aggregate *after* getting g_t
- learning: aggregate *as well as* getting g_t

| aggregation type | blending         | learning          |
|------------------|------------------|-------------------|
| uniform          | voting/averaging | Bagging           |
| non-uniform      | linear           | AdaBoost          |
| **conditional**  | stacking         | **Decision Tree** |

**decision tree: a traditional learning model that realizes conditional aggregation**

## Decision Tree for Playing Golf

G(x) = Σ_{t=1}^{T} q_t(x) · g_t(x)

- **base hypothesis g_t(x)**: the leaf at the end of path t, a **constant** here
- **condition q_t(x)**: ⟦x on path t?⟧
- usually with **simple internal nodes**

decision tree: arguably one of the most **human-mimicking models**

## Recursive View of Decision Tree

Path view: G(x) = Σ_{t=1}^{T} ⟦x on path t⟧ · leaf_t(x)

Recursive view: G(x) = Σ_{c=1}^{C} ⟦b(x) = c⟧ · G_c(x)

- G(x): full-tree hypothesis
- b(x): branching criteria
- G_c(x): sub-tree hypothesis at the c-th branch

tree = (root, subtrees), just like what **your data structure instructor would say :-)**
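The recursive view maps directly onto a recursive data structure. Below is a minimal sketch; the names (`TreeNode`, `predict`) and the example branch are illustrative, not from the lecture.

```python
class TreeNode:
    """tree = (root, subtrees): an internal node holds a branching
    criterion b(x); a leaf holds a constant base hypothesis g_t(x)."""

    def __init__(self, constant=None, branch=None, subtrees=None):
        self.constant = constant   # leaf value (constant base hypothesis)
        self.branch = branch       # b(x): returns a branch index c
        self.subtrees = subtrees   # list of sub-tree hypotheses G_c

    def predict(self, x):
        # G(x) = sum_c [b(x) = c] * G_c(x): exactly one branch fires
        if self.subtrees is None:
            return self.constant
        return self.subtrees[self.branch(x)].predict(x)

# a tiny hand-built tree: branch on whether x[0] <= 0.5
tree = TreeNode(
    branch=lambda x: 0 if x[0] <= 0.5 else 1,
    subtrees=[TreeNode(constant=-1), TreeNode(constant=+1)],
)
print(tree.predict([0.2]))  # -1
print(tree.predict([0.9]))  # 1
```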

## Disclaimers about Decision Tree

### Usefulness

- human-explainable: **widely used** in business/medical data analysis
- simple: **even freshmen can implement one :-)**
- efficient in prediction and **training**

### However...

- heuristic: mostly **little theoretical explanation**
- heuristics: **'heuristic selection'** confusing to beginners
- arguably no single **representative algorithm**

decision tree: mostly **heuristic but useful** on its own


## A Basic Decision Tree Algorithm

G(x) = Σ_{c=1}^{C} ⟦b(x) = c⟧ · G_c(x)

```
function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
    if termination criteria met:
        return base hypothesis g_t(x)
    else:
        1. learn branching criteria b(x)
        2. split D into C parts D_c = {(x_n, y_n) : b(x_n) = c}
        3. build sub-trees G_c ← DecisionTree(D_c)
        4. return G(x) = Σ_{c=1}^{C} ⟦b(x) = c⟧ · G_c(x)
```

four choices: **number of branches, branching criteria, termination criteria, & base hypothesis**

## Classification and Regression Tree (C&RT)

```
function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
    if termination criteria met:
        return base hypothesis g_t(x)
    else ...
        2. split D into C parts D_c = {(x_n, y_n) : b(x_n) = c}
```

two simple choices:

- C = 2 (binary tree)
- g_t(x) = E_in-optimal **constant**:
  - binary/multiclass classification (0/1 error): majority of {y_n}
  - regression (squared error): average of {y_n}

disclaimer: **C&RT** here is based on **selected components** of **CART™ of California Statistical Software**
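The two E_in-optimal constants above can be written out directly. A minimal sketch (function names are mine): majority minimizes 0/1 error among constants, and average minimizes squared error.

```python
def optimal_constant_classification(ys):
    # majority of {y_n}: the constant minimizing 0/1 error
    return max(set(ys), key=ys.count)

def optimal_constant_regression(ys):
    # average of {y_n}: the constant minimizing squared error
    return sum(ys) / len(ys)

print(optimal_constant_classification([+1, -1, +1, +1]))  # 1
print(optimal_constant_regression([1.0, 2.0, 6.0]))       # 3.0
```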

## Branching in C&RT: Purifying

```
function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
    if termination criteria met:
        return base hypothesis g_t(x) = E_in-optimal constant
    else ...
        1. learn branching criteria b(x)
        2. split D into 2 parts D_c = {(x_n, y_n) : b(x_n) = c}
```

more simple choices:

- simple internal node for C = 2: **{1, 2}-output decision stump**
- 'easier' subtree: branch by **purifying**

b(x) = argmin over decision stumps h(x) of Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)

C&RT: **bi-branching** by **purifying**
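The argmin over decision stumps can be sketched as an exhaustive search: try every feature and every midpoint threshold, and keep the stump with the smallest weighted impurity. This is a sketch under my own naming; `impurity` is passed in as a function (concrete choices appear on the next slide), and the demo impurity is classification error.

```python
def learn_branch(X, y, impurity):
    """Return (score, feature index, threshold) of the best decision stump:
    b(x) = argmin_h sum_c |D_c with h| * impurity(D_c with h)."""
    best = None
    for i in range(len(X[0])):
        values = sorted(set(x[i] for x in X))
        # midpoints between consecutive distinct values as thresholds
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            left = [y_n for x, y_n in zip(X, y) if x[i] <= theta]
            right = [y_n for x, y_n in zip(X, y) if x[i] > theta]
            score = len(left) * impurity(left) + len(right) * impurity(right)
            if best is None or score < best[0]:
                best = (score, i, theta)
    return best

def clf_error(ys):
    # classification error as a demo impurity
    return 1 - max(ys.count(v) for v in set(ys)) / len(ys) if ys else 0.0

X = [[1.0], [2.0], [3.0], [4.0]]
y = [-1, -1, +1, +1]
score, feat, theta = learn_branch(X, y, clf_error)
print(feat, theta, score)  # 0 2.5 0.0
```

If every example has identical x, no threshold exists and `learn_branch` returns `None`; that is exactly the "no decision stumps" termination case discussed later.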

## Impurity Functions

by E_in of the optimal constant:

- regression error: impurity(D) = (1/N) Σ_{n=1}^{N} (y_n − ȳ)², with ȳ = average of {y_n}
- classification error: impurity(D) = (1/N) Σ_{n=1}^{N} ⟦y_n ≠ y*⟧, with y* = majority of {y_n}

for classification:

- Gini index: 1 − Σ_{k=1}^{K} ( (Σ_{n=1}^{N} ⟦y_n = k⟧) / N )² (all k considered together)
- classification error: 1 − max_{1≤k≤K} (Σ_{n=1}^{N} ⟦y_n = k⟧) / N (optimal k = y* only)

popular choices: **Gini** for classification, **regression error** for regression
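The three impurity functions above translate to a few lines each; a minimal sketch with function names of my choosing:

```python
def regression_impurity(ys):
    # (1/N) * sum_n (y_n - ybar)^2, with ybar = average of {y_n}
    ybar = sum(ys) / len(ys)
    return sum((y - ybar) ** 2 for y in ys) / len(ys)

def gini_index(ys):
    # 1 - sum_k (fraction of examples with y_n = k)^2, all k together
    N = len(ys)
    return 1 - sum((ys.count(k) / N) ** 2 for k in set(ys))

def classification_error(ys):
    # 1 - max_k (fraction of examples with y_n = k): optimal k = y* only
    return 1 - max(ys.count(k) for k in set(ys)) / len(ys)

print(gini_index([1, 1, 2, 2]))            # 0.5
print(classification_error([1, 1, 2, 2]))  # 0.5
print(regression_impurity([1.0, 3.0]))     # 1.0
```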
## Termination in C&RT

```
function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
    if termination criteria met:
        return base hypothesis g_t(x) = E_in-optimal constant
    else ...
        1. learn branching criteria
           b(x) = argmin over decision stumps h(x) of Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)
```

'forced' to terminate when:

- all y_n the same: impurity = 0 ⟹ g_t(x) = y_n
- all **x_n** the same: **no decision stumps** available

C&RT: **fully-grown tree** with **constant leaves** that come from **bi-branching** by **purifying**


## Basic C&RT Algorithm

```
function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
    if cannot branch anymore:
        return g_t(x) = E_in-optimal constant
    else:
        1. learn branching criteria
           b(x) = argmin over decision stumps h(x) of Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)
        2. split D into 2 parts D_c = {(x_n, y_n) : b(x_n) = c}
        3. build sub-trees G_c ← DecisionTree(D_c)
        4. return G(x) = Σ_{c=1}^{2} ⟦b(x) = c⟧ · G_c(x)
```

easily handles binary classification, regression, & **multi-class classification**
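Putting the pieces together, here is a minimal end-to-end sketch of the basic C&RT algorithm for classification: Gini impurity, exhaustive decision-stump branching, and a fully-grown tree with constant (majority) leaves. All names and the dict-based tree representation are my simplifications, not the CART software.

```python
def gini(ys):
    # Gini index: 1 - sum_k (fraction with y_n = k)^2
    N = len(ys)
    return 1 - sum((ys.count(k) / N) ** 2 for k in set(ys))

def cart(X, y):
    # 'cannot branch anymore': all y_n the same, or all x_n the same
    if len(set(y)) == 1 or all(x == X[0] for x in X):
        return {"leaf": max(set(y), key=y.count)}  # E_in-optimal constant
    best = None
    for i in range(len(X[0])):  # step 1: learn b(x) by purifying
        values = sorted(set(x[i] for x in X))
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            L = [(x, t) for x, t in zip(X, y) if x[i] <= theta]
            R = [(x, t) for x, t in zip(X, y) if x[i] > theta]
            score = (len(L) * gini([t for _, t in L])
                     + len(R) * gini([t for _, t in R]))
            if best is None or score < best[0]:
                best = (score, i, theta, L, R)
    _, i, theta, L, R = best
    # steps 2-4: split, recurse, return the bi-branching node
    return {"i": i, "theta": theta,
            "left": cart([x for x, _ in L], [t for _, t in L]),
            "right": cart([x for x, _ in R], [t for _, t in R])}

def predict(node, x):
    if "leaf" in node:
        return node["leaf"]
    child = node["left"] if x[node["i"]] <= node["theta"] else node["right"]
    return predict(child, x)

X = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
y = [-1, -1, +1, +1]  # separable by thresholding the second feature
tree = cart(X, y)
print([predict(tree, x) for x in X])  # [-1, -1, 1, 1]
```

Swapping `gini` for the regression impurity and the majority leaf for the average leaf gives the regression variant, which is why the same skeleton handles both tasks.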

## Regularization by Pruning

fully-grown tree: E_in(G) = 0 if all **x_n** are different, but it can **overfit** (large E_out) because **low-level trees are built with small D_c**

- need a **regularizer**, say Ω(G) = NumberOfLeaves(G)
- want a **regularized** decision tree, called a **pruned** decision tree: argmin over all possible G of E_in(G) + λΩ(G)
- cannot enumerate all possible G computationally; often consider only:
  - G^(0) = fully-grown tree
  - G^(i) = argmin_G E_in(G) such that G is **one leaf removed** from G^(i−1)

systematic choice of λ? **validation**
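Once the candidate sequence G^(0), G^(1), ... has been generated by removing one leaf at a time, choosing the pruned tree reduces to a one-liner over each candidate's (E_in, leaf count) pair. A sketch; the candidate numbers below are made up for illustration.

```python
def best_pruned(candidates, lam):
    """candidates: list of (E_in, num_leaves) for G^(0), G^(1), ...
    Return the index minimizing E_in(G) + lam * Omega(G)."""
    return min(range(len(candidates)),
               key=lambda i: candidates[i][0] + lam * candidates[i][1])

# hypothetical pruning sequence: error rises as leaves are removed
candidates = [(0.00, 8), (0.02, 7), (0.05, 6), (0.12, 5)]
print(best_pruned(candidates, lam=0.0))   # 0: lambda = 0 keeps the full tree
print(best_pruned(candidates, lam=0.04))  # 2: regularization favors fewer leaves
```

In practice λ itself would be chosen by validation, as the slide notes.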

## Branching on Categorical Features

### Numerical features

blood pressure: 130, 98, 115, 147, 120

branching for numerical features: decision stump, b(x) = ⟦x_i ≤ θ⟧ + 1 with θ ∈ ℝ

### Categorical features

major symptom: fever, pain, tired, sweaty

branching for categorical features: decision subset, b(x) = ⟦x_i ∈ S⟧ + 1 with S ⊂ {1, 2, ..., K}

C&RT (& general decision trees): handles **categorical features easily**
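The two branching primitives differ only in the test they apply; everything else in the algorithm stays the same. A sketch, with feature names and dict-based examples of my own invention:

```python
def stump_branch(x, i, theta):
    # numerical: b(x) = [x_i <= theta] + 1, theta a real threshold
    return int(x[i] <= theta) + 1

def subset_branch(x, i, S):
    # categorical: b(x) = [x_i in S] + 1, S a subset of the K categories
    return int(x[i] in S) + 1

patient = {"blood_pressure": 130, "major_symptom": "fever"}
print(stump_branch(patient, "blood_pressure", 120))                # 1
print(subset_branch(patient, "major_symptom", {"fever", "pain"}))  # 2
```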

## Missing Features by Surrogate Branch

possible branch: b(x) = ⟦weight ≤ 50 kg⟧. What if weight is missing during prediction?

- what would a human do?
  - go get the weight
  - or use a threshold on height instead, because a threshold on height ≈ a threshold on weight
- **surrogate branch**:
  - maintain surrogate branches b_1(x), b_2(x), ... ≈ the best branch b(x) during training
  - allows a missing feature for b(x) during prediction by using a surrogate instead

C&RT: handles **missing features easily**
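At prediction time the surrogate mechanism is a simple fallback chain: use the best branch if its feature is present, otherwise the first surrogate whose feature is present. A sketch; the weight/height thresholds here are invented for illustration, and learning the surrogates (during training, by mimicking the best branch's split) is not shown.

```python
def branch_with_surrogates(x, branches):
    """branches: [(feature, threshold), ...] with the best branch first
    and its surrogates b_1, b_2, ... after, in order of preference."""
    for feature, theta in branches:
        if x.get(feature) is not None:
            return int(x[feature] <= theta) + 1
    raise ValueError("all branching features missing")

branches = [("weight", 50), ("height", 160)]  # height as weight's surrogate
print(branch_with_surrogates({"weight": 45, "height": 150}, branches))   # 2
print(branch_with_surrogates({"weight": None, "height": 170}, branches)) # 1
```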

## A Simple Data Set

(figures from the slides: C&RT grown step by step on a simple data set)

## A Complicated Data Set

(figure from the slides: C&RT on a complicated data set)

## Practical Specialties of C&RT

- human-explainable
- handles multiclass classification easily
- handles categorical features easily
- handles missing features easily

almost no other learning model shares all such specialties, except for **other decision trees**

another popular decision tree algorithm: **C4.5, with different choices of heuristics**