Decision Tree: Decision Tree Algorithm

## Classification and Regression Tree (C&RT)

function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$)

if termination criteria met, return base hypothesis $g_t(\mathbf{x})$

else ...

2. split $\mathcal{D}$ to $C$ parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$

### two simple choices

• $C = 2$ (binary tree)

• $g_t(\mathbf{x}) = E_{\text{in}}$-optimal constant

  • binary/multiclass classification (0/1 error): majority of $\{y_n\}$

  • regression (squared error): average of $\{y_n\}$

disclaimer: **C&RT** here is based on **selected components** of **CART**™ **of California Statistical Software**
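The two choices of $E_{\text{in}}$-optimal constant can be sketched in a few lines of Python. The function name `optimal_constant` and the `task` argument are illustrative assumptions, not part of C&RT itself.

```python
from collections import Counter


def optimal_constant(ys, task):
    """Return the E_in-optimal constant prediction for the labels ys."""
    if task == "classification":
        # 0/1 error is minimized by the most frequent label (majority).
        return Counter(ys).most_common(1)[0][0]
    if task == "regression":
        # Squared error is minimized by the arithmetic mean (average).
        return sum(ys) / len(ys)
    raise ValueError(f"unknown task: {task}")


print(optimal_constant(["a", "b", "a"], "classification"))  # a
print(optimal_constant([1.0, 2.0, 6.0], "regression"))      # 3.0
```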

## Branching in C&RT: Purifying

function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$)

if termination criteria met, return base hypothesis $g_t(\mathbf{x}) = E_{\text{in}}$-optimal constant

else ...

1. learn branching criteria $b(\mathbf{x})$
2. split $\mathcal{D}$ to 2 parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$

### more simple choices

$$b(\mathbf{x}) = \operatorname*{argmin}_{\text{decision stumps } h(\mathbf{x})} \sum_{c=1}^{2} |\mathcal{D}_c \text{ with } h| \cdot \text{impurity}(\mathcal{D}_c \text{ with } h)$$

C&RT: **bi-branching** by **purifying**
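A minimal sketch of the stump search, assuming axis-aligned stumps with thresholds at midpoints between consecutive distinct feature values and the Gini index as the impurity. The names `gini` and `best_stump` and the threshold choice are assumptions for illustration.

```python
def gini(ys):
    """Gini index of a list of labels: 1 - sum_k (fraction of label k)^2."""
    n = len(ys)
    return 1.0 - sum((ys.count(k) / n) ** 2 for k in set(ys))


def best_stump(xs, ys, impurity=gini):
    """Search axis-aligned stumps; return (score, feature, threshold).

    The stump sends x to part 1 if x[feature] <= threshold, else to part 2.
    Returns None when no stump splits the data into two non-empty parts.
    """
    best = None
    for i in range(len(xs[0])):
        values = sorted(set(x[i] for x in xs))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            left = [y for x, y in zip(xs, ys) if x[i] <= theta]
            right = [y for x, y in zip(xs, ys) if x[i] > theta]
            # Size-weighted impurity: sum_c |D_c| * impurity(D_c).
            score = len(left) * impurity(left) + len(right) * impurity(right)
            if best is None or score < best[0]:
                best = (score, i, theta)
    return best


xs = [(1.0, 5.0), (2.0, 4.0), (3.0, 7.0), (4.0, 6.0)]
ys = [0, 0, 1, 1]
print(best_stump(xs, ys))  # (0.0, 0, 2.5)
```

Ties are broken in favor of the first stump found, which is why feature 0 wins here even though feature 1 also admits a perfectly pure split.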

## Impurity Functions

### by $E_{\text{in}}$ of optimal constant

• regression error: $\text{impurity}(\mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \bar{y})^2$ with $\bar{y}$ = average of $\{y_n\}$

• classification error: $\text{impurity}(\mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} [\![ y_n \neq y^* ]\!]$ with $y^*$ = majority of $\{y_n\}$

### for classification

• Gini index: $1 - \sum_{k=1}^{K} \left( \frac{\sum_{n=1}^{N} [\![ y_n = k ]\!]}{N} \right)^2$ (all $k$ considered together)

• classification error: $1 - \max_{1 \le k \le K} \frac{\sum_{n=1}^{N} [\![ y_n = k ]\!]}{N}$ (optimal $k = y^*$ only)

**popular** choices: **Gini** for classification, **regression error** for regression
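The three impurity functions translate directly into code. The sketch below uses plain Python lists; the function names are illustrative.

```python
from collections import Counter


def regression_error(ys):
    """Mean squared deviation from the average of ys."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def classification_error(ys):
    """Fraction of labels that differ from the majority label."""
    majority_count = Counter(ys).most_common(1)[0][1]
    return 1.0 - majority_count / len(ys)


def gini_index(ys):
    """1 - sum_k (fraction of label k)^2, using all classes together."""
    n = len(ys)
    return 1.0 - sum((count / n) ** 2 for count in Counter(ys).values())


labels = [1, 1, 1, 2]
print(classification_error(labels))  # 0.25
print(gini_index(labels))            # 0.375
print(regression_error([1.0, 3.0]))  # 1.0
```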
## Termination in C&RT

function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$)

if termination criteria met, return base hypothesis $g_t(\mathbf{x}) = E_{\text{in}}$-optimal constant

else ...

1. learn branching criteria

$$b(\mathbf{x}) = \operatorname*{argmin}_{\text{decision stumps } h(\mathbf{x})} \sum_{c=1}^{2} |\mathcal{D}_c \text{ with } h| \cdot \text{impurity}(\mathcal{D}_c \text{ with } h)$$

### 'forced' to terminate when

• all $y_n$ the same: impurity $= 0 \Longrightarrow g_t(\mathbf{x}) = y_n$

• all $\mathbf{x}_n$ the same: **no decision stumps**

C&RT: **fully-grown tree** with constant leaves that come from **bi-branching** by **purifying**
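The two forced-termination conditions can be checked as follows. This is a sketch assuming each $\mathbf{x}_n$ is a sequence of hashable feature values; `cannot_branch` is a hypothetical helper name.

```python
def cannot_branch(xs, ys):
    """Return True when C&RT is forced to stop on this subset."""
    # All y_n the same: impurity is already 0.
    all_labels_same = len(set(ys)) == 1
    # All x_n the same: no decision stump can separate the examples.
    all_inputs_same = len(set(map(tuple, xs))) == 1
    return all_labels_same or all_inputs_same


print(cannot_branch([(0, 1), (2, 3)], [1, 1]))  # True
print(cannot_branch([(0, 1), (0, 1)], [1, 2]))  # True
print(cannot_branch([(0, 1), (2, 3)], [1, 2]))  # False
```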

## Fun Time

The Gini index is $1 - \sum_{k=1}^{K} \left( \frac{\sum_{n=1}^{N} [\![ y_n = k ]\!]}{N} \right)^2$. Consider $K = 2$, and let $\mu = \frac{N_1}{N}$, where $N_1$ is the number of examples with $y_n = 1$. Which of the following formulas of $\mu$ equals the Gini index in this case?

1. $2\mu(1 - \mu)$
2. $2\mu^2(1 - \mu)$
3. $2\mu(1 - \mu)^2$
4. $2\mu^2(1 - \mu)^2$

### Reference Answer: 1

Simplify $1 - (\mu^2 + (1 - \mu)^2)$ and the answer should pop up.
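Written out step by step, with class fractions $\mu$ and $1 - \mu$ for the two classes:

```latex
\begin{aligned}
1 - \bigl(\mu^2 + (1-\mu)^2\bigr)
  &= 1 - \mu^2 - (1 - 2\mu + \mu^2) \\
  &= 2\mu - 2\mu^2 \\
  &= 2\mu(1-\mu)
\end{aligned}
```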
## Basic C&RT Algorithm

function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$)

if cannot branch anymore, return $g_t(\mathbf{x}) = E_{\text{in}}$-optimal constant

else

1. learn branching criteria $b(\mathbf{x}) = \operatorname*{argmin}_{\text{decision stumps } h(\mathbf{x})} \sum_{c=1}^{2} |\mathcal{D}_c \text{ with } h| \cdot \text{impurity}(\mathcal{D}_c \text{ with } h)$
2. split $\mathcal{D}$ to 2 parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$
3. build sub-tree $G_c \leftarrow \text{DecisionTree}(\mathcal{D}_c)$
4. return $G(\mathbf{x}) = \sum_{c=1}^{2} [\![ b(\mathbf{x}) = c ]\!] \, G_c(\mathbf{x})$

easily handle binary classification, regression, & **multi-class classification**
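The four steps can be put together in a short recursive sketch for classification. It assumes Gini impurity, midpoint thresholds, and a nested-tuple tree representation; all of these, and the function names, are illustrative choices rather than the definitive C&RT implementation.

```python
from collections import Counter


def gini(ys):
    n = len(ys)
    return 1.0 - sum((c / n) ** 2 for c in Counter(ys).values())


def best_stump(xs, ys):
    """Return (feature, threshold) minimizing size-weighted Gini, or None."""
    best, best_score = None, float("inf")
    for i in range(len(xs[0])):
        values = sorted(set(x[i] for x in xs))
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            parts = ([], [])
            for x, y in zip(xs, ys):
                parts[1 if x[i] > theta else 0].append(y)
            score = sum(len(p) * gini(p) for p in parts)
            if score < best_score:
                best, best_score = (i, theta), score
    return best


def decision_tree(xs, ys):
    """Build a fully-grown classification tree as nested tuples."""
    stump = None if len(set(ys)) == 1 else best_stump(xs, ys)
    if stump is None:
        # Cannot branch anymore: return the E_in-optimal constant (majority).
        return ("leaf", Counter(ys).most_common(1)[0][0])
    i, theta = stump
    # Step 2: split D into 2 parts by the branching criterion b(x).
    left = [(x, y) for x, y in zip(xs, ys) if x[i] <= theta]
    right = [(x, y) for x, y in zip(xs, ys) if x[i] > theta]
    # Step 3: build each sub-tree G_c recursively on D_c.
    subtrees = [decision_tree(*zip(*part)) for part in (left, right)]
    # Step 4: G(x) routes x through b(x) to the matching sub-tree.
    return ("node", i, theta, subtrees[0], subtrees[1])


def predict(tree, x):
    """Evaluate G(x): follow b(x) down to a leaf."""
    while tree[0] == "node":
        _, i, theta, left, right = tree
        tree = left if x[i] <= theta else right
    return tree[1]


xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [0, 1, 1, 0]
tree = decision_tree(xs, ys)
print([predict(tree, x) for x in xs])  # [0, 1, 1, 0]
```

Swapping `gini` for a squared-error impurity and the majority vote for an average would give the regression variant under the same recursion.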

## Regularization by Pruning

fully-grown tree: $E_{\text{in}}(G) = 0$ if all $\mathbf{x}_n$ different, but **overfit** (large $E_{\text{out}}$) because **low-level trees built with small** $\mathcal{D}_c$

• need a **regularizer**, say, $\Omega(G) = \text{NumberOfLeaves}(G)$

• want **regularized** decision tree: $\operatorname*{argmin}_{\text{all possible } G} E_{\text{in}}(G) + \lambda \Omega(G)$, called **pruned** decision tree

• cannot enumerate all possible $G$ computationally; often consider only

  • $G^{(0)}$ = fully-grown tree

  • $G^{(i)} = \operatorname*{argmin}_{G} E_{\text{in}}(G)$ such that $G$ is **one-leaf removed** from $G^{(i-1)}$

systematic choice of $\lambda$? **validation**
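A rough sketch of the selection step, assuming the candidate trees $G^{(0)}, G^{(1)}, \ldots$ have already been built and are summarized by their number of leaves, $E_{\text{in}}$, and a validation error. The dictionary keys and all numbers below are made up for illustration and do not come from real data.

```python
def select_pruned(candidates, lam):
    """Index of the candidate minimizing E_in(G) + lam * NumberOfLeaves(G)."""
    return min(
        range(len(candidates)),
        key=lambda i: candidates[i]["e_in"] + lam * candidates[i]["leaves"],
    )


def select_lambda(candidates, lambdas):
    """Pick the lam whose regularized minimizer has lowest validation error."""
    return min(
        lambdas,
        key=lambda lam: candidates[select_pruned(candidates, lam)]["e_val"],
    )


# Hypothetical summaries of G^(0)..G^(3); each has one leaf fewer than the last.
candidates = [
    {"leaves": 4, "e_in": 0.00, "e_val": 0.30},
    {"leaves": 3, "e_in": 0.05, "e_val": 0.20},
    {"leaves": 2, "e_in": 0.15, "e_val": 0.25},
    {"leaves": 1, "e_in": 0.40, "e_val": 0.45},
]

print(select_pruned(candidates, 0.0))               # 0
print(select_pruned(candidates, 0.08))              # 1
print(select_lambda(candidates, [0.0, 0.08, 0.5]))  # 0.08
```

With $\lambda = 0$ the fully-grown tree wins on $E_{\text{in}}$ alone; a larger $\lambda$ shifts the choice toward trees with fewer leaves, and validation arbitrates between the values of $\lambda$ tried.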


