(1)

Machine Learning Techniques (機器學習技巧)

Lecture 9: Decision Tree

Hsuan-Tien Lin (林軒田)

Department of Computer Science & Information Engineering

National Taiwan University (國立台灣大學資訊工程系)

(2)

Agenda

Lecture 9: Decision Tree

Decision Tree Hypothesis

Decision Tree Algorithm

Decision Tree in Practice

Decision Tree in Action

(3)

Decision Tree Decision Tree Hypothesis

What We Have Done

blending: aggregate after getting $g_t$; learning: aggregate as well as getting $g_t$

aggregation type    blending            learning
uniform             voting/averaging    Bagging
non-uniform         linear              AdaBoost
conditional         stacking            Decision Tree

decision tree: a traditional learning model that realizes conditional aggregation

(4)

Decision Tree for Playing Golf

$G(x) = \sum_{t=1}^{T} q_t(x) \cdot g_t(x)$

• base hypothesis $g_t(x)$: leaf at end of path $t$, a constant here
• condition $q_t(x)$: $\llbracket \text{is } x \text{ on path } t\text{?} \rrbracket$
• usually with simple internal nodes

decision tree: arguably one of the most human-mimicking models

(5)

Recursive View of Decision Tree

Path View: $G(x) = \sum_{t=1}^{T} \llbracket x \text{ on path } t \rrbracket \cdot \text{leaf}_t(x)$

Recursive View: $G(x) = \sum_{c=1}^{C} \llbracket b(x) = c \rrbracket \cdot G_c(x)$

• $G(x)$: full-tree hypothesis
• $b(x)$: branching criteria
• $G_c(x)$: sub-tree hypothesis at the $c$-th branch

tree = (root, subtrees), just like what your data structure instructor would say :-)
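The two views can be checked against each other on a small example. The following is a minimal sketch, not the lecture's code: the `Leaf`/`Node` classes, the function names, and the toy tree with its features and thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    value: str  # constant base hypothesis g_t(x)


@dataclass
class Node:
    branch: Callable[[dict], int]             # b(x): returns a branch index c
    subtrees: Dict[int, Union["Node", Leaf]]  # c -> sub-tree G_c


def predict_recursive(tree, x):
    """Recursive view: follow the branch c = b(x), then ask G_c."""
    if isinstance(tree, Leaf):
        return tree.value
    return predict_recursive(tree.subtrees[tree.branch(x)], x)


def paths(tree, conditions=()):
    """Path view: yield (conditions, leaf value) for every root-to-leaf path t."""
    if isinstance(tree, Leaf):
        yield conditions, tree.value
        return
    for c, subtree in tree.subtrees.items():
        yield from paths(subtree, conditions + ((tree.branch, c),))


def predict_by_paths(tree, x):
    """G(x) = sum_t q_t(x) * g_t(x): exactly one path has q_t(x) = 1."""
    for conditions, value in paths(tree):
        if all(b(x) == c for b, c in conditions):
            return value
    raise ValueError("x matches no path")


# Hypothetical two-level tree; the features and thresholds are made up.
toy = Node(
    branch=lambda x: 1 if x["raining"] else 2,
    subtrees={
        1: Leaf("stay home"),
        2: Node(
            branch=lambda x: 1 if x["wind_kmh"] <= 20 else 2,
            subtrees={1: Leaf("play golf"), 2: Leaf("stay home")},
        ),
    },
)

x = {"raining": False, "wind_kmh": 10}
print(predict_recursive(toy, x), predict_by_paths(toy, x))
```

Both functions return the same leaf for any input, because the conditions collected along a path are exactly the branch decisions the recursion would take.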

(6)

Disclaimers about Decision Tree

Usefulness
• human-explainable: widely used in business/medical data analysis
• simple: even freshmen can implement one :-)
• efficient in prediction and training

However...
• heuristic: mostly with little theoretical explanation
• heuristics: 'heuristics selection' confusing to beginners
• arguably no single representative algorithm

decision tree: mostly heuristic but useful on its own

(7)

Fun Time

(8)

A Basic Decision Tree Algorithm

$G(x) = \sum_{c=1}^{C} \llbracket b(x) = c \rrbracket \, G_c(x)$

function DecisionTree(data $D = \{(x_n, y_n)\}_{n=1}^{N}$)
  if termination criteria met
    return base hypothesis $g_t(x)$
  else
    1. learn branching criteria $b(x)$
    2. split $D$ to $C$ parts $D_c = \{(x_n, y_n) : b(x_n) = c\}$
    3. build sub-tree $G_c \leftarrow$ DecisionTree($D_c$)
    4. return $G(x) = \sum_{c=1}^{C} \llbracket b(x) = c \rrbracket \, G_c(x)$

four choices: number of branches, branching criteria, termination criteria, & base hypothesis
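The recursion can be written once, with the four choices passed in as parameters. This is a rough sketch under assumptions: the callable names (`terminate`, `base_hypothesis`, `learn_branching`) and the particular choices plugged in for scalar inputs are hypothetical, picked only to make the skeleton run.

```python
from collections import Counter


def decision_tree(data, terminate, base_hypothesis, learn_branching):
    """Generic recursion over a list of (x, y) pairs.

    The three callables carry the design choices:
      terminate(data)        -> bool
      base_hypothesis(data)  -> function x -> prediction (the leaf g_t)
      learn_branching(data)  -> (b, C) where b maps x to a branch in 1..C
    """
    if terminate(data):
        return base_hypothesis(data)
    b, C = learn_branching(data)                                    # step 1
    parts = {c: [(x, y) for x, y in data if b(x) == c]              # step 2
             for c in range(1, C + 1)}
    subtrees = {c: decision_tree(part, terminate, base_hypothesis,  # step 3
                                 learn_branching)
                for c, part in parts.items()}
    return lambda x: subtrees[b(x)](x)                              # step 4


# One hypothetical set of choices for scalar inputs.
def all_same(data):
    """Stop when every label or every input is identical."""
    return len({y for _, y in data}) <= 1 or len({x for x, _ in data}) <= 1


def majority_leaf(data):
    label = Counter(y for _, y in data).most_common(1)[0][0]
    return lambda x: label


def middle_split(data):
    """Binary split between the two middle distinct input values."""
    xs = sorted({x for x, _ in data})
    theta = (xs[len(xs) // 2 - 1] + xs[len(xs) // 2]) / 2
    return (lambda x: 1 if x <= theta else 2), 2


data = [(1, "no"), (2, "no"), (3, "yes"), (4, "yes")]
G = decision_tree(data, all_same, majority_leaf, middle_split)
print(G(1.5), G(3.5))
```

The split used here ignores the labels entirely; it only shows where a branching rule plugs in.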

(9)

Decision Tree Decision Tree Algorithm

Classification and Regression Tree (C&RT)

function DecisionTree(data $D = \{(x_n, y_n)\}_{n=1}^{N}$)
  if termination criteria met
    return base hypothesis $g_t(x)$
  else ...
    2. split $D$ to $C$ parts $D_c = \{(x_n, y_n) : b(x_n) = c\}$

two simple choices
• $C = 2$ (binary tree)
• $g_t(x)$ = $E_{\text{in}}$-optimal constant
  • binary/multiclass classification (0/1 error): majority of $\{y_n\}$
  • regression (squared error): average of $\{y_n\}$

disclaimer: C&RT here is based on selected components of CART^TM of California Statistical Software
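The two leaf constants are easy to check numerically. For squared error, setting the derivative of $\sum_n (y_n - c)^2$ with respect to $c$ to zero gives $c$ = the average; for 0/1 error, predicting the most frequent label misclassifies the fewest examples. A small sketch, with made-up labels and targets:

```python
from collections import Counter


def majority(ys):
    """Constant minimizing the 0/1 error on ys."""
    return Counter(ys).most_common(1)[0][0]


def average(ys):
    """Constant minimizing the squared error on ys."""
    return sum(ys) / len(ys)


def squared_error(ys, c):
    return sum((y - c) ** 2 for y in ys) / len(ys)


labels = ["x", "o", "x", "x"]
targets = [1.0, 2.0, 6.0]
print(majority(labels))   # the most frequent label
print(average(targets))   # the mean of the targets

# No other constant tried here does better than the average.
c_star = average(targets)
assert all(squared_error(targets, c_star) <= squared_error(targets, c)
           for c in (0.0, 2.0, 2.9, 3.1, 6.0))
```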

(10)

Branching in C&RT: Purifying

function DecisionTree(data $D = \{(x_n, y_n)\}_{n=1}^{N}$)
  if termination criteria met
    return base hypothesis $g_t(x)$ = $E_{\text{in}}$-optimal constant
  else ...
    1. learn branching criteria $b(x)$
    2. split $D$ to 2 parts $D_c = \{(x_n, y_n) : b(x_n) = c\}$

more simple choices
• simple internal node for $C = 2$: $\{1, 2\}$-output decision stump
• 'easier' subtree: branch by purifying

$b(x) = \operatorname*{argmin}_{\text{decision stumps } h(x)} \sum_{c=1}^{2} |D_c \text{ with } h| \cdot \text{impurity}(D_c \text{ with } h)$

C&RT: bi-branching by purifying
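The argmin over decision stumps can be done by brute force. The sketch below assumes stumps that threshold a single numerical feature, $h(x) = \llbracket x_i \le \theta \rrbracket + 1$, with candidate thresholds at midpoints between consecutive observed values, and uses the Gini index as the impurity. The function names and the data are illustrative, not taken from the lecture.

```python
def gini(ys):
    n = len(ys)
    return 1.0 - sum((ys.count(k) / n) ** 2 for k in set(ys))


def best_stump(data, impurity=gini):
    """Return (score, i, theta) minimizing sum_c |D_c| * impurity(D_c).

    data is a list of (x, y) pairs with x a tuple of numbers.
    Returns None when no stump can split the data (all x_n identical).
    """
    best = None
    for i in range(len(data[0][0])):
        values = sorted({x[i] for x, _ in data})
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            parts = {1: [], 2: []}
            for x, y in data:
                parts[int(x[i] <= theta) + 1].append(y)   # h(x) in {1, 2}
            score = sum(len(p) * impurity(p) for p in parts.values())
            if best is None or score < best[0]:
                best = (score, i, theta)
    return best


# Made-up data: feature 0 separates the classes, feature 1 does not.
data = [((1.0, 5.0), "a"), ((2.0, 1.0), "a"),
        ((3.0, 4.0), "b"), ((4.0, 2.0), "b")]
print(best_stump(data))
```

Weighting each part's impurity by its size keeps a stump from looking good merely because it isolates one example into a tiny pure part.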

(11)

Impurity Functions

by $E_{\text{in}}$ of optimal constant
• regression error: $\text{impurity}(D) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \bar{y})^2$ with $\bar{y}$ = average of $\{y_n\}$
• classification error: $\text{impurity}(D) = \frac{1}{N} \sum_{n=1}^{N} \llbracket y_n \neq y^* \rrbracket$ with $y^*$ = majority of $\{y_n\}$

for classification
• Gini index: $1 - \sum_{k=1}^{K} \left( \frac{\sum_{n=1}^{N} \llbracket y_n = k \rrbracket}{N} \right)^2$ (all $k$ considered together)
• classification error: $1 - \max_{1 \le k \le K} \frac{\sum_{n=1}^{N} \llbracket y_n = k \rrbracket}{N}$ (optimal $k = y^*$ only)

Gini for classification, regression error for regression
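The three impurity functions translate directly into code. A minimal sketch, with function names and sample values chosen here for illustration:

```python
from collections import Counter


def regression_error(ys):
    """Mean squared deviation from the average of ys."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def classification_error(ys):
    """Fraction of ys that differ from the majority label."""
    return 1.0 - max(Counter(ys).values()) / len(ys)


def gini_index(ys):
    """One minus the sum of squared class proportions."""
    n = len(ys)
    return 1.0 - sum((count / n) ** 2 for count in Counter(ys).values())


pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "a", "b"]
print(classification_error(pure), gini_index(pure))
print(classification_error(mixed), gini_index(mixed))
print(regression_error([1.0, 2.0, 3.0]))
```

Both classification impurities are zero on a pure set. They differ on mixed sets because the Gini index depends on every class proportion, while the classification error looks only at the largest one.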

(12)

Termination in C&RT

function DecisionTree(data $D = \{(x_n, y_n)\}_{n=1}^{N}$)
  if termination criteria met
    return base hypothesis $g_t(x)$ = $E_{\text{in}}$-optimal constant
  else ...
    1. learn branching criteria
       $b(x) = \operatorname*{argmin}_{\text{decision stumps } h(x)} \sum_{c=1}^{2} |D_c \text{ with } h| \cdot \text{impurity}(D_c \text{ with } h)$

'forced' to terminate when
• all $y_n$ the same: impurity = 0 $\Longrightarrow$ $g_t(x) = y_n$
• all $x_n$ the same: no decision stumps

C&RT: fully-grown tree with constant leaves that come from bi-branching by purifying

(13)

Fun Time

(14)

Basic C&RT Algorithm

function DecisionTree(data $D = \{(x_n, y_n)\}_{n=1}^{N}$)
  if cannot branch anymore
    return $g_t(x)$ = $E_{\text{in}}$-optimal constant
  else
    1. learn branching criteria
       $b(x) = \operatorname*{argmin}_{\text{decision stumps } h(x)} \sum_{c=1}^{2} |D_c \text{ with } h| \cdot \text{impurity}(D_c \text{ with } h)$
    2. split $D$ to 2 parts $D_c = \{(x_n, y_n) : b(x_n) = c\}$
    3. build sub-tree $G_c \leftarrow$ DecisionTree($D_c$)
    4. return $G(x) = \sum_{c=1}^{2} \llbracket b(x) = c \rrbracket \, G_c(x)$

easily handle binary classification, regression, & multi-class classification
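Put together, the algorithm fits in a few dozen lines. The sketch below is one possible reading of the pseudocode, not the reference CART implementation: it assumes numerical features, midpoint thresholds, and a tuple-based tree representation, all of which are choices made here for illustration.

```python
from collections import Counter


def gini(ys):
    n = len(ys)
    return 1.0 - sum((count / n) ** 2 for count in Counter(ys).values())


def squared_impurity(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def majority(ys):
    return Counter(ys).most_common(1)[0][0]


def average(ys):
    return sum(ys) / len(ys)


def cart(data, impurity=gini, leaf_value=majority):
    """Grow a full tree from a list of (x, y) pairs, x a tuple of numbers."""
    ys = [y for _, y in data]
    best = None
    if len(set(ys)) > 1:                      # otherwise already pure
        for i in range(len(data[0][0])):
            values = sorted({x[i] for x, _ in data})
            for lo, hi in zip(values, values[1:]):
                theta = (lo + hi) / 2
                low = [(x, y) for x, y in data if x[i] <= theta]
                high = [(x, y) for x, y in data if x[i] > theta]
                score = (len(low) * impurity([y for _, y in low])
                         + len(high) * impurity([y for _, y in high]))
                if best is None or score < best[0]:
                    best = (score, i, theta, low, high)
    if best is None:                          # cannot branch anymore
        return ("leaf", leaf_value(ys))
    _, i, theta, low, high = best
    return ("node", i, theta,
            cart(low, impurity, leaf_value),
            cart(high, impurity, leaf_value))


def predict(tree, x):
    while tree[0] == "node":
        _, i, theta, low, high = tree
        tree = low if x[i] <= theta else high
    return tree[1]


# XOR-like toy data (made up): no single stump separates it.
data = [((0, 0), "o"), ((0, 1), "x"), ((1, 0), "x"), ((1, 1), "o")]
tree = cart(data)
print(all(predict(tree, x) == y for x, y in data))

# Same code for regression: swap the impurity and the leaf constant.
reg_data = [((1.0,), 1.0), ((2.0,), 1.0), ((3.0,), 5.0)]
reg_tree = cart(reg_data, impurity=squared_impurity, leaf_value=average)
print(predict(reg_tree, (0.0,)), predict(reg_tree, (10.0,)))
```

Only the impurity function and the leaf constant change between classification and regression, which is why one routine covers all three problem types.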

(15)

Decision Tree Decision Tree in Practice

Regularization by Pruning

fully-grown tree: $E_{\text{in}}(G) = 0$ if all $x_n$ different, but overfit (large $E_{\text{out}}$) because low-level trees built with small $D_c$

• need a regularizer, say, $\Omega(G) = \text{NumberOfLeaves}(G)$
• want regularized decision tree: $\operatorname*{argmin}_{\text{all possible } G} E_{\text{in}}(G) + \lambda \Omega(G)$, called pruned decision tree
• cannot enumerate all possible $G$ computationally: often consider only
  • $G^{(0)}$ = fully-grown tree
  • $G^{(i)} = \operatorname*{argmin}_{G} E_{\text{in}}(G)$ such that $G$ is one-leaf removed from $G^{(i-1)}$

systematic choice of $\lambda$? validation
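The sequence $G^{(0)}, G^{(1)}, \ldots$ can be generated greedily. The sketch below makes several assumptions that the slide does not spell out: "one-leaf removed" is read as collapsing an internal node whose two children are both leaves, each internal node stores the constant it would predict if collapsed, and the tree is a hand-built dictionary rather than the output of a real training run. It picks from the sequence for a fixed $\lambda$; choosing $\lambda$ itself by validation is not shown.

```python
def leaf(value):
    return {"value": value}


def node(i, theta, low, high, value):
    # value: constant to predict if this node is collapsed into a leaf
    return {"i": i, "theta": theta, "low": low, "high": high, "value": value}


def predict(tree, x):
    while "low" in tree:
        tree = tree["low"] if x[tree["i"]] <= tree["theta"] else tree["high"]
    return tree["value"]


def e_in(tree, data):
    return sum(predict(tree, x) != y for x, y in data) / len(data)


def num_leaves(tree):
    if "low" not in tree:
        return 1
    return num_leaves(tree["low"]) + num_leaves(tree["high"])


def one_leaf_removed(tree):
    """Yield each tree obtained by collapsing one node that has two leaf children."""
    if "low" not in tree:
        return
    if "low" not in tree["low"] and "low" not in tree["high"]:
        yield leaf(tree["value"])
        return
    for side in ("low", "high"):
        for pruned in one_leaf_removed(tree[side]):
            yield {**tree, side: pruned}


def pruning_path(tree, data):
    """G(0), G(1), ...: each step removes the leaf that hurts E_in least."""
    path = [tree]
    while "low" in path[-1]:
        path.append(min(one_leaf_removed(path[-1]),
                        key=lambda g: e_in(g, data)))
    return path


def regularized_choice(path, data, lam):
    return min(path, key=lambda g: e_in(g, data) + lam * num_leaves(g))


# Made-up 1-D data and a hand-built tree that fits it perfectly.
data = [((1,), "a"), ((2,), "b"), ((3,), "a"), ((4,), "b"), ((5,), "b")]
full = node(0, 3.5,
            node(0, 1.5, leaf("a"),
                 node(0, 2.5, leaf("b"), leaf("a"), "b"), "a"),
            leaf("b"), "b")

path = pruning_path(full, data)
print([(num_leaves(g), e_in(g, data)) for g in path])
print(num_leaves(regularized_choice(path, data, lam=0.15)))
```

With $\lambda = 0$ the fully-grown tree wins; larger $\lambda$ favors trees further along the path.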

(16)

Branching on Categorical Features

numerical features
blood pressure: 130, 98, 115, 147, 120

branching for numerical: decision stump
$b(x) = \llbracket x_i \le \theta \rrbracket + 1$ with $\theta \in \mathbb{R}$

categorical features
major symptom: fever, pain, tired, sweaty

branching for categorical: decision subset
$b(x) = \llbracket x_i \in S \rrbracket + 1$ with $S \subset \{1, 2, \ldots, K\}$

C&RT (& general decision trees): handles categorical features easily
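For a categorical feature, the search over thresholds becomes a search over subsets $S$. The sketch below enumerates every non-empty proper subset of the observed categories, which is $2^K - 2$ candidates and therefore only practical for small $K$. The labels attached to the symptoms are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def gini(ys):
    n = len(ys)
    return 1.0 - sum((count / n) ** 2 for count in Counter(ys).values())


def best_subset(data):
    """Return (score, S) for the decision subset b(x) = [x in S] + 1.

    data is a list of (category, label) pairs; the score is the
    size-weighted Gini impurity of the two resulting parts.
    """
    categories = sorted({x for x, _ in data})
    best = None
    for size in range(1, len(categories)):        # non-empty proper subsets
        for S in combinations(categories, size):
            inside = [y for x, y in data if x in S]
            outside = [y for x, y in data if x not in S]
            score = len(inside) * gini(inside) + len(outside) * gini(outside)
            if best is None or score < best[0]:
                best = (score, frozenset(S))
    return best


# Hypothetical labels for the symptom feature.
data = [("fever", "sick"), ("pain", "sick"), ("tired", "well"),
        ("sweaty", "well"), ("fever", "sick"), ("tired", "well")]
score, S = best_subset(data)
print(score, sorted(S))
```

A subset and its complement induce the same split, so half of the candidates are redundant; the sketch keeps both for simplicity.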

(17)

Missing Features by Surrogate Branch

possible $b(x) = \llbracket \text{weight} \le 50\text{kg} \rrbracket$; if weight missing during prediction:

• what would human do?
  • go get weight
  • or, use threshold on height instead, because threshold on height ≈ threshold on weight
• surrogate branch:
  • maintain surrogate branch $b_1(x), b_2(x), \ldots \approx$ best branch $b(x)$ during training
  • allows missing feature for $b(x)$ during prediction by using surrogate instead

C&RT: handles missing features easily
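One simple way to realize "≈ best branch" is to pick, on another feature, the threshold whose stump agrees with the best branch on the most training examples. The sketch below does that for a single surrogate. The agreement criterion, the function names, and the weight/height values are assumptions for illustration; a fuller version would keep several ranked surrogates and could also consider flipped stumps.

```python
def best_surrogate(data, primary, feature):
    """Return (agreement count, theta) for the stump [x[feature] <= theta]
    that matches primary(x) on the most training examples."""
    values = sorted({x[feature] for x in data})
    best = None
    for lo, hi in zip(values, values[1:]):
        theta = (lo + hi) / 2
        agree = sum((x[feature] <= theta) == primary(x) for x in data)
        if best is None or agree > best[0]:
            best = (agree, theta)
    return best


def branch(x, primary_feature, primary_theta, surrogate_feature, surrogate_theta):
    """b(x) in {1, 2}; falls back to the surrogate when the primary is missing."""
    if x.get(primary_feature) is not None:
        return int(x[primary_feature] <= primary_theta) + 1
    return int(x[surrogate_feature] <= surrogate_theta) + 1


# Made-up training examples (weight in kg, height in cm).
train = [
    {"weight": 45, "height": 150},
    {"weight": 48, "height": 155},
    {"weight": 52, "height": 165},
    {"weight": 60, "height": 170},
    {"weight": 49, "height": 168},
]
agreement, theta_height = best_surrogate(
    train, primary=lambda x: x["weight"] <= 50, feature="height")
print(agreement, theta_height)

# Prediction with the weight missing uses the height threshold instead.
print(branch({"weight": None, "height": 158}, "weight", 50, "height", theta_height))
```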

(18)

A Simple Data Set

(19)

Decision Tree Decision Tree in Action

(28)

A Complicated Data Set

(29)

Practical Specialties of C&RT

• human-explainable
• multiclass easily
• categorical features easily
• missing features easily
(almost no other learning model shares all such specialties, except for other decision trees)

another popular decision tree algorithm: C4.5, with different choices of heuristics

(30)

Fun Time

(31)
