Academic year: 2022


Machine Learning Techniques (機器學習技巧)

Lecture 9: Decision Tree

Hsuan-Tien Lin (林軒田)

Department of Computer Science & Information Engineering

National Taiwan University (國立台灣大學資訊工程系)


Agenda

Lecture 9: Decision Tree

Decision Tree Hypothesis

Decision Tree Algorithm

Decision Tree in Practice

Decision Tree in Action


Decision Tree Decision Tree Hypothesis

What We Have Done

blending: aggregate after getting $g_t$; learning: aggregate as well as getting $g_t$

aggregation type    blending            learning
uniform             voting/averaging    Bagging
non-uniform         linear              AdaBoost
conditional         stacking            Decision Tree

decision tree: a traditional learning model that realizes conditional aggregation


Decision Tree for Playing Golf

$G(x) = \sum_{t=1}^{T} q_t(x) \cdot g_t(x)$

• base hypothesis $g_t(x)$: leaf at end of path $t$, a constant here
• condition $q_t(x)$: $[\![ \text{is } x \text{ on path } t\,? ]\!]$
• usually with simple internal nodes

decision tree: arguably one of the most human-mimicking models
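The path view can be sketched in Python. The tree here is hypothetical: the feature names (`outlook`, `humidity`, `windy`), the thresholds, and the leaf values are assumptions for illustration and are not taken from the slide's figure. Each path $t$ is stored as a pair of a condition $q_t$ and a constant leaf $g_t$.

```python
# Path view: G(x) = sum_t [[x is on path t]] * g_t, with constant leaves.
# The golf-style features, thresholds and leaf values are hypothetical.

PATHS = [
    # (condition q_t(x), constant leaf g_t)
    (lambda x: x["outlook"] == "sunny" and x["humidity"] <= 70, +1),
    (lambda x: x["outlook"] == "sunny" and x["humidity"] > 70, -1),
    (lambda x: x["outlook"] == "overcast", +1),
    (lambda x: x["outlook"] == "rainy" and not x["windy"], +1),
    (lambda x: x["outlook"] == "rainy" and x["windy"], -1),
]


def predict_by_paths(x):
    """Sum q_t(x) * g_t over all paths; exactly one q_t(x) equals 1."""
    return sum(int(q(x)) * g for q, g in PATHS)
```

Because the paths of a tree partition the input space, the sum always picks out a single leaf; the other terms are multiplied by zero.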


Recursive View of Decision Tree

Path View: $G(x) = \sum_{t=1}^{T} [\![ x \text{ on path } t ]\!] \cdot \text{leaf}_t(x)$

Recursive View: $G(x) = \sum_{c=1}^{C} [\![ b(x) = c ]\!] \cdot G_c(x)$

• $G(x)$: full-tree hypothesis
• $b(x)$: branching criteria
• $G_c(x)$: sub-tree hypothesis at the $c$-th branch

tree = (root, sub-trees), just like what your data structure instructor would say :-)
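The recursive view maps directly onto a recursive data structure. The following is a minimal sketch; the class name `Tree`, its fields, and the one-feature example are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Tree:
    """tree = (root, sub-trees); a leaf has no branching criterion."""
    constant: Optional[float] = None            # leaf value, used when b is None
    b: Optional[Callable[[dict], int]] = None   # branching criterion b(x) -> c
    subtrees: Dict[int, "Tree"] = field(default_factory=dict)  # c -> G_c

    def predict(self, x):
        # G(x) = sum_c [[b(x) = c]] * G_c(x): only the matching branch contributes.
        if self.b is None:
            return self.constant
        return self.subtrees[self.b(x)].predict(x)


# Hypothetical two-leaf tree branching on a single feature.
G = Tree(
    b=lambda x: 1 if x["humidity"] <= 70 else 2,
    subtrees={1: Tree(constant=+1), 2: Tree(constant=-1)},
)
```

Calling `G.predict(x)` follows one branch per level, which is why prediction cost grows with the depth of the tree rather than with its total size.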


Disclaimers about Decision Tree

Usefulness

• human-explainable: widely used in business/medical data analysis
• simple: even freshmen can implement one :-)
• efficient in prediction and training

However...

• heuristic: mostly little theoretical explanation
• heuristics: 'heuristics selection' confusing to beginners
• arguably no single representative algorithm

decision tree: mostly heuristic but useful on its own


A Basic Decision Tree Algorithm

$G(x) = \sum_{c=1}^{C} [\![ b(x) = c ]\!] \, G_c(x)$

function DecisionTree(data $D = \{(x_n, y_n)\}_{n=1}^{N}$)
    if termination criteria met
        return base hypothesis $g_t(x)$
    else
        1. learn branching criteria $b(x)$
        2. split $D$ to $C$ parts $D_c = \{(x_n, y_n) : b(x_n) = c\}$
        3. build sub-tree $G_c \leftarrow$ DecisionTree($D_c$)
        4. return $G(x) = \sum_{c=1}^{C} [\![ b(x) = c ]\!] \, G_c(x)$

four choices: number of branches, branching criteria, termination criteria, & base hypothesis
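The recursion can be sketched in Python with the design choices passed in as functions. The three callable interfaces and the toy choices in the usage part are assumptions for illustration, not a specific published algorithm; the number of branches $C$ is implied by the values that $b(x)$ takes on the data.

```python
def decision_tree(D, should_terminate, base_hypothesis, learn_branching):
    """Generic recursion; D is a list of (x, y) pairs.

    Hypothetical interfaces for the design choices:
      should_terminate(D) -> bool
      base_hypothesis(D)  -> function g(x)
      learn_branching(D)  -> function b(x) returning a branch id c
    """
    if should_terminate(D):
        return base_hypothesis(D)
    b = learn_branching(D)                      # step 1
    parts = {}
    for x, y in D:                              # step 2: D_c = {(x, y) : b(x) = c}
        parts.setdefault(b(x), []).append((x, y))
    subtrees = {                                # step 3: one sub-tree per part
        c: decision_tree(D_c, should_terminate, base_hypothesis, learn_branching)
        for c, D_c in parts.items()
    }
    return lambda x: subtrees[b(x)](x)          # step 4: only branch b(x) is used


# Toy usage on 1-D inputs; every choice below is an illustrative assumption.
def same_label(D):
    return len({y for _, y in D}) == 1


def constant_leaf(D):
    c = D[0][1]
    return lambda x: c


def mean_threshold(D):
    theta = sum(x for x, _ in D) / len(D)
    return lambda x: 1 if x <= theta else 2


D = [(1.0, -1), (2.0, -1), (3.0, +1), (4.0, +1)]
G = decision_tree(D, same_label, constant_leaf, mean_threshold)
```

The toy branching rule is only safe on data where differently labelled examples have different inputs; a real termination criterion also has to stop when no branch can split the data.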


Decision Tree Decision Tree Algorithm

Classification and Regression Tree (C&RT)

function DecisionTree(data $D = \{(x_n, y_n)\}_{n=1}^{N}$)
    if termination criteria met
        return base hypothesis $g_t(x)$
    else ...
        2. split $D$ to $C$ parts $D_c = \{(x_n, y_n) : b(x_n) = c\}$

two simple choices

• $C = 2$ (binary tree)
• $g_t(x)$ = $E_{\text{in}}$-optimal constant
    • binary/multiclass classification (0/1 error): majority of $\{y_n\}$
    • regression (squared error): average of $\{y_n\}$

disclaimer: C&RT here is based on selected components of CART™ of California Statistical Software
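The $E_{\text{in}}$-optimal constant for the two cases can be computed as in the short sketch below. The function name and the `task` argument are assumptions; ties in the majority vote are broken by first occurrence, which the slide does not specify.

```python
from collections import Counter
from statistics import mean


def optimal_constant(ys, task):
    """Return the constant that minimizes E_in on the labels ys.

    task == "classification": majority label (0/1 error)
    task == "regression":     average label (squared error)
    """
    if task == "classification":
        return Counter(ys).most_common(1)[0][0]
    if task == "regression":
        return mean(ys)
    raise ValueError(f"unknown task: {task}")
```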


Branching in C&RT: Purifying

function DecisionTree(data $D = \{(x_n, y_n)\}_{n=1}^{N}$)
    if termination criteria met
        return base hypothesis $g_t(x)$ = $E_{\text{in}}$-optimal constant
    else ...
        1. learn branching criteria $b(x)$
        2. split $D$ to 2 parts $D_c = \{(x_n, y_n) : b(x_n) = c\}$

more simple choices

• simple internal node for $C = 2$: $\{1, 2\}$-output decision stump
• 'easier' subtree: branch by purifying

$b(x) = \operatorname{argmin}_{\text{decision stumps } h(x)} \sum_{c=1}^{2} |D_c \text{ with } h| \cdot \text{impurity}(D_c \text{ with } h)$

C&RT: bi-branching by purifying
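The purifying search can be sketched as an exhaustive scan over decision stumps. The sketch assumes numerical features stored as tuples, candidate thresholds at midpoints between consecutive distinct values, and the Gini index as the default impurity; these are illustrative choices, not requirements stated on the slide.

```python
from collections import Counter


def gini(ys):
    """Gini index 1 - sum_k (fraction of label k)^2 of a non-empty label list."""
    n = len(ys)
    return 1.0 - sum((m / n) ** 2 for m in Counter(ys).values())


def best_stump(X, y, impurity=gini):
    """Scan stumps that split on x[i] <= theta and return the (i, theta)
    minimizing sum_c |D_c| * impurity(D_c), or None if no stump separates
    the data into two non-empty parts."""
    best, best_score = None, float("inf")
    for i in range(len(X[0])):
        values = sorted({x[i] for x in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            below = [t for x, t in zip(X, y) if x[i] <= theta]
            above = [t for x, t in zip(X, y) if x[i] > theta]
            if not below or not above:
                continue
            score = len(below) * impurity(below) + len(above) * impurity(above)
            if score < best_score:
                best, best_score = (i, theta), score
    return best
```

Weighting each part's impurity by its size keeps a stump from looking good merely because it isolates one tiny, perfectly pure part.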


Impurity Functions

by $E_{\text{in}}$ of optimal constant

• regression error: $\text{impurity}(D) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \bar{y})^2$ with $\bar{y}$ = average of $\{y_n\}$
• classification error: $\text{impurity}(D) = \frac{1}{N} \sum_{n=1}^{N} [\![ y_n \neq y^* ]\!]$ with $y^*$ = majority of $\{y_n\}$

for classification

• Gini index: $1 - \sum_{k=1}^{K} \left( \frac{\sum_{n=1}^{N} [\![ y_n = k ]\!]}{N} \right)^2$ (all $k$ considered together)
• classification error: $1 - \max_{1 \le k \le K} \frac{\sum_{n=1}^{N} [\![ y_n = k ]\!]}{N}$ (optimal $k = y^*$ only)

popular choices: Gini for classification, regression error for regression
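The three impurity functions translate into a few lines of Python each. This is a minimal sketch that assumes the labels arrive as a non-empty list; the function names are chosen here for illustration.

```python
from collections import Counter


def regression_error(ys):
    """(1/N) * sum_n (y_n - ybar)^2 with ybar the average of ys."""
    ybar = sum(ys) / len(ys)
    return sum((y - ybar) ** 2 for y in ys) / len(ys)


def classification_error(ys):
    """1 - max_k (fraction of label k): the error of predicting the majority."""
    return 1.0 - max(Counter(ys).values()) / len(ys)


def gini_index(ys):
    """1 - sum_k (fraction of label k)^2: all classes considered together."""
    n = len(ys)
    return 1.0 - sum((m / n) ** 2 for m in Counter(ys).values())
```

All three return 0 on a pure set (all labels equal), which matches the forced-termination case of C&RT.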


Termination in C&RT

function DecisionTree(data $D = \{(x_n, y_n)\}_{n=1}^{N}$)
    if termination criteria met
        return base hypothesis $g_t(x)$ = $E_{\text{in}}$-optimal constant
    else ...
        1. learn branching criteria
           $b(x) = \operatorname{argmin}_{\text{decision stumps } h(x)} \sum_{c=1}^{2} |D_c \text{ with } h| \cdot \text{impurity}(D_c \text{ with } h)$

'forced' to terminate when

• all $y_n$ the same: impurity $= 0 \Longrightarrow g_t(x) = y_n$
• all $x_n$ the same: no decision stumps

C&RT: fully-grown tree with constant leaves that come from bi-branching by purifying
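The two forced-termination conditions can be checked directly. A minimal sketch, assuming inputs are stored as tuples in `X` and labels in `y` (both names are assumptions):

```python
def cannot_branch(X, y):
    """True when C&RT is forced to stop on the data (X, y).

    - all y_n the same: impurity is 0, so the leaf returns that label
    - all x_n the same: no decision stump can separate the examples
    """
    all_y_same = all(t == y[0] for t in y)
    all_x_same = all(x == X[0] for x in X)
    return all_y_same or all_x_same
```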


Basic C&RT Algorithm

function DecisionTree(data $D = \{(x_n, y_n)\}_{n=1}^{N}$)
    if cannot branch anymore
        return $g_t(x)$ = $E_{\text{in}}$-optimal constant
    else
        1. learn branching criteria
           $b(x) = \operatorname{argmin}_{\text{decision stumps } h(x)} \sum_{c=1}^{2} |D_c \text{ with } h| \cdot \text{impurity}(D_c \text{ with } h)$
        2. split $D$ to 2 parts $D_c = \{(x_n, y_n) : b(x_n) = c\}$
        3. build sub-tree $G_c \leftarrow$ DecisionTree($D_c$)
        4. return $G(x) = \sum_{c=1}^{2} [\![ b(x) = c ]\!] \, G_c(x)$

easily handle binary classification, regression, & multi-class classification
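Putting the pieces together, a fully grown C&RT can be sketched as below. The nested-dict tree format, the midpoint thresholds, the branch numbering $[\![ x_i \le \theta ]\!] + 1$, and the function names are assumptions for illustration; the sketch favors clarity over speed.

```python
from collections import Counter


def gini(ys):
    n = len(ys)
    return 1.0 - sum((m / n) ** 2 for m in Counter(ys).values())


def squared(ys):
    ybar = sum(ys) / len(ys)
    return sum((t - ybar) ** 2 for t in ys) / len(ys)


def cart(X, y, task="classification"):
    """Grow a full tree. Leaf: {"leaf": constant}.
    Internal node: {"i": feature, "theta": threshold, 1: subtree, 2: subtree}."""
    if task == "classification":
        impurity, leaf = gini, Counter(y).most_common(1)[0][0]
    else:
        impurity, leaf = squared, sum(y) / len(y)
    if len(set(y)) == 1:                      # all y_n the same
        return {"leaf": leaf}
    best, best_score = None, float("inf")
    for i in range(len(X[0])):                # step 1: purifying stump search
        values = sorted({x[i] for x in X})
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            parts = {1: [], 2: []}
            for x, t in zip(X, y):
                parts[int(x[i] <= theta) + 1].append(t)
            if not parts[1] or not parts[2]:
                continue
            score = sum(len(p) * impurity(p) for p in parts.values())
            if score < best_score:
                best, best_score = (i, theta), score
    if best is None:                          # all x_n the same: no stump
        return {"leaf": leaf}
    i, theta = best
    tree = {"i": i, "theta": theta}
    for c in (1, 2):                          # steps 2 and 3: split and recurse
        idx = [n for n, x in enumerate(X) if int(x[i] <= theta) + 1 == c]
        tree[c] = cart([X[n] for n in idx], [y[n] for n in idx], task)
    return tree                               # step 4


def predict(tree, x):
    while "leaf" not in tree:
        tree = tree[int(x[tree["i"]] <= tree["theta"]) + 1]
    return tree["leaf"]
```

On data whose inputs are all distinct, the returned tree fits every training example, which is the $E_{\text{in}} = 0$ behavior of a fully grown tree.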


Decision Tree Decision Tree in Practice

Regularization by Pruning

• fully-grown tree: $E_{\text{in}}(G) = 0$ if all $x_n$ different
• but overfit (large $E_{\text{out}}$) because low-level trees built with small $D_c$

need a regularizer, say, $\Omega(G) = \text{NumberOfLeaves}(G)$

want regularized decision tree, called pruned decision tree:

$\operatorname{argmin}_{\text{all possible } G} \; E_{\text{in}}(G) + \lambda \Omega(G)$

• cannot enumerate all possible $G$ computationally; often consider only
    • $G^{(0)}$ = fully-grown tree
    • $G^{(i)} = \operatorname{argmin}_{G} E_{\text{in}}(G)$ such that $G$ is one-leaf removed from $G^{(i-1)}$

systematic choice of $\lambda$? validation
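The restricted search over $G^{(0)}, G^{(1)}, \ldots$ can be sketched as follows. Several things here are assumptions: trees use a nested-dict format (`{"leaf": c}` or `{"i", "theta", 1, 2}`), "one-leaf removed" is read as collapsing an internal node whose two children are leaves into a single majority leaf, and 0/1 error is used for $E_{\text{in}}$. Every node is assumed to be reached by at least one training example.

```python
import copy
from collections import Counter


def branch(node, x):
    return int(x[node["i"]] <= node["theta"]) + 1


def predict(tree, x):
    while "leaf" not in tree:
        tree = tree[branch(tree, x)]
    return tree["leaf"]


def e_in(tree, X, y):
    return sum(predict(tree, x) != t for x, t in zip(X, y)) / len(y)


def num_leaves(tree):
    return 1 if "leaf" in tree else num_leaves(tree[1]) + num_leaves(tree[2])


def collapsible_paths(tree, path=()):
    """Paths (sequences of branch ids) to nodes whose children are both leaves."""
    if "leaf" in tree:
        return []
    if "leaf" in tree[1] and "leaf" in tree[2]:
        return [path]
    return (collapsible_paths(tree[1], path + (1,))
            + collapsible_paths(tree[2], path + (2,)))


def collapse(tree, path, X, y):
    """Copy the tree and replace the node at `path` by a majority leaf."""
    new = copy.deepcopy(tree)
    node, data = new, list(zip(X, y))
    for c in path:
        data = [(x, t) for x, t in data if branch(node, x) == c]
        node = node[c]
    majority = Counter(t for _, t in data).most_common(1)[0][0]
    node.clear()
    node["leaf"] = majority
    return new


def pruning_sequence(tree, X, y):
    """G(0) = tree; G(i) = lowest-E_in tree one leaf removed from G(i-1)."""
    seq = [tree]
    while "leaf" not in seq[-1]:
        cands = [collapse(seq[-1], p, X, y) for p in collapsible_paths(seq[-1])]
        seq.append(min(cands, key=lambda G: e_in(G, X, y)))
    return seq


def select(seq, X, y, lam):
    """Pick the candidate minimizing E_in(G) + lam * NumberOfLeaves(G)."""
    return min(seq, key=lambda G: e_in(G, X, y) + lam * num_leaves(G))
```

In practice `select` would be run for several values of `lam`, and the value whose selected tree has the lowest error on held-out validation data would be kept.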


Branching on Categorical Features

numerical features

• blood pressure: 130, 98, 115, 147, 120
• branching for numerical: decision stump
  $b(x) = [\![ x_i \le \theta ]\!] + 1$ with $\theta \in \mathbb{R}$

categorical features

• major symptom: fever, pain, tired, sweaty
• branching for categorical: decision subset
  $b(x) = [\![ x_i \in S ]\!] + 1$ with $S \subset \{1, 2, \ldots, K\}$

C&RT (& general decision trees): handles categorical features easily
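Both kinds of branching function fit the same interface, which is why a tree can mix them freely. A minimal sketch; the factory names, the threshold 120, and the subset are assumptions for illustration.

```python
def decision_stump(i, theta):
    """b(x) = [[x_i <= theta]] + 1 for a numerical feature i."""
    return lambda x: int(x[i] <= theta) + 1


def decision_subset(i, S):
    """b(x) = [[x_i in S]] + 1 for a categorical feature i and a subset S."""
    S = frozenset(S)
    return lambda x: int(x[i] in S) + 1


# Hypothetical patient record: (blood pressure, major symptom).
b_num = decision_stump(0, 120)
b_cat = decision_subset(1, {"fever", "pain"})
```

For a categorical feature with $K$ values there are on the order of $2^K$ subsets to consider, so the search over $S$ is more expensive than the threshold search for a numerical feature.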


Missing Features by Surrogate Branch

possible $b(x) = [\![ \text{weight} \le 50\text{kg} ]\!]$

if weight missing during prediction:

• what would human do?
    • go get weight
    • or, use threshold on height instead, because threshold on height ≈ threshold on weight
• surrogate branch:
    • maintain surrogate branch $b_1(x), b_2(x), \ldots \approx$ best branch $b(x)$ during training
    • allows missing feature for $b(x)$ during prediction by using surrogate instead

C&RT: handles missing features easily
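One way to sketch surrogate branches is to rank candidate branches by how often they agree with $b(x)$ on the training inputs and to fall back to them in that order at prediction time. The agreement measure, the dict-based records with `None` for missing values, and all names and thresholds are assumptions for illustration.

```python
def agreement(b, s, X):
    """Fraction of training inputs that s sends to the same branch as b."""
    return sum(b(x) == s(x) for x in X) / len(X)


def rank_surrogates(b, candidates, X):
    """Sort (feature, branch function) pairs by agreement with b, best first."""
    return sorted(candidates, key=lambda fs: agreement(b, fs[1], X), reverse=True)


def branch_with_surrogates(x, b_feature, b, surrogates):
    """Use b(x) if its feature is present, else the first usable surrogate."""
    if x.get(b_feature) is not None:
        return b(x)
    for feature, s in surrogates:
        if x.get(feature) is not None:
            return s(x)
    raise ValueError("no usable feature for this branch")


# Hypothetical training inputs with complete features.
X_train = [
    {"weight": 45, "height": 150, "age": 20},
    {"weight": 48, "height": 158, "age": 40},
    {"weight": 60, "height": 170, "age": 25},
    {"weight": 70, "height": 165, "age": 50},
]
b = lambda x: int(x["weight"] <= 50) + 1
candidates = [
    ("age", lambda x: int(x["age"] <= 30) + 1),
    ("height", lambda x: int(x["height"] <= 160) + 1),
]
surrogates = rank_surrogates(b, candidates, X_train)
```

With this toy data the height threshold agrees with the weight threshold on every example, so it is tried first when weight is missing.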

Decision Tree Decision Tree in Action

A Simple Data Set

[figures not preserved in this copy]

A Complicated Data Set

[figure not preserved in this copy]

Practical Specialties of C&RT

• human-explainable
• multiclass easily
• categorical features easily
• missing features easily

(almost no other learning model shares all such specialties, except for other decision trees)

another popular decision tree algorithm: C4.5, with different choices of heuristics


Summary

Lecture 9: Decision Tree

Decision Tree Hypothesis

Decision Tree Algorithm

Decision Tree in Practice

Decision Tree in Action
