

Decision Tree: Decision Tree Algorithm

Classification and Regression Tree (C&RT)

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^N)
    if termination criteria met
        return base hypothesis g_t(x)
    else ...
        2  split D to C parts D_c = {(x_n, y_n) : b(x_n) = c}

two simple choices:
• C = 2 (binary tree)
• g_t(x) = E_in-optimal constant
  • binary/multiclass classification (0/1 error): majority of {y_n}
  • regression (squared error): average of {y_n}

disclaimer: C&RT here is based on selected components of CART™ of California Statistical Software

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 8/22
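
To make the "E_in-optimal constant" choice concrete, here is a minimal Python sketch of the constant leaf hypothesis under both error measures; the helper name constant_leaf and the NumPy-based interface are illustrative assumptions, not part of the lecture.

import numpy as np

def constant_leaf(y, task="classification"):
    """E_in-optimal constant g_t(x) for a leaf.

    classification (0/1 error): majority of {y_n}
    regression (squared error): average of {y_n}
    """
    if task == "classification":
        labels, counts = np.unique(y, return_counts=True)
        return labels[np.argmax(counts)]   # majority label
    return float(np.mean(y))               # mean target value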

Decision Tree: Decision Tree Algorithm

Branching in C&RT: Purifying

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^N)
    if termination criteria met
        return base hypothesis g_t(x) = E_in-optimal constant
    else ...
        1  learn branching criteria b(x)
        2  split D to 2 parts D_c = {(x_n, y_n) : b(x_n) = c}

more simple choices:
• simple internal node for C = 2: {1, 2}-output decision stump
• 'easier' sub-tree: branch by purifying

b(x) = argmin_{decision stumps h(x)} Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)

C&RT: bi-branching by purifying

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/22
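
One possible reading of the branching step in code, assuming the decision stumps are of the form "x_i ≤ θ" with thresholds at midpoints between consecutive feature values; best_stump is a hypothetical helper, and impurity can be any of the functions on the next slide.

import numpy as np

def best_stump(X, y, impurity):
    """Bi-branching by purifying: choose the stump that minimizes
    |D_1 with h| * impurity(D_1 with h) + |D_2 with h| * impurity(D_2 with h)."""
    best_score, best_feature, best_theta = np.inf, None, None
    for i in range(X.shape[1]):
        values = np.unique(X[:, i])
        for theta in (values[:-1] + values[1:]) / 2.0:   # midpoints between sorted values
            left = X[:, i] <= theta
            score = (left.sum() * impurity(y[left]) +
                     (~left).sum() * impurity(y[~left]))
            if score < best_score:
                best_score, best_feature, best_theta = score, i, theta
        # a feature with a single unique value yields no thresholds and is skipped
    return best_feature, best_theta   # branch with b(x): x[best_feature] <= best_theta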

Decision Tree: Decision Tree Algorithm

Impurity Functions

by E_in of optimal constant
• regression error: impurity(D) = (1/N) Σ_{n=1}^{N} (y_n − ȳ)², with ȳ = average of {y_n}
• classification error: impurity(D) = (1/N) Σ_{n=1}^{N} [[y_n ≠ y*]], with y* = majority of {y_n}

for classification
• Gini index: 1 − Σ_{k=1}^{K} ( (Σ_{n=1}^{N} [[y_n = k]]) / N )² (all k considered together)
• classification error: 1 − max_{1≤k≤K} (Σ_{n=1}^{N} [[y_n = k]]) / N (optimal k = y* only)

popular choices: Gini for classification, regression error for regression

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/22
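
The impurity functions above, written out as a small sketch; y is assumed to be a 1-D NumPy array and the function names are illustrative.

import numpy as np

def regression_impurity(y):
    """(1/N) * sum_n (y_n - y_bar)^2, with y_bar the average of {y_n}."""
    return float(np.mean((y - np.mean(y)) ** 2))

def gini_impurity(y):
    """1 - sum_k (fraction of y_n equal to k)^2, all classes considered together."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def classification_impurity(y):
    """1 - max_k (fraction of y_n equal to k): the 0/1 error of the majority label."""
    _, counts = np.unique(y, return_counts=True)
    return float(1.0 - counts.max() / counts.sum())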

Decision Tree: Decision Tree Algorithm

Termination in C&RT

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^N)
    if termination criteria met
        return base hypothesis g_t(x) = E_in-optimal constant
    else ...
        1  learn branching criteria
           b(x) = argmin_{decision stumps h(x)} Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)

'forced' to terminate when
• all y_n the same: impurity = 0 ⟹ g_t(x) = y_n
• all x_n the same: no decision stumps

C&RT: fully-grown tree with constant leaves that come from bi-branching by purifying

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 11/22
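
The two forced-termination conditions amount to a one-line check each; a sketch assuming X and y are NumPy arrays, with cannot_branch as a hypothetical helper name.

import numpy as np

def cannot_branch(X, y):
    """Forced termination: all y_n the same (impurity 0, return that label/value)
    or all x_n the same (no decision stump can split the data)."""
    return np.unique(y).size == 1 or bool(np.all(X == X[0]))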

Decision Tree: Decision Tree Algorithm

Fun Time

For the Gini index 1 − Σ_{k=1}^{K} ( (Σ_{n=1}^{N} [[y_n = k]]) / N )², consider K = 2 and let µ = N_1/N, where N_1 is the number of examples with y_n = 1. Which of the following formulas of µ equals the Gini index in this case?

1  2µ(1 − µ)
2  2µ²(1 − µ)
3  2µ(1 − µ)²
4  2µ²(1 − µ)²

Reference Answer: 1

Simplify 1 − (µ² + (1 − µ)²) and the answer should pop up.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 12/22
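
For completeness, the simplification suggested in the reference answer, written out:

\[
1 - \mu^2 - (1-\mu)^2 = 1 - \mu^2 - \bigl(1 - 2\mu + \mu^2\bigr) = 2\mu - 2\mu^2 = 2\mu(1-\mu).
\]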

Decision Tree: Decision Tree Heuristics in C&RT

Basic C&RT Algorithm

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^N)
    if cannot branch anymore
        return g_t(x) = E_in-optimal constant
    else
        1  learn branching criteria
           b(x) = argmin_{decision stumps h(x)} Σ_{c=1}^{2} |D_c with h| · impurity(D_c with h)
        2  split D to 2 parts D_c = {(x_n, y_n) : b(x_n) = c}
        3  build sub-tree G_c ← DecisionTree(D_c)
        4  return G(x) = Σ_{c=1}^{2} [[b(x) = c]] G_c(x)

easily handles binary classification, regression, and multi-class classification

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 13/22
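
Putting the four steps together, a compact recursive sketch of the basic C&RT algorithm; it reuses the hypothetical helpers cannot_branch, best_stump, constant_leaf, and gini_impurity from the earlier sketches and represents each internal node as a ("node", feature, threshold, left, right) tuple.

def decision_tree(X, y, impurity=gini_impurity, task="classification"):
    """Fully-grown C&RT tree: constant leaves from bi-branching by purifying."""
    if cannot_branch(X, y):
        return ("leaf", constant_leaf(y, task))            # g_t(x) = E_in-optimal constant
    i, theta = best_stump(X, y, impurity)                  # 1. learn branching criterion b(x)
    left = X[:, i] <= theta                                # 2. split D into D_1 and D_2
    G1 = decision_tree(X[left], y[left], impurity, task)   # 3. build the two sub-trees
    G2 = decision_tree(X[~left], y[~left], impurity, task)
    return ("node", i, theta, G1, G2)                      # 4. G(x) routes x to the matching sub-tree

def predict(tree, x):
    """Evaluate G(x) by following the branching criteria down to a constant leaf."""
    while tree[0] == "node":
        _, i, theta, G1, G2 = tree
        tree = G1 if x[i] <= theta else G2
    return tree[1]

Because the leaves are E_in-optimal constants under either error measure, the same recursion handles binary classification, multi-class classification, and regression, matching the claim above.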

Decision Tree: Decision Tree Heuristics in C&RT

Regularization by Pruning

• fully-grown tree: E_in(G) = 0 if all x_n are different, but it overfits (large E_out) because low-level trees are built with small D_c
• need a regularizer, say, Ω(G) = NumberOfLeaves(G)
• want a regularized decision tree: argmin over all possible G of E_in(G) + λΩ(G), called a pruned decision tree
• cannot enumerate all possible G computationally; often consider only
  • G^(0) = fully-grown tree
  • G^(i) = argmin_G E_in(G) such that G is one leaf removed from G^(i−1)
• systematic choice of λ? validation

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 14/22
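
A sketch of the greedy pruning sequence G^(0), G^(1), ... described above, where removing one leaf means collapsing one internal node whose children are both leaves into a single constant leaf. It builds on the earlier hypothetical tree representation and helpers, and the 0/1-error E_in shown here is an assumption for the classification case.

import numpy as np

def prune_once_candidates(tree, X, y, task="classification"):
    """All trees that are one leaf removed from `tree`."""
    if tree[0] == "leaf":
        return []
    _, i, theta, G1, G2 = tree
    left = X[:, i] <= theta
    candidates = []
    if G1[0] == "leaf" and G2[0] == "leaf":
        candidates.append(("leaf", constant_leaf(y, task)))        # collapse this node
    for sub in prune_once_candidates(G1, X[left], y[left], task):
        candidates.append(("node", i, theta, sub, G2))
    for sub in prune_once_candidates(G2, X[~left], y[~left], task):
        candidates.append(("node", i, theta, G1, sub))
    return candidates

def prune_path(tree, X, y, task="classification"):
    """G^(0) = fully-grown tree; G^(i) = the E_in-best tree one leaf removed from G^(i-1)."""
    Ein = lambda t: np.mean([predict(t, x) != yn for x, yn in zip(X, y)])  # 0/1 in-sample error
    path = [tree]
    while True:
        candidates = prune_once_candidates(path[-1], X, y, task)
        if not candidates:
            break
        path.append(min(candidates, key=Ein))
    return path

Choosing among the trees on this path with a validation set is the systematic way to pick λ mentioned on the slide.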
