(1)

Machine Learning Techniques (機器學習技法)

Lecture 10: Random Forest

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University

(國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/22

(2)

Random Forest

Roadmap

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

Lecture 9: Decision Tree
recursive branching (purification) for conditional aggregation of constant hypotheses

Lecture 10: Random Forest
• Random Forest Algorithm
• Out-Of-Bag Estimate
• Feature Selection
• Random Forest in Action

3 Distilling Implicit Features: Extraction Models

(3)

Random Forest Random Forest Algorithm

Recall: Bagging and Decision Tree

Bagging

function Bag(D, A)
    For t = 1, 2, . . . , T
        1 request size-N' data D̃_t by bootstrapping with D
        2 obtain base g_t by A(D̃_t)
    return G = Uniform({g_t})

—reduces variance by voting/averaging

Decision Tree

function DTree(D)
    if termination return base g_t
    else
        1 learn b(x) and split D to D_c by b(x)
        2 build G_c ← DTree(D_c)
        3 return G(x) = Σ_{c=1}^{C} [[ b(x) = c ]] G_c(x)

—large variance, especially if fully-grown

putting them together? (i.e. aggregation of aggregation :-) )

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 2/22

(4)

Random Forest Random Forest Algorithm

Random Forest (RF)

random forest (RF) = bagging + fully-grown C&RT decision tree

function RandomForest(D)
    For t = 1, 2, . . . , T
        1 request size-N' data D̃_t by bootstrapping with D
        2 obtain tree g_t by DTree(D̃_t)
    return G = Uniform({g_t})

function DTree(D)
    if termination return base g_t
    else
        1 learn b(x) and split D to D_c by b(x)
        2 build G_c ← DTree(D_c)
        3 return G(x) = Σ_{c=1}^{C} [[ b(x) = c ]] G_c(x)

• highly parallel/efficient to learn
• inherit pros of C&RT
• eliminate cons of fully-grown tree
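
For concreteness, a minimal Python sketch of this recipe (not the course's official code): scikit-learn's DecisionTreeClassifier stands in for fully-grown C&RT, labels are assumed to be ±1, and the bootstrap indices are stored because the OOB estimate later needs them.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, T=100, seed=0):
    """Bagging + fully-grown trees: for each t, bootstrap a size-N sample and fit a tree."""
    rng = np.random.default_rng(seed)
    N = len(X)
    trees, boot_indices = [], []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)       # bootstrap sample D̃_t (here N' = N)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # fully grown: no depth limit
        boot_indices.append(idx)               # remembered for the OOB estimate later
    return trees, boot_indices

def random_forest_predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])
    return np.sign(votes.sum(axis=0))          # uniform vote G = Uniform({g_t}), labels in {-1, +1}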

(5)

Random Forest Random Forest Algorithm

Diversifying by Feature Projection

recall: data randomness for diversity in bagging
—randomly sample N' examples from D

another possibility for diversity:
—randomly sample d' features from x

when sampling index i_1, i_2, . . . , i_{d'}: Φ(x) = (x_{i_1}, x_{i_2}, . . . , x_{i_{d'}})
• Z ∈ R^{d'}: a random subspace of X ∈ R^{d}
• often d' ≪ d, efficient for large d
  —can be generally applied on other models
• original RF: re-sample new subspace for each b(x) in C&RT

RF = bagging + random-subspace C&RT

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/22
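
In code, the per-split subspace idea is small: sampling d' of the d coordinates once gives the transform Φ(x), while re-sampling a fresh subset at every split is what scikit-learn's max_features argument already does inside a tree. A minimal sketch with illustrative dimensions:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

d, d_prime = 20, 4                                # illustrative sizes with d' << d
rng = np.random.default_rng(1)

idx = rng.choice(d, size=d_prime, replace=False)  # sampled indices i_1, ..., i_{d'}
Phi = lambda x: x[idx]                            # one random-subspace projection Φ(x)

subspace_tree = DecisionTreeClassifier(max_features=d_prime)  # fresh d'-subset at each b(x)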

(6)

Random Forest Random Forest Algorithm

Diversifying by Feature Expansion

randomly sample d' features from x: Φ(x) = P · x, with row i of P sampled randomly ∈ natural basis

more powerful features for diversity: row i other than natural basis
• projection (combination) with random row p_i of P: φ_i(x) = p_i^T x
• often consider low-dimensional projection: only d'' non-zero components in p_i
• includes random subspace as special case: d'' = 1 and p_i ∈ natural basis

original RF: consider d' random low-dimensional projections for each b(x) in C&RT

RF = bagging + random-combination C&RT
—randomness everywhere!
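
The random-combination version in code, again only a sketch with illustrative dimensions: each candidate branching direction is a sparse vector p_i with d'' non-zero entries, and the projected feature φ_i(x) = p_i^T x would then be thresholded by a decision stump inside the tree.

import numpy as np

rng = np.random.default_rng(2)
d, d_pp = 20, 3                                   # d'' non-zero components per projection

def random_combination(rng, d, d_pp):
    """One sparse random direction p_i: d'' non-zero entries, the rest zero."""
    p = np.zeros(d)
    nonzero = rng.choice(d, size=d_pp, replace=False)
    p[nonzero] = rng.standard_normal(d_pp)
    return p

p_i = random_combination(rng, d, d_pp)
x = rng.standard_normal(d)
phi_i = p_i @ x                                   # φ_i(x) = p_i^T x; d'' = 1 with a basis row recovers random subspace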

(7)

Random Forest Random Forest Algorithm

Fun Time

Within RF that contains random-combination C&RT trees, which of the following hypotheses is equivalent to each branching function b(x) within the tree?

1 a constant
2 a decision stump
3 a perceptron
4 none of the other choices

Reference Answer: 3

In each b(x), the input vector x is first projected by a random vector v and then thresholded to make a binary decision, which is exactly what a perceptron does.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/22


(9)

Random Forest Out-Of-Bag Estimate

Bagging Revisited

Bagging

function Bag(D, A)
    For t = 1, 2, . . . , T
        1 request size-N' data D̃_t by bootstrapping with D
        2 obtain base g_t by A(D̃_t)
    return G = Uniform({g_t})

              g_1    g_2    g_3    · · ·   g_T
(x_1, y_1)    D̃_1    *      D̃_3            D̃_T
(x_2, y_2)    *      *      D̃_3            D̃_T
(x_3, y_3)    *      D̃_2    *              D̃_T
  · · ·
(x_N, y_N)    D̃_1    D̃_2    *              *

* in t-th column: not used for obtaining g_t
—called out-of-bag (OOB) examples of g_t

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 7/22

(10)

Random Forest Out-Of-Bag Estimate

Number of OOB Examples

OOB (marked *) ⇐⇒ not sampled after N' drawings

if N' = N, the probability for (x_n, y_n) to be OOB for g_t is (1 − 1/N)^N

if N large:

(1 − 1/N)^N = 1 / (N/(N−1))^N = 1 / (1 + 1/(N−1))^N ≈ 1/e

OOB size per g_t ≈ (1/e) · N
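
A quick numeric check of this limit (a minimal sketch that just evaluates the formula for a few N):

import math

for N in (10, 100, 1126, 10**6):
    print(N, (1 - 1/N) ** N)   # probability that a fixed example is OOB for one g_t
print("1/e =", 1 / math.e)     # the limit, so roughly N/e examples are OOB per tree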

(11)

Random Forest Out-Of-Bag Estimate

OOB versus Validation

OOB

              g_1    g_2    g_3    · · ·   g_T
(x_1, y_1)    D̃_1    *      D̃_3            D̃_T
(x_2, y_2)    *      *      D̃_3            D̃_T
(x_3, y_3)    *      D̃_2    *              D̃_T
  · · ·
(x_N, y_N)    D̃_1    *      *              *

Validation

                        g_1       g_2       · · ·   g_M
(training examples)     D_train   D_train           D_train
(validation examples)   D_val     D_val             D_val

• * like D_val: ‘enough’ random examples unused during training
• use * to validate g_t? easy, but rarely needed
• use * to validate G?

E_oob(G) = (1/N) Σ_{n=1}^{N} err(y_n, G_n(x_n)),
with G_n containing only the trees for which x_n is OOB,
such as G_N(x) = average(g_2, g_3, g_T)

E_oob: self-validation of bagging/RF

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/22
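
A minimal sketch of this self-validation formula, assuming the trees and bootstrap indices returned by the random_forest_fit sketch earlier and 0/1 error with labels in {-1, +1}:

import numpy as np

def oob_error(trees, boot_indices, X, y):
    """E_oob(G) = (1/N) * sum_n err(y_n, G_n(x_n)), where G_n votes only with the
    trees whose bootstrap sample did not contain example n."""
    errs = []
    for n in range(len(X)):
        oob_trees = [t for t, idx in zip(trees, boot_indices) if n not in idx]
        if not oob_trees:                        # rare for large T: example is never OOB
            continue
        votes = sum(t.predict(X[n:n+1])[0] for t in oob_trees)
        errs.append(float(np.sign(votes) != y[n]))
    return float(np.mean(errs))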

(12)

Random Forest Out-Of-Bag Estimate

Model Selection by OOB Error

Previously: by Best E_val

g_{m*} = A_{m*}(D)
m* = argmin_{1≤m≤M} E_m,  E_m = E_val(A_m(D_train))

[figure: each H_m is trained on D_train to get g_m, evaluated on D_val to get E_m; pick the best (H_{m*}, E_{m*}) and obtain g_{m*} from the full D]

RF: by Best E_oob

G_{m*} = RF_{m*}(D)
m* = argmin_{1≤m≤M} E_m,  E_m = E_oob(RF_m(D))

• use E_oob for self-validation of RF parameters such as d''
• no re-training needed

E_oob often accurate in practice
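
As an illustrative sketch (candidate values and the labeled data set X, y are assumptions), OOB-based selection is a plain loop over settings with no separate validation set; scikit-learn's RandomForestClassifier exposes the same idea through oob_score_, with max_features playing the role of a per-split randomness parameter:

from sklearn.ensemble import RandomForestClassifier

candidates = [1, 2, 4]                        # hypothetical parameter values to compare
best = max(
    candidates,
    key=lambda m: RandomForestClassifier(
        n_estimators=300, max_features=m, oob_score=True, random_state=0
    ).fit(X, y).oob_score_,                   # oob_score_ is OOB accuracy, i.e. 1 - E_oob
)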

(13)

Random Forest Out-Of-Bag Estimate

Fun Time

For a data set with N = 1126, what is the probability that (x_1126, y_1126) is not sampled after bootstrapping N' = N samples from the data set?

1 0.113
2 0.368
3 0.632
4 0.887

Reference Answer: 2

The value of (1 − 1/N)^N with N = 1126 is about 0.367716, which is close to 1/e ≈ 0.367879.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 11/22


(15)

Random Forest Feature Selection

Feature Selection

for x = (x_1, x_2, . . . , x_d), want to remove
• redundant features: like keeping one of ‘age’ and ‘full birthday’
• irrelevant features: like insurance type for cancer prediction

and only ‘learn’ the subset-transform Φ(x) = (x_{i_1}, x_{i_2}, . . . , x_{i_{d'}}) with d' < d for g(Φ(x))

advantages:
• efficiency: simpler hypothesis and shorter prediction time
• generalization: ‘feature noise’ removed
• interpretability

disadvantages:
• computation: ‘combinatorial’ optimization in training
• overfit: ‘combinatorial’ selection
• mis-interpretability

decision tree: a rare model with built-in feature selection

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 12/22

(16)

Random Forest Feature Selection

Feature Selection by Importance

idea: if possible to calculate importance(i) for i = 1, 2, . . . , d,
then can select i_1, i_2, . . . , i_{d'} of top-d' importance

importance by linear model

score = w^T x = Σ_{i=1}^{d} w_i x_i

• intuitive estimate: importance(i) = |w_i| with some ‘good’ w
• getting ‘good’ w: learned from data
• non-linear models? often much harder

next: ‘easy’ feature selection in RF
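
A tiny illustration of the linear case (X, y and the choice of logistic regression are assumptions, not from the slides): learn a ‘good’ w from data, then rank features by |w_i|.

import numpy as np
from sklearn.linear_model import LogisticRegression

w = LogisticRegression().fit(X, y).coef_[0]   # a 'good' w learned from data
importance = np.abs(w)                        # importance(i) = |w_i|
top_features = np.argsort(importance)[::-1]   # i_1, i_2, ... ordered by importance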

(17)

Random Forest Feature Selection

Feature Importance by Permutation Test

idea: random test
—if feature i is needed, ‘random’ values of x_{n,i} degrade performance

which random values?
• uniform, Gaussian, . . .: P(x_i) changed
• bootstrap, permutation (of {x_{n,i}}_{n=1}^{N}): P(x_i) approximately remains

permutation test:
importance(i) = performance(D) − performance(D^(p)),
with D^(p) being D with {x_{n,i}} replaced by permuted {x_{n,i}}_{n=1}^{N}

permutation test: a general statistical tool for arbitrary non-linear models like RF

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 14/22

(18)

Random Forest Feature Selection

Feature Importance in Original Random Forest

permutation test:
importance(i) = performance(D) − performance(D^(p)),
with D^(p) being D with {x_{n,i}} replaced by permuted {x_{n,i}}_{n=1}^{N}

• performance(D^(p)): needs re-training and validation in general
• ‘escaping’ validation? OOB in RF
• original RF solution: importance(i) = E_oob(G) − E_oob^(p)(G),
  where E_oob^(p) comes from replacing each request of x_{n,i} by a permuted OOB value

RF feature selection via permutation + OOB: often efficient and promising in practice
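
A minimal sketch of the permutation idea, reusing the oob_error helper from before. Note this is the simpler ‘permute column i of D and recompute E_oob’ variant rather than the original RF trick of permuting only the OOB values at prediction time; since E_oob is an error (smaller is better), the sketch reports the increase in OOB error after permutation, which ranks features the same way.

import numpy as np

def permutation_importance(trees, boot_indices, X, y, i, seed=3):
    """Increase in OOB error when column i is replaced by a permuted copy."""
    rng = np.random.default_rng(seed)
    base = oob_error(trees, boot_indices, X, y)
    Xp = X.copy()
    Xp[:, i] = rng.permutation(Xp[:, i])      # permuted {x_{n,i}}: keeps P(x_i) roughly intact
    return oob_error(trees, boot_indices, Xp, y) - base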

(19)

Random Forest Feature Selection

Fun Time

For RF, if the 1126-th feature within the data set is a constant 5566, what would importance(i) be?

1 0
2 1
3 1126
4 5566

Reference Answer: 1

When a feature is a constant, permutation does not change its value. Then, E_oob(G) and E_oob^(p)(G) are the same, and thus importance(i) = 0.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 16/22


(21)

Random Forest Random Forest in Action

A Simple Data Set

[slides 21-31: figure sequence comparing a single C&RT tree g_C&RT, individual random-combination trees g_t (N' = N/2), and the ensemble G built from the first t trees, for increasing t]

‘smooth’ and large-margin-like boundary with many trees

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 17/22

(32)

Random Forest Random Forest in Action

A Complicated Data Set

[slides 32-36: figure sequence of individual trees g_t (N' = N/2) and the ensemble G built from the first t trees, for increasing t]

‘easy yet robust’ nonlinear model

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/22

(37)

Random Forest Random Forest in Action

A Complicated and Noisy Data Set

[slides 37-41: figure sequence of individual trees g_t (N' = N/2) and the ensemble G built from the first t trees, for increasing t]

noise corrected by voting

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 19/22

(42)

Random Forest Random Forest in Action

How Many Trees Needed?

almost every theory: the more, the ‘better’
—assuming a good ḡ = lim_{T→∞} G

Our NTU Experience

KDDCup 2013 Track 1 (yes, NTU is world champion again! :-)): predicting author-paper relation
• E_val of thousands of trees: [0.015, 0.019] depending on seed;
  E_out of top 20 teams: [0.014, 0.019]
• decision: take 12000 trees with seed 1

cons of RF: may need lots of trees if the whole random process is too unstable
—should double-check stability of G to ensure enough trees

(43)

Random Forest Random Forest in Action

Fun Time

Which of the following is not the best use of Random Forest?

1 train each tree with bootstrapped data
2 use E_oob to validate the performance
3 conduct feature selection with permutation test
4 fix the number of trees, T, to the lucky number 1126

Reference Answer: 4

A good value of T can depend on the nature of the data and the stability of the whole random process.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 21/22


(45)

Random Forest Random Forest in Action

Summary

1 Embedding Numerous Features: Kernel Models

2 Combining Predictive Features: Aggregation Models

Lecture 10: Random Forest
• Random Forest Algorithm: bag of trees on randomly projected subspaces
• Out-Of-Bag Estimate: self-validation with OOB examples
• Feature Selection: permutation test for feature importance
• Random Forest in Action: ‘smooth’ boundary with many trees

next: boosted decision trees beyond classification

3 Distilling Implicit Features: Extraction Models

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 22/22
