Machine Learning Techniques (機器學習技法)
Lecture 10: Random Forest
Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models

Lecture 9: Decision Tree
recursive branching (purification) for conditional aggregation of constant hypotheses

Lecture 10: Random Forest
Random Forest Algorithm
Out-Of-Bag Estimate
Feature Selection
Random Forest in Action

3 Distilling Implicit Features: Extraction Models
Recall: Bagging and Decision Tree

Bagging
function Bag(D, A)
  for t = 1, 2, ..., T
    1. request size-N' data D̃_t by bootstrapping with D
    2. obtain base g_t by A(D̃_t)
  return G = Uniform({g_t})
— reduces variance by voting/averaging

Decision Tree
function DTree(D)
  if termination criteria met: return base g_t
  else
    1. learn b(x) and split D to D_c by b(x)
    2. build G_c ← DTree(D_c)
    3. return G(x) = Σ_{c=1}^{C} [[b(x) = c]] G_c(x)
— large variance, especially if fully-grown

putting them together? (i.e. aggregation of aggregation :-))
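For concreteness, a minimal Python sketch of the Bag(D, A) routine above (not from the original slides; numpy, the function names, and ±1 labels are assumptions), with A as any callable that fits a base hypothesis on an (X, y) pair:

import numpy as np

def bag(X, y, base_learner, T=100, n_prime=None, rng=None):
    """Bagging: train T base hypotheses on bootstrapped data, aggregate by uniform vote."""
    rng = np.random.default_rng(rng)
    N = len(X)
    n_prime = N if n_prime is None else n_prime          # size-N' bootstrap sample
    hypotheses = []
    for _ in range(T):
        idx = rng.integers(0, N, size=n_prime)           # bootstrapping with D
        hypotheses.append(base_learner(X[idx], y[idx]))  # g_t = A(D~_t)
    def G(x):
        # uniform vote over {g_t}; binary labels in {-1, +1} assumed
        return np.sign(sum(g(x) for g in hypotheses))
    return G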
Random Forest (RF)

random forest (RF) = bagging + fully-grown C&RT decision tree

function RandomForest(D)
  for t = 1, 2, ..., T
    1. request size-N' data D̃_t by bootstrapping with D
    2. obtain tree g_t by DTree(D̃_t)
  return G = Uniform({g_t})

function DTree(D)
  if termination criteria met: return base g_t
  else
    1. learn b(x) and split D to D_c by b(x)
    2. build G_c ← DTree(D_c)
    3. return G(x) = Σ_{c=1}^{C} [[b(x) = c]] G_c(x)

• highly parallel/efficient to learn
• inherits the pros of C&RT
• eliminates the cons of fully-grown trees
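As a rough sketch (not the lecturer's code), the same idea can be written with a fully-grown tree as the base learner; this assumes scikit-learn is installed and uses its DecisionTreeClassifier as a stand-in for C&RT:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, T=100, rng=None):
    rng = np.random.default_rng(rng)
    N = len(X)
    trees = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)                     # bootstrap D~_t
        tree = DecisionTreeClassifier(max_depth=None)        # fully-grown C&RT-style tree
        trees.append(tree.fit(X[idx], y[idx]))
    def G(X_new):
        votes = np.stack([t.predict(X_new) for t in trees])  # uniform vote over {g_t}
        return np.sign(votes.sum(axis=0))                    # binary labels in {-1, +1} assumed
    return G

Note that each tree can be trained independently of the others, which is exactly why RF is highly parallel/efficient to learn.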
Diversifying by Feature Projection

recall: data randomness for diversity in bagging — randomly sample N' examples from D

another possibility for diversity: randomly sample d' features from x
• when sampling indices i_1, i_2, ..., i_{d'}: Φ(x) = (x_{i_1}, x_{i_2}, ..., x_{i_{d'}})
• Z ∈ R^{d'}: a random subspace of X ∈ R^d
• often d' ≪ d, so this is efficient for large d — and it can be applied to other models in general
• original RF re-samples a new subspace for each b(x) in C&RT

RF = bagging + random-subspace C&RT
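A hedged sketch of the random-subspace step: sample d' of the d feature indices and work with the projected data Φ(x) = (x_{i_1}, ..., x_{i_{d'}}). The function name is illustrative, and numpy is assumed:

import numpy as np

def random_subspace(X, d_prime, rng=None):
    """Project X (N x d) onto d' randomly chosen coordinates of the original space."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(X.shape[1], size=d_prime, replace=False)   # i_1, ..., i_{d'}
    return X[:, idx], idx   # Phi(X) lives in a random subspace Z of dimension d'

Remember that the original RF re-samples such a subspace for every branching decision b(x), not just once per tree.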
Diversifying by Feature Expansion

randomly sampling d' features from x can be written as Φ(x) = P · x, with each row i of P sampled randomly from the natural basis

more powerful features for diversity: allow row i to be other than a natural-basis vector
• projection (combination) with random row p_i of P: φ_i(x) = p_i^T x
• often consider a low-dimensional projection: only d'' non-zero components in p_i
• includes random subspace as a special case: d'' = 1 and p_i ∈ natural basis
• original RF considers d' random low-dimensional projections for each b(x) in C&RT

RF = bagging + random-combination C&RT — randomness everywhere!
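A sketch of the random-combination variant, assuming d' projections each with d'' non-zero components; the sampling scheme below (uniform weights on randomly chosen coordinates) is one reasonable choice for illustration, not necessarily the exact one used in the original RF:

import numpy as np

def random_projections(d, d_prime, d_double_prime, rng=None):
    """Return a d' x d matrix P; each row p_i has only d'' non-zero entries."""
    rng = np.random.default_rng(rng)
    P = np.zeros((d_prime, d))
    for i in range(d_prime):
        nz = rng.choice(d, size=d_double_prime, replace=False)
        P[i, nz] = rng.uniform(-1, 1, size=d_double_prime)   # random low-dimensional combination
    return P   # phi_i(x) = p_i^T x; d'' = 1 with p_i in the natural basis recovers random subspace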
Fun Time
Within RF that contains random-combination C&RT trees, which of the following hypotheses is equivalent to each branching function b(x) within the tree?
1. a constant
2. a decision stump
3. a perceptron
4. none of the other choices

Reference Answer: 3
In each b(x), the input vector x is first projected by a random vector v and then thresholded to make a binary decision, which is exactly what a perceptron does.
Bagging Revisited

Bagging
function Bag(D, A)
  for t = 1, 2, ..., T
    1. request size-N' data D̃_t by bootstrapping with D
    2. obtain base g_t by A(D̃_t)
  return G = Uniform({g_t})

             g_1    g_2    g_3    ...   g_T
(x_1, y_1)   D̃_1    ?      D̃_3    ...   D̃_T
(x_2, y_2)   ?      ?      D̃_3    ...   D̃_T
(x_3, y_3)   ?      D̃_2    ?      ...   D̃_T
...
(x_N, y_N)   D̃_1    D̃_2    ?      ...   ?

? in the t-th column: (x_n, y_n) not used for obtaining g_t — called an out-of-bag (OOB) example of g_t
Number of OOB Examples

OOB (the ? entries) ⟺ not sampled after N' drawings

if N' = N:
• probability for (x_n, y_n) to be OOB for g_t: (1 - 1/N)^N
• if N is large:
  (1 - 1/N)^N = 1 / (N/(N-1))^N = 1 / (1 + 1/(N-1))^N ≈ 1/e

OOB size per g_t ≈ N/e
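A quick numerical check of the (1 - 1/N)^N ≈ 1/e claim, by direct evaluation and by a small bootstrap simulation (the choice of N and the number of trials below are arbitrary; numpy is assumed):

import numpy as np

rng = np.random.default_rng(0)
N = 1126
# exact probability that one fixed example is never drawn in N draws
print((1 - 1/N) ** N)            # ~0.3677, close to 1/e ~ 0.3679
# empirical OOB fraction over a few bootstrap rounds
oob_fracs = []
for _ in range(100):
    drawn = rng.integers(0, N, size=N)
    oob_fracs.append(1 - len(np.unique(drawn)) / N)
print(np.mean(oob_fracs))        # also ~1/e, so roughly N/e examples are OOB per g_t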
OOB versus Validation

OOB
             g_1    g_2    g_3    ...   g_T
(x_1, y_1)   D̃_1    ?      D̃_3    ...   D̃_T
(x_2, y_2)   ?      ?      D̃_3    ...   D̃_T
(x_3, y_3)   ?      D̃_2    ?      ...   D̃_T
...
(x_N, y_N)   D̃_1    ?      ?      ...   ?

Validation
             g_1^-      g_2^-      ...   g_M^-
             D_train    D_train    ...   D_train
             D_val      D_val      ...   D_val
             D_val      D_val      ...   D_val
             D_train    D_train    ...   D_train

• the ? entries are like D_val: 'enough' random examples unused during training
• use ? to validate g_t? easy, but rarely needed
• use ? to validate G?
  E_oob(G) = (1/N) Σ_{n=1}^{N} err(y_n, G_n^-(x_n)),
  where G_n^- contains only the trees that x_n is OOB of, such as G_N^-(x) = average(g_2, g_3, g_T)

E_oob: self-validation of bagging/RF
Model Selection by OOB Error

Previously: by best E_val
g_{m*} = A_{m*}(D), where m* = argmin_{1 ≤ m ≤ M} E_m and E_m = E_val(A_m(D_train))
[figure: each H_m yields a candidate g_m from D_train; E_1, ..., E_M are measured on D_val; pick the best (H_{m*}, E_{m*}) and re-learn g_{m*} on the full D]

RF: by best E_oob
G_{m*} = RF_{m*}(D), where m* = argmin_{1 ≤ m ≤ M} E_m and E_m = E_oob(RF_m(D))
• use E_oob for self-validation of RF parameters such as d''
• no re-training needed

E_oob is often accurate in practice
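In practice E_oob comes almost for free from most RF implementations. For instance (assuming scikit-learn and that the full data X, y is in memory), RandomForestClassifier exposes it as oob_score_, so a parameter such as the number of sub-sampled features can be picked without a separate validation split:

from sklearn.ensemble import RandomForestClassifier

best_m, best_oob = None, -1.0
for max_features in ["sqrt", "log2", 0.5]:          # candidate feature-subsampling parameters
    rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                max_features=max_features, random_state=1)
    rf.fit(X, y)                                    # X, y: the full data set D, no D_train/D_val split
    if rf.oob_score_ > best_oob:                    # oob_score_ is OOB accuracy (= 1 - E_oob), so maximize
        best_m, best_oob = max_features, rf.oob_score_
# best_m is chosen by self-validation; no re-training on a held-out split is needed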
Fun Time
For a data set with N = 1126, what is the probability that (x_1126, y_1126) is not sampled after bootstrapping N' = N samples from the data set?
1. 0.113
2. 0.368
3. 0.632
4. 0.887

Reference Answer: 2
The value of (1 - 1/N)^N with N = 1126 is about 0.367716, which is close to 1/e ≈ 0.367879.
Feature Selection

for x = (x_1, x_2, ..., x_d), want to remove
• redundant features: like keeping only one of 'age' and 'full birthday'
• irrelevant features: like insurance type for cancer prediction

and only 'learn' the subset-transform Φ(x) = (x_{i_1}, x_{i_2}, ..., x_{i_{d'}}) with d' < d for g(Φ(x))

advantages:
• efficiency: simpler hypothesis and shorter prediction time
• generalization: 'feature noise' removed
• interpretability

disadvantages:
• computation: 'combinatorial' optimization in training
• overfit: 'combinatorial' selection
• mis-interpretability

decision tree: a rare model with built-in feature selection
Feature Selection by Importance

idea: if it is possible to calculate importance(i) for i = 1, 2, ..., d,
then we can select the indices i_1, i_2, ..., i_{d'} of top-d' importance

importance by linear model
score = w^T x = Σ_{i=1}^{d} w_i x_i
• intuitive estimate: importance(i) = |w_i| with some 'good' w
• getting a 'good' w: learned from data
• non-linear models? often much harder

next: 'easy' feature selection in RF
Feature Importance by Permutation Test

idea: random test — if feature i is needed, 'random' values of x_{n,i} degrade performance
• which random values?
  • uniform, Gaussian, ...: P(x_i) changed
  • bootstrap, permutation (of {x_{n,i}}_{n=1}^{N}): P(x_i) approximately remains
• permutation test:
  importance(i) = performance(D) - performance(D^(p)),
  where D^(p) is D with {x_{n,i}} replaced by permuted {x_{n,i}}_{n=1}^{N}

permutation test: a general statistical tool for arbitrary non-linear models like RF
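A sketch of the permutation test around an already-trained model, assuming a fitted scikit-learn-style model with a score (accuracy) method and some evaluation data; permuting column i keeps its marginal distribution approximately unchanged while breaking its link to y. The function name is illustrative:

import numpy as np

def perm_importance(model, X_val, y_val, i, rng=None):
    """importance(i) = performance(D) - performance(D^(p)) on the given data."""
    rng = np.random.default_rng(rng)
    base = model.score(X_val, y_val)
    X_perm = X_val.copy()
    X_perm[:, i] = rng.permutation(X_perm[:, i])   # replace {x_{n,i}} by a permutation of itself
    return base - model.score(X_perm, y_val)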
Feature Importance in Original Random Forest

permutation test:
importance(i) = performance(D) - performance(D^(p)),
where D^(p) is D with {x_{n,i}} replaced by permuted {x_{n,i}}_{n=1}^{N}

• performance(D^(p)): needs re-training and validation in general
• 'escaping' validation? OOB in RF
• original RF solution: importance(i) = E_oob(G) - E_oob^(p)(G),
  where E_oob^(p) comes from replacing each request of x_{n,i} by a permuted OOB value

RF feature selection via permutation + OOB: often efficient and promising in practice
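The original RF recipe avoids any separate validation set by permuting feature i only among OOB values. A rough per-tree sketch of that idea (one common reading of the procedure, not guaranteed to match the original implementation exactly; numpy, ±1 or class labels, and the bookkeeping from the earlier OOB sketch are assumed):

import numpy as np

def rf_oob_importance(trees, boot_indices, X, y, i, rng=None):
    """Average drop in OOB performance when feature i is permuted among each tree's OOB examples."""
    rng = np.random.default_rng(rng)
    drops = []
    for tree, idx in zip(trees, boot_indices):
        oob = np.setdiff1d(np.arange(len(X)), idx)          # OOB examples of this tree
        if len(oob) == 0:
            continue
        acc = np.mean(tree.predict(X[oob]) == y[oob])       # performance on OOB data
        X_perm = X[oob].copy()
        X_perm[:, i] = rng.permutation(X_perm[:, i])        # permuted OOB values of feature i
        acc_perm = np.mean(tree.predict(X_perm) == y[oob])
        drops.append(acc - acc_perm)                        # performance(D) - performance(D^(p)) on OOB
    return float(np.mean(drops))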
Fun Time
For RF, if the 1126-th feature within the data set is a constant 5566, what would importance(i) be?
1. 0
2. 1
3. 1126
4. 5566

Reference Answer: 1
When a feature is a constant, permutation does not change its value. Then E_oob(G) and E_oob^(p)(G) are the same, and thus importance(i) = 0.
A Simple Data Set
[figures omitted: g_{C&RT}, g_t (N' = N/2, with random combination), and G with the first t trees as t grows]
'smooth' and large-margin-like boundary with many trees
A Complicated Data Set
[figures omitted: g_t (N' = N/2) and G with the first t trees as t grows]
'easy yet robust' nonlinear model
A Complicated and Noisy Data Set
[figures omitted: g_t (N' = N/2) and G with the first t trees as t grows]
noise corrected by voting
How Many Trees Needed?

almost every theory: the more, the 'better', assuming a good ḡ = lim_{T→∞} G

Our NTU Experience
• KDDCup 2013 Track 1 (yes, NTU is world champion again! :-)): predicting author-paper relation
• E_val of thousands of trees: [0.015, 0.019] depending on the seed; E_out of the top 20 teams: [0.014, 0.019]
• decision: take 12000 trees with seed 1

cons of RF: may need lots of trees if the whole random process is too unstable
— should double-check the stability of G to ensure enough trees (see the sketch below)
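One practical way to double-check the stability of G (assuming scikit-learn and data X, y in memory): grow the forest in stages with warm_start and watch whether the OOB error has flattened out before settling on T.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=1)
for T in [100, 200, 400, 800, 1600]:
    rf.set_params(n_estimators=T)        # warm_start: keep earlier trees, add new ones
    rf.fit(X, y)
    print(T, 1 - rf.oob_score_)          # OOB error; stop increasing T once this curve flattens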
Fun Time
Which of the following is not the best use of Random Forest?
1. train each tree with bootstrapped data
2. use E_oob to validate the performance
3. conduct feature selection with the permutation test
4. fix the number of trees, T, to the lucky number 1126

Reference Answer: 4
A good value of T can depend on the nature of the data and the stability of the whole random process.
Summary
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models

Lecture 10: Random Forest
Random Forest Algorithm: bag of trees on randomly projected subspaces
Out-Of-Bag Estimate: self-validation with OOB examples
Feature Selection: permutation test for feature importance
Random Forest in Action: 'smooth' boundary with many trees
• next: boosted decision trees beyond classification

3 Distilling Implicit Features: Extraction Models