整合學習

(1)

http://www.hmwu.idv.tw

吳漢銘國立政治大學統計學系

整合學習

Ensemble Learning

C05

(2)

本章大綱&學習目標

 Machine Learning

 Resampling methods

 Jackknife (leave-one-out)

 Bootstrapping

 Ensemble Learning

 bagging

 boosting

 Imbalanced Data Problem

 under-sampling

 over-sampling

2/29

Common Machine Learning Algorithms

Linear Regression, Logistic Regression,

Decision Tree, SVM, Naive Bayes, KNN,

K-Means, Random Forest, Dimensionality

Reduction, Boosting

(3)

Difference between

Machine Learning & Statistical Modeling

3/29

https://www.analyticsvidhya.com/blog/2015/07/difference-machine-learning-statistical-modeling/

機器學習和統計棤型的差異

http://vvar.pixnet.net/blog/post/242048881

為什麼統計學家、機器學習專家解決同一問題的方法差別那麼大?

https://read01.com/EBPPK7.html

深度| 機器學習與統計學是互補的嗎？

https://read01.com/ezQ3K.html

• Machine Learning is an algorithm that can learn from data without relying on rules-based programming.

• Statistical Modelling is the formalization of relationships between variables in the

form of mathematical equations.

(4)

Why Ensemble Learning? ^4/29

> prediction.accuracy.rate()

no.classifiers at.least.one.accuracy 1.0 0.5

> t(sapply(1:10, prediction.accuracy.rate)) no.classifiers at.least.one.accuracy [1,] 1 0.5000000 [2,] 2 0.7500000 [3,] 3 0.8750000 [4,] 4 0.9375000 [5,] 5 0.9687500 [6,] 6 0.9843750 [7,] 7 0.9921875 [8,] 8 0.9960938 [9,] 9 0.9980469 [10,] 10 0.9990234

prediction.accuracy.rate <- function(no.classifier=1, accuracy.rate=0.5){

c(no.classifiers=no.classifier,

at.least.one.accuracy=1-(1-accuracy.rate)^no.classifier) }

Training data

classifier 1

classifier 2

classifier k

y

¹

y

²

y

^k

ŷ

testing data y

replicates or

various algorithms

(5)

Why Resampling?



Resampling is any of a variety of methods for:



Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping).



Exchanging labels on data points when performing significance tests (permutation tests, randomization tests)



Validating models by using random subsets (bootstrapping, cross validation)

https://en.wikipedia.org/wiki/Resampling_(statistics)



This single sample method can serve as a mini population, from which repeated small samples are drawn with replacement over and over again.



As well as saving time and money, bootstrapped samples can be

quite good approximations for population parameters.

(6)

Jackknife Resampling ^6/29

the calculated estimator of the parameter based on all n observations the average of these "leave-one-out" estimates

the resulting bias-corrected jackknife estimate

> # install.packages("bootstrap")

> library(bootstrap)

> jackknife

function (x, theta, ...) {

call <- match.call() n <- length(x)

u <- rep(0, n) for (i in 1:n) {

u[i] <- theta(x[-i], ...) }

thetahat <- theta(x, ...)

jack.bias <- (n - 1) * (mean(u) - thetahat)

jack.se <- sqrt(((n - 1)/n) * sum((u - mean(u))^2))

return(list(jack.se = jack.se, jack.bias = jack.bias, jack.values = u, call = call))

}

<environment: namespace:bootstrap>

(7)

Example: Jackknife Estimate the Coefficient of Variation

> set.seed(12345)

> x <- runif(30)

> n <- length(x)

> theta <- CV(x)

> CV <- function(x) sqrt(var(x))/mean(x)

> theta.i <- sapply(1:n, function(i) CV(x[-i]))

> theta.i

[1] 0.5497915 0.5442365 0.5489822 0.5436256 0.5448185 0.5195935 0.5359400 0.5472011 [9] 0.5496842 0.5357489 0.5011942 0.5178517 0.5495427 0.4958063 0.5409312 0.5451245 [17] 0.5407236 0.5416770 0.5211182 0.5390234 0.5446755 0.5360780 0.5378925 0.5499674 [25] 0.5501676 0.5408382 0.5500584 0.5484004 0.5265137 0.5461715

> theta.jack <- n*theta - (n-1)*mean(theta.i)

> theta.jack [1] 0.5356475

> plot(theta.i)

Jackknifing can be useful for analyzing

if influential observations are affecting

our estimates.

(8)

Jackknife the Coefficients of a Linear Regression Model

8/29

> set.seed(12345)

> n <- 50; p <- 5

> mydata <- as.data.frame(matrix(rnorm(n*p), ncol=p))

> head(mydata, 3)

V1 V2 V3 V4 V5 1 0.5855288 -0.54038607 0.2239254 -1.6193283 -1.4361457 2 0.7094660 1.94769266 -1.1562233 0.5483979 -0.6292596 3 -0.1093033 0.05359027 0.4224185 0.1952822 0.2435218

> model.lm <- formula(V1 ~ V2 + V3 + V4)

> theta <- function(x, xdata, coefficient){

+ coef(lm(model.lm, data=xdata[x, ]))[coefficient]

+ }

> results <- jackknife(1:n, theta, xdata=mydata, coefficient="(Intercept)")

> results

$jack.se

[1] 0.1672309

$jack.bias (Intercept) 0.003368696

$jack.values

[1] 0.1412249 0.1570365 0.1723303 0.1703336 0.1529388 0.2038722 0.1620162 0.1754961 ...

[41] 0.1384219 0.2296432 0.1793121 0.1429386 0.1545121 0.1456370 0.2016571 0.1582340 [49] 0.1536307 0.2034109

$call

jackknife(x = 1:n, theta = theta, xdata = mydata, coefficient = "(Intercept)")

(9)

Bootstrap Methods



Bootstrapping is a statistical method for estimating the sampling distribution of an

estimator by sampling with replacement from the original sample, of the same size as the original sample.



The name "bootstrapping" comes from the phrase:

"To lift himself up by his bootstraps".



This refers to something that is preposterous and impossible.



Try as hard as you can, you cannot lift yourself into the air by tugging at pieces of leather on your boots.

Bradley Efron 1938~

Department of Statistics

Stanford University

(10)

Bootstrapping ^10/29

Original Data

.

Bootstrap Sample

Bootstrap Statistics Statistic 1

Statistic 2*

Statistic B*

Bootstrap Distribution sampling

with replacement

Sample

Statistic ..

• Types of bootstrap scheme: Case

resampling, Bayesian bootstrap, Smooth bootstrap, Parametric bootstrap,

Resampling residuals, Gaussian process regression bootstrap, Wild bootstrap, Block bootstrap.

• An empirical bootstrap sample is drawn from observations.

• A parametric bootstrap sample is drawn

from a parameterized distribution (e.g. a

normal distribution).

(11)

Example: Bootstrap Estimate the Coefficient of Variation

> set.seed(12345)

> x <- runif(30)

> CV <- function(x) sqrt(var(x))/mean(x)

> CV(x)

[1] 0.5380304

> CV(sample(x, replace=T)) # a single bootstrap sample [1] 0.5459389

> boot <- replicate(n=100, expr=CV(sample(x, replace=T)))

> boot

[1] 0.5044811 0.5286011 0.4634611 0.5605438 0.4835447 0.5374531 0.4857342 0.4342565 ...

[89] 0.5297020 0.5121274 0.4938053 0.5479498 0.5262306 0.6095145 0.5322045 0.6069263 [97] 0.5374840 0.4921430 0.4674226 0.4573680

> mean(boot) [1] 0.5251909

> var(boot) [1] 0.006107636

> hist(boot)

(12)

bootstrap Package ^12/29

> # install.packages("bootstrap")

> set.seed(12345)

> x <- rnorm(20)

> mean(x)

[1] 0.07651681

> (x.bootstrap.mean <- bootstrap(x, 50, theta=mean))

$thetastar

[1] 0.486197466 -0.160488357 0.274920990 0.398499864 -0.399967845 0.116086370 ...

[43] -0.348643786 0.185330636 -0.070823890 0.057609481 0.062067504 0.043716794 [49] -0.279597885 0.243843620

$func.thetastar NULL

$jack.boot.val NULL

$jack.boot.se NULL

$call

bootstrap(x = x, nboot = 50, theta = mean)

> mean(x.bootstrap.mean$thetastar) [1] 0.08647268

語法:

bootstrap(x, nboot, theta, ..., func=NULL) x:

a vector containing the data.

nboot:

the number of bootstrap samples.

theta:

function to be bootstrapped.

> mu.hat <- mean(x)

> n <- length(x)

> ja <- jackknife(x, mean)

> mu.hat.jack <- n*mu.hat - (n-1)*mean(ja$jack.values)

> # or

> mu.hat.jack <- mu.hat - ja$jack.bias

Bootstrap Estimation of

the Sample Mean

(13)

Bagging:

Bootstrap Aggregating

13/29

http://blog.csdn.net/bymaymay/article/details/77824574

• Breiman, L. (1996). Bagging predictors, Machine Learning, Vol. 26, pp. 123-140.

• Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm, Proceedings of

the Thirteenth International Conference, Machine Learning.

(14)

Boosting ^14/29

http://blog.csdn.net/bymaymay/article/details/77824574

Ad-Boost.M1 (Freund and Schapire, 1996)

(15)

Example: Apply rpart to Vehicle Data

> mfinal <- 10 #Defaults to mfinal=100 iterations

> maxdepth <- 5

> Vehicle.rpart <- rpart(Class ~ ., data = Vehicle.train, maxdepth = maxdepth)

> Vehicle.rpart.pred <- predict(Vehicle.rpart, newdata = Vehicle.test, type = "class")

> (tb <- table(Vehicle.rpart.pred, Observed.Class=Vehicle.test$Class)) Observed.Class

Vehicle.rpart.pred bus opel saab van bus 69 10 6 2 opel 1 25 13 3 saab 1 34 37 8 van 2 7 5 59

> (error.rpart <- 1 - (sum(diag(tb)) / sum(tb))) [1] 0.3262411

> library(rpart); library(mlbench); library(adabag)

> data(Vehicle)

> dim(Vehicle) [1] 846 19

> head(Vehicle)

Comp Circ D.Circ Rad.Ra Pr.Axis.Ra Max.L.Ra Scat.Ra Elong Pr.Axis.Rect Max.L.Rect Sc.Var.Maxis 1 95 48 83 178 72 10 162 42 20 159 176 ...

Sc.Var.maxis Ra.Gyr Skew.Maxis Skew.maxis Kurt.maxis Kurt.Maxis Holl.Ra Class 1 379 184 70 6 16 187 197 van ...

6 957 264 85 5 9 181 183 bus

> table(Vehicle$Class) bus opel saab van

218 212 217 199 > n <- nrow(Vehicle)

> sub <- sample(1:n, 2*n/3)

> Vehicle.train <- Vehicle[sub, ]

> Vehicle.test <- Vehicle[-sub, ]

(16)

adabag : An R Package for Classification with Boosting and Bagging

16/29

> library(adabag)

> Vehicle.adaboost <- boosting(Class ~., data = Vehicle.train, mfinal = mfinal, + control = rpart.control(maxdepth=maxdepth))

> Vehicle.adaboost.pred <- predict.boosting(Vehicle.adaboost, newdata = Vehicle.test)

> Vehicle.adaboost.pred$confusion Observed Class Predicted Class bus opel saab van

bus 69 2 3 1 opel 1 30 16 4 saab 1 38 39 1 van 2 6 3 66

> Vehicle.adaboost.pred$error [1] 0.2765957

> importanceplot(Vehicle.adaboost)

>

> # comparing error evolution in training and test set

> evol.train <- errorevol(Vehicle.adaboost, newdata = Vehicle.train)

> evol.test <- errorevol(Vehicle.adaboost, newdata = Vehicle.test)

> plot.errorevol(evol.test, evol.train)

> sort(Vehicle.adaboost$importance, dec=T)[1:5]

Max.L.Ra Sc.Var.maxis Max.L.Rect 29.623783 11.473254 9.956137

Elong Comp 7.570798 6.656360

Alfaro, E., Gamez, M. and Garcia, N. (2013): "adabag: An R Package for Classification with Boosting and Bagging".

Journal of Statistical Software, 54(2), 1–35.

(17)

Example: 10-fold CV adaboost.M1

> # 10-fold CV adaboost.M1

> Vehicle.boost.cv <- boosting.cv(Class ~., data = Vehicle, v = 10, mfinal = 5, + control = rpart.control(maxdepth=maxdepth)) i: 1 Tue Dec 05 09:36:37 2017

...

> Vehicle.boost.cv$confusion Observed Class Predicted Class bus opel saab van

bus 209 9 11 3 opel 1 101 72 2 saab 0 88 117 6 van 8 14 17 188

> Vehicle.boost.cv$error [1] 0.2730496

> Vehicle.bag.cv <- bagging.cv(Class ~., data = Vehicle, v = 10, mfinal = 5, + control = rpart.control(maxdepth=maxdepth))

> Vehicle.bag.cv$confusion Observed Class Predicted Class bus opel saab van

bus 208 20 24 2 opel 1 91 53 1 saab 1 81 116 5 van 8 20 24 191

> Vehicle.bag.cv$error [1] 0.2836879

# first tree

t1 <- adaboost$trees[[1]]

library(tree) plot(t1)

text(t1, pretty=0)

(18)

The Imbalanced Data Problem



A dataset is said to be unbalanced when the class of interest (minority class) is much rarer than normal behaviour (majority class).



The cost of missing a minority class is typically much higher that missing a majority class. Most learning systems are not prepared to cope with unbalanced data and several techniques have been proposed.



Example: 5% of the target class represents fraudulent transactions, 95% of the target class represents legitimate transactions.

18/29

http://www.srutisj.in/blog/research/statisticalmodeling/balancing-techniques-for-unbalanced-datasets-in-python-r/

(19)

unbalanced

Racing for Unbalanced Methods Selection

Re-balance or remove noisy instances in unbalanced datasets.

ubBalance {unbalanced}

Usage

ubBalance(X, Y, type="ubSMOTE", positive=1, percOver=200, percUnder=200,

k=5, perc=50, method="percPos", w=NULL, verbose=FALSE)

Arguments

X: the input variables of the unbalanced dataset.

Y: the response variable of the unbalanced dataset.

type: the balancing technique to use (ubOver, ubUnder, ubSMOTE, ubOSS, ubCNN, ubENN, ubNCL, ubTomek).

positive: the majority class of the response variable.

percOver: parameter used in ubSMOTE percUnder: parameter used in ubSMOTE

k: parameter used in ubOver, ubSMOTE, ubCNN, ubENN, ubNCL perc: parameter used in ubUnder

method: parameter used in ubUnder w: parameter used in ubUnder

verbose: print extra information (TRUE/FALSE)

ubSMOTE {unbalanced}: synthetic minority over-sampling technique

Usage

ubSMOTE(X, Y, perc.over = 200, k = 5, perc.under = 200, verbose = TRUE)

NOTE: ^imbalance : Preprocessing Algorithms for Imbalanced Datasets, Imbalanced

Classification in R: ROSE (Random Over Sampling Examples) and DMwR (Data Mining with R).

(20)

The Balancing Technique



ubOver : replicates randomly some instances from the minority class in order to obtain a final dataset with the same number of instances from the two classes.



ubUnder : removes randomly some instances from the majority

(negative) class and keeps all instances in the minority (positive) class in order to obtain a more balanced dataset.



ubCNN : Condensed Nearest Neighbor selects the subset of instances that are able to correctly classifying the original datasets using a one- nearest neighbor rule.



ubENN : Edited Nearest Neighbor removes any example whose class label differs from the class of at least two of its three nearest neighbors.



ubNCL : Neighborhood Cleaning Rule modifies the Edited Nearest Neighbor method by increasing the role of data cleaning.



Firstly, NCL removes negatives examples which are misclassified by their 3- nearest neighbors.



Secondly, the neighbors of each positive examples are found and the ones belonging to the majority class are removed.

20/29

(21)

The Balancing Technique

 ubTomek : finds the points in the dataset that are tomek link using 1-NN and then removes only majority class instances that are tomek links.

 ubOSS : One Side Selection is an undersampling method resulting from the application of Tomek links followed by the application of Condensed Nearest Neighbor.

x's nearest neighbor is y y's nearest neighbor is x x and y are different classes

http://contrib.scikit-learn.org/imbalanced-learn/stable/auto_examples/undersampling/plot_illustration_tomek_links.html

(22)

The Balancing Technique

 ubSMOTE : synthetic minority over-sampling technique

generates new examples by filling empty areas among the positive instances

 ubRacing : the Racing algorithm for selecting the best technique to re-balance or remove noisy instances in unbalanced datasets.

22/29

N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique, Journal Of Artificial Intelligence Research, Volume 16, pages 321-357, 2002.(由 NV Chawla 著作 - 2002 - 被引用 5161 次)

(23)

Ionosphere dataset

ubIonosphere {unbalanced}



The datasets is a modification of Ionosphere dataset contained in "mlbench" package.

23/29

> # install.packages("unbalanced")

> library(unbalanced)

> p <- ncol(ubIonosphere)

> y <- ubIonosphere$Class

> x <- ubIonosphere[ ,-p]

> data <- ubBalance(X=x, Y=y, type="ubOver", k=0)

> overData <- data.frame(data$X, Class=data$Y)

> table(overData$Class) 0 1

225 225

> data <- ubBalance(X=x, Y=y, type="ubUnder", perc=50, method="percPos")

> underData <- data.frame(data$X, Class=data$Y)

> table(underData$Class) 0 1

126 126

> bdata <- ubBalance(X=x, Y=y, type="ubSMOTE", percOver=300, percUnder=150, verbose=TRUE) Proportion of positives after ubSMOTE : 47.06 % of 1071 observations

> str(bdata) List of 3

$ X :'data.frame': 1071 obs. of 32 variables:

..$ V3 : num [1:1071] -0.787 1 1 0.5 1 ...

...

..$ V34: num [1:1071] -0.576 0.714 -0.243 0.174 -0.892 ...

$ Y : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 2 ...

$ id.rm: logi NA

> table(bdata$Y) 0 1

567 504

> data(ubIonosphere)

> dim(ubIonosphere) [1] 351 33

> head(ubIonosphere)

V3 V4 V34 Class 1 0.99539 -0.05889 ... -0.45300 0 ...

6 0.02337 -0.00592 ... 0.12011 1

> table(ubIonosphere$Class) 0 1

225 126

K=0: sample with replacement from the minority class until we have the same number of instances in each class.

If K>0: sample with replacement from the minority class until we have k-times the orginal number of minority instances per.over/100: number of new instances generated for each rare instance

perc.under/100: number of "normal" (majority class) instances that are randomly selected for each smoted observation.

(24)

Compare the Performances using SVM ^24/29

> set.seed(12345)

> n <- nrow(ubIonosphere) # 351

> no.train <- floor(0.5*n) # 175, keep half for training and half for testing

> id <- sample(1:n, no.train)

> x.train <- x[id, ] # 175 x 32

> y.train <- y[id]

> x.test <- x[-id, ] # 176 32

> y.test <- y[-id]

>

> library(e1071)

> model1 <- svm(x.train, y.train)

> y.pred1 <- predict(model1, x.test)

> table(y.pred1, y.test) y.test

y.pred1 0 1 0 113 10 1 4 49

>

> # rebalance the training set before building a model

> balancedData <- ubBalance(X=x.train, Y=y.train, type="ubSMOTE", percOver=200, percUnder=150)

> table(balancedData$Y) 0 1

201 201 > model2 <- svm(balancedData$X, balancedData$Y)

> y.pred2 <- predict(model2, x.test)

> table(y.pred2, y.test) y.test

y.pred2 0 1 0 112 8 1 5 51

(25)

ubRacing {unbalanced}

Racing for Strategy Selection

> set.seed(1234)

> # load(url("http://www.ulb.ac.be/di/map/adalpozz/data/creditcard.Rdata"))

> load("creditcard.Rdata")

> str(creditcard)

'data.frame': 284807 obs. of 31 variables:

$ Time : num 0 0 1 1 2 2 4 7 7 9 ...

$ V1 : num -1.36 1.192 -1.358 -0.966 -1.158 ...

...

$ V28 : num -0.0211 0.0147 -0.0598 0.0615 0.2152 ...

$ Amount: num 149.62 2.69 378.66 123.5 69.99 ...

$ Class : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

> table(creditcard$Class) 0 1

284315 492

> # configuration of the sampling method used in the race

> ubConf <- list(percOver=200, percUnder=200, k=2, perc=50, method="percPos", w=NULL)

> # Race with 5 trees in the Random Forest

> results <- ubRacing(Class ~., creditcard, "randomForest",

+ positive=1, metric="auc", ubConf=ubConf, ntree=5)

Selecting the best technique to re-balance or remove noisy instances in unbalanced datasets.

ubRacing(formula, data, algo, positive=1, ncore=1, nFold=10, maxFold=10, maxExp=100, stat.test="friedman", metric="f1", ubConf, verbose=FALSE, ...)

Arguments

algo: the classification algorithm to use with the mlr package.

positive : label of the positive (minority) class.

ncore : the number of core (parallel execution) to use in the Race.

The function ubRacing

compares the 8 unbalanced

methods (ubUnder, ubOver,

ubSMOTE, ubOSS, ubCNN,

ubENN, ubNCL, ubTomek)

against the unbalanced

distribution.

(26)

Racing for Strategy Selection ^26/29

Racing for unbalanced methods selection in 10 fold CV

Number of candidates...9 Max number of folds in the CV...10 Max number of experiments...100 Statistical test...Friedman test

Markers:

x No test is performed.

- The test is performed and some candidates are discarded.

= The test is performed but no candidate is discarded.

+-+---+---+---+---+---+

+-+---+---+---+---+---+

|x| 1| 9| 4| 0.9543| 9|

|=| 2| 9| 3| 0.9433| 18|

|-| 3| 3| 4| 0.9567| 27|

|-| 4| 2| 4| 0.9566| 30|

|=| 5| 2| 4| 0.9582| 32|

|=| 6| 2| 4| 0.9546| 34|

|=| 7| 2| 4| 0.9531| 36|

|=| 8| 2| 4| 0.9539| 38|

|=| 9| 2| 4| 0.9531| 40|

|=| 10| 2| 4| 0.9529| 42|

+-+---+---+---+---+---+

Selected candidate: ubSMOTE metric: auc mean value: 0.9529

(27)

Racing for Strategy Selection

> results

$best

[1] "ubSMOTE"

$avg

[1] 0.9529177

$sd

[1] 0.009049014

$N.test [1] 42

$Gain [1] 53

$Race

unbal ubOver ubUnder ubSMOTE ubOSS ubCNN ubENN ubNCL ubTomek [1,] 0.8844582 0.9138946 0.9354739 0.9543104 0.8957273 0.9139340 0.9024656 0.9014143 0.9048642 [2,] 0.9116642 0.9104928 0.9511485 0.9507221 0.9037491 0.9104840 0.9139047 0.9094542 0.9105558 [3,] 0.8979478 0.9013642 0.9502417 0.9649361 0.9092505 0.9081796 0.9103668 0.9036617 0.9058917 [4,] NA NA 0.9503782 0.9564226 NA NA 0.8999928 NA NA [5,] NA NA 0.9537802 0.9647722 NA NA NA NA NA [6,] NA NA 0.9494913 0.9362763 NA NA NA NA NA [7,] NA NA 0.9411979 0.9440379 NA NA NA NA NA [8,] NA NA 0.9576971 0.9594249 NA NA NA NA NA [9,] NA NA 0.9530119 0.9473722 NA NA NA NA NA [10,] NA NA 0.9633438 0.9509024 NA NA NA NA NA

> # Race using 4 cores and 500 trees (default)

> results <- ubRacing(Class ~., creditcard, "randomForest",

positive=1, metric="auc", ubConf=ubConf, ncore=4)

> library(e1071)

> results <- ubRacing(Class ~., creditcard, "svm", positive=1, ubConf=ubConf)

> library(rpart)

> results <- ubRacing(Class ~., creditcard, "rpart", positive=1, ubConf=ubConf)

(28)

Useful R Packages ^28/29

mlr: Machine Learning in R

https://cran.r-project.org/web/packages/mlr/vignettes/mlr.html

DMwR: Functions and data for "Data Mining with R"

https://cran.r-project.org/web/packages/DMwR/index.html

XGBoost: eXtreme Gradient Boosting

(used for supervised learning tasks such as Regression, Classification, and Ranking)

https://github.com/dmlc/xgboost

http://xgboost.readthedocs.io/en/latest/

How to use XGBoost algorithm in R in easy steps

https://www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/

Kaggle 神器 XGBoost 入門：為什麼要用它?怎麼用?

https://weiwenku.net/d/100778240

imbalance

: Preprocessing Algorithms for Imbalanced Datasets https://cran.r-project.org/web/packages/imbalance/index.html Working with imbalanced datasets

https://cran.r-project.org/web/packages/imbalance/vignettes/imbalance.pdf

(29)

Reference



Chawla, Nitesh V. (2010) Data Mining for Imbalanced Datasets: An Overview, In: Maimon, Oded; Rokach, Lior (Eds) Data Mining and

Knowledge Discovery Handbook, Springer ISBN 978-0-387-09823-4 (pages 875–886)



Dealing with unbalanced data in machine learning

https://shiring.github.io/machine_learning/2017/04/02/unbalanced



Dealing with imbalanced data: undersampling, oversampling and proper cross-validation

https://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling- oversampling-and-proper-cross-validation



Practical Guide to deal with Imbalanced Classification Problems in R

https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced- classification-problems/



整合學習

http://www.hmwu.idv.tw

吳漢銘 國立政治大學 統計學系

整合學習

Ensemble Learning

C05

本章大綱&學習目標

 Machine Learning

 Resampling methods

 Jackknife (leave-one-out)

 Bootstrapping

 Ensemble Learning

 bagging

 boosting

 Imbalanced Data Problem

 under-sampling

 over-sampling

2/29

Common Machine Learning Algorithms

Linear Regression, Logistic Regression,

Decision Tree, SVM, Naive Bayes, KNN,

K-Means, Random Forest, Dimensionality

Reduction, Boosting

Difference between

Machine Learning & Statistical Modeling

3/29

https://www.analyticsvidhya.com/blog/2015/07/difference-machine-learning-statistical-modeling/

• Machine Learning is an algorithm that can learn from data without relying on rules-based programming.

• Statistical Modelling is the formalization of relationships between variables in the

form of mathematical equations.

Why Ensemble Learning? 4/29

Training data

y

y

y

ŷ

testing data y

replicates or

various algorithms

Why Resampling?

Resampling is any of a variety of methods for:

Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping).

Exchanging labels on data points when performing significance tests (permutation tests, randomization tests)

Validating models by using random subsets (bootstrapping, cross validation)

https://en.wikipedia.org/wiki/Resampling_(statistics)

This single sample method can serve as a mini population, from which repeated small samples are drawn with replacement over and over again.

As well as saving time and money, bootstrapped samples can be

quite good approximations for population parameters.

Jackknife Resampling 6/29

the calculated estimator of the parameter based on all n observations the average of these "leave-one-out" estimates

the resulting bias-corrected jackknife estimate

Example: Jackknife Estimate the Coefficient of Variation

Jackknifing can be useful for analyzing

if influential observations are affecting

our estimates.

Jackknife the Coefficients of a Linear Regression Model

8/29

Bootstrap Methods

Bootstrapping is a statistical method for estimating the sampling distribution of an

estimator by sampling with replacement from the original sample, of the same size as the original sample.

The name "bootstrapping" comes from the phrase:

"To lift himself up by his bootstraps".

This refers to something that is preposterous and impossible.

Try as hard as you can, you cannot lift yourself into the air by tugging at pieces of leather on your boots.

Bradley Efron 1938~

Department of Statistics

Stanford University

Bootstrapping 10/29

.

Bootstrap Sample

Bootstrap Statistics Statistic 1

Statistic 2*

Statistic B*

Bootstrap Distribution sampling

with replacement

Sample

Statistic ..

• Types of bootstrap scheme: Case

resampling, Bayesian bootstrap, Smooth bootstrap, Parametric bootstrap,

Resampling residuals, Gaussian process regression bootstrap, Wild bootstrap, Block bootstrap.

吳漢銘國立政治大學統計學系

Why Ensemble Learning? ^4/29

Jackknife Resampling ^6/29

Bootstrapping ^10/29

bootstrap Package ^12/29

Boosting ^14/29

NOTE: ^imbalance : Preprocessing Algorithms for Imbalanced Datasets, Imbalanced

Compare the Performances using SVM ^24/29