
(1)

http://www.hmwu.idv.tw

Han-Ming Wu (吳漢銘)

Department of Statistics, National Chengchi University

Classification

C03

(2)

Chapter Outline & Learning Objectives

• Introduction
• Classification of Subjects or Samples (Supervised Learning)
• Performance Measures
• Methods
  • K-Nearest Neighbors (KNN)
  • Linear/Quadratic Discriminant Analysis (LDA/QDA)
  • Classification Tree (Decision Tree)
  • Random Forest
  • Support Vector Machine (SVM)
  • Artificial Neural Network (ANN)
  • Ensemble Learning: http://www.hmwu.idv.tw/web/R/C05-hmwu_R-EnsembleLearning.pdf
  • XGBoost: eXtreme Gradient Boosting

2/143

(3)

R Packages

CRAN Task View: Machine Learning & Statistical Learning
http://cran.r-project.org/web/views/MachineLearning.html

Topics: Neural Networks, Recursive Partitioning, Random Forests, Regularized and Shrinkage Methods, Boosting, Support Vector Machines and Kernel Methods, Bayesian Methods, Optimization using Genetic Algorithms, Association Rules, Fuzzy Rule-based Systems, Model Selection and Validation, Meta Packages, Elements of Statistical Learning.

knn (k-nearest neighbor classification)
• class: Functions for Classification

lda (linear discriminant analysis)
• MASS: Support Functions and Datasets for Venables and Ripley's MASS

Decision Tree
• C50: C5.0 Decision Trees and Rule-Based Models
• rpart: Recursive Partitioning and Regression Trees
• party: A Laboratory for Recursive Partytioning
• caret: Classification and Regression Training

SVM (support vector machine)
• e1071: Misc Functions of the Department of Statistics (e1071), TU Wien

Package pages: http://cran.r-project.org/web/packages/package-name

3/143

(4)

What is Classification?

• Classification
  • Clustering (unsupervised learning)
  • Discriminant analysis (supervised learning, classification)
• Discriminant Analysis
  • It focuses on situations where the different groups (clusters) are known a priori.
  • Decision rules are provided for classifying a multivariate observation into one of the known groups.
• Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

4/143

(5)

Class Prediction Analysis

• Class prediction analysis is designed to predict the value, or "class", of an individual parameter in an uncharacterized sample or set of samples.
• Apply classification to microarray data
  • Predict cancer types using genomic expression profiling.
  • Predict the class/phenotype/parameter of a sample.
  • Identify genes that discriminate well among classes.
  • Identify samples that could be potential outliers.
• Examples of classification tasks
  • Classifying credit card transactions as legitimate or fraudulent.
  • Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil.
  • Categorizing news stories as finance, weather, entertainment, sports, etc.

Fernandez-Delgado, M., Cernadas, E., Barro, S. & Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15, 3133–3181 (2014). [cited 1883 times] (179 classifiers on 121 data sets)

5/143

(6)

Classification of Genes, Tissues or Samples

6/143

(7)

n-fold Cross-Validation Error Rates

7/143

(8)

Split Data into Training Set and Test Set

> # Method 1: assign each row to set 1 or 2 at random (split sizes vary from run to run)
> id <- sample(2, nrow(iris), replace = TRUE, prob = c(0.9, 0.1))
> id
[1] 1 1 2 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
[38] 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
[112] 1 1 1 1 1 1 1 2 1 2 1 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 2 2 2 1 1
[149] 1 1
> train.data <- iris[id == 1, ]
> dim(train.data)
[1] 131 5
> test.data <- iris[id == 2, ]
> dim(test.data)
[1] 19 5

> # Method 2: draw exactly 90% of the row indices for training
> id <- sample(nrow(iris), floor(nrow(iris) * 0.9))
> id
[1] 39 27 96 33 4 98 12 3 32 48 2 22 18 24 126 93 140 85 110 60
[21] 62 91 131 35 134 143 29 108 114 50 19 43 45 66 36 90 105 76 127 92
[41] 68 57 65 147 69 41 130 82 31 20 51 17 149 61 107 70 139 5 115 72
[61] 78 118 117 38 15 74 120 111 106 11 104 67 13 21 133 42 87 121 122 40
[81] 84 135 123 77 83 97 52 116 55 88 142 16 7 49 125 112 34 10 56 26
[101] 99 63 37 46 144 9 141 59 138 80 101 132 129 113 73 30 44 136 119 79
[121] 95 64 109 148 28 14 86 150 137 81 94 75 128 102 124
> train.data <- iris[id, ]
> dim(train.data)
[1] 135 5
> test.data <- iris[-id, ]
> dim(test.data)
[1] 15 5

8/143

(9)

Split Data into Training Set and Test Set

splitdf <- function(df, train.ratio, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(df)
  id <- sample(index, trunc(length(index) * train.ratio))
  train <- df[id, ]
  test <- df[-id, ]
  list(trainset = train, testset = test)
}

> splits <- splitdf(iris, 0.9, 12345)
> lapply(splits, dim)
$trainset
[1] 135 5

$testset
[1] 15 5

> iris.training <- splits$trainset
> iris.testing <- splits$testset

library(dplyr)
iris.train <- sample_frac(iris, 0.9)
id <- as.numeric(rownames(iris.train))
iris.test <- iris[-id, ]

https://cran.r-project.org/web/packages/dataPreparation/vignettes/train_test_prep.html

9/143

(10)

Split Data into Test and Training Set According to Group Labels

> library(caTools)
> Y <- iris[,5] # extract labels from the data
> msk <- sample.split(Y, SplitRatio=4/5)
> msk
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE ...
[144] TRUE TRUE TRUE FALSE TRUE TRUE FALSE
> table(Y, msk)
            msk
Y            FALSE TRUE
  setosa        10   40
  versicolor    10   40
  virginica     10   40
> iris.train <- iris[msk, ]
> iris.test <- iris[!msk, ]
> dim(iris.train)
[1] 120 5
> dim(iris.test)
[1] 30 5

require(caTools)
set.seed(12345)
id <- sample.split(1:nrow(iris), SplitRatio = 0.90)
iris.train <- subset(iris, id == TRUE)
iris.test <- subset(iris, id == FALSE)

library(caret)
id <- createDataPartition(y=iris$Species, p=0.9, list=FALSE)
iris.train <- iris[id, ]
iris.test <- iris[-id, ]

> library(caret)
> createFolds(iris$Species, k=3)
$Fold1
[1] 2 8 15 22 25 27 30 ...

$Fold2
[1] 5 6 9 10 11 12 17 ...

$Fold3
[1] 1 3 4 7 13 14 16 20...
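Folds like these are what an n-fold cross-validation error rate is computed from. The following is a minimal sketch, assuming base R plus the recommended MASS package; the helper name `cv_error`, the stratified fold assignment, and the choice of `lda()` as the classifier are illustrative assumptions, not part of the original slides.

```r
library(MASS)  # lda(); MASS is a recommended package shipped with R

# Hypothetical helper: stratified k-fold cross-validation error for LDA.
cv_error <- function(data, label, k = 5, seed = 1) {
  set.seed(seed)
  y <- data[[label]]
  # Assign fold ids within each class so every fold keeps the class proportions.
  fold <- integer(nrow(data))
  for (cl in levels(y)) {
    idx <- which(y == cl)
    fold[idx] <- sample(rep_len(1:k, length(idx)))
  }
  form <- as.formula(paste(label, "~ ."))
  errs <- sapply(1:k, function(i) {
    fit <- lda(form, data = data[fold != i, ])      # train on k - 1 folds
    pred <- predict(fit, data[fold == i, ])$class   # predict the held-out fold
    mean(pred != y[fold == i])                      # fold misclassification rate
  })
  list(fold.errors = errs, cv.error = mean(errs))
}

res <- cv_error(iris, "Species", k = 5)
res$fold.errors
res$cv.error
```

Any classifier with a fit/predict pair could be substituted for `lda()` inside the loop; the fold bookkeeping stays the same.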

10/143

(11)

Performance Measures

Binary classifier output and the resulting confusion-matrix cells (TP, TN, FP, FN):

True Y   Predicted Y   Outcome
0        0             TN
0        0             TN
0        1             FP
1        1             TP
1        0             FN
1        1             TP
...      ...

11/143

(12)

Confusion Matrix for Binary Response

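Confusion matrix (Total: N); rows are the actual response, columns are the predicted response.

                      Predicted: True (1)                  Predicted: False (0)
Actual: True (1)      True Positive (TP)                   False Negative (FN), Type II Error
Actual: False (0)     False Positive (FP), Type I Error    True Negative (TN)

Rates conditional on the actual response:

• TPR (Recall, Sensitivity) $= \frac{TP}{TP + FN}$
• FNR $= \frac{FN}{TP + FN}$
• FPR $= \frac{FP}{FP + TN}$
• TNR (Specificity) $= \frac{TN}{FP + TN}$

Rates conditional on the predicted response:

• PPV (Precision) $= \frac{TP}{TP + FP}$
• FDR $= \frac{FP}{TP + FP}$
• FOR $= \frac{FN}{FN + TN}$
• NPV $= \frac{TN}{FN + TN}$

Overall measures:

• Prevalence $= \frac{TP + FN}{N}$
• Accuracy $= \frac{TP + TN}{N}$
• F1 Score $= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
• Misclassification rate $= \frac{FP + FN}{N}$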

https://en.wikipedia.org/wiki/Receiver_operating_characteristic
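The measures above can be computed directly from a 2x2 table. A minimal base-R sketch follows; the six toy observations and the variable names are assumptions chosen for illustration (1 = positive class).

```r
# Toy data (assumed for illustration): true and predicted labels, 1 = positive.
truth <- c(0, 0, 0, 1, 1, 1)
pred  <- c(0, 0, 1, 1, 0, 1)

# Fix the level order so the table layout does not depend on the data.
ct <- table(Actual    = factor(truth, levels = c(1, 0)),
            Predicted = factor(pred,  levels = c(1, 0)))
TP <- ct["1", "1"]; FN <- ct["1", "0"]
FP <- ct["0", "1"]; TN <- ct["0", "0"]
N  <- sum(ct)

measures <- c(
  accuracy    = (TP + TN) / N,
  error.rate  = (FP + FN) / N,               # misclassification rate
  sensitivity = TP / (TP + FN),              # TPR, recall
  specificity = TN / (FP + TN),              # TNR
  precision   = TP / (TP + FP),              # PPV
  F1          = 2 * TP / (2 * TP + FP + FN)  # harmonic mean of precision and recall
)
ct
round(measures, 3)
```

Fixing the factor levels up front avoids a common pitfall: `table()` silently drops or reorders cells when a class is missing from the predictions.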

12/143

(13)

Receiver Operating Characteristic (ROC) Curve

http://en.wikipedia.org/wiki/File:Roc.png

• The diagonal line is the baseline on which a random prediction would lie.
• The closer a point is to the upper-left corner of the plot, the better the prediction.

Empirical (estimated) ROC curve: plot $\mathrm{TPR} = \frac{TP}{TP + FN}$ against $\mathrm{FPR} = \frac{FP}{FP + TN}$ as the cutoff varies. AUC: Area Under the Curve.
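To make the construction concrete, an empirical ROC curve can be traced by sweeping a cutoff over the scores and recording (FPR, TPR) at each step. This is a base-R sketch under stated assumptions: the helper `roc_points`, the toy scores, and the trapezoidal AUC are illustrative and are not the implementation used by any ROC package.

```r
# Hypothetical helper: empirical ROC points and trapezoidal AUC.
# score: numeric classifier scores; label: 0/1 with 1 = positive.
roc_points <- function(score, label) {
  cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE))
  # An observation is predicted positive when score >= cutoff.
  tpr <- sapply(cutoffs, function(t) sum(score >= t & label == 1) / sum(label == 1))
  fpr <- sapply(cutoffs, function(t) sum(score >= t & label == 0) / sum(label == 0))
  # Area under the piecewise-linear curve (trapezoidal rule).
  auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
  list(cutoffs = cutoffs, fpr = fpr, tpr = tpr, auc = auc)
}

score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
label <- c(1,   1,   0,   1,   0,    1,   0,   0)
roc <- roc_points(score, label)

plot(roc$fpr, roc$tpr, type = "s", xlab = "False positive rate",
     ylab = "True positive rate", main = "Empirical ROC curve")
abline(a = 0, b = 1, lty = 2)  # baseline of a random classifier
roc$auc
```

With no tied scores, this AUC equals the fraction of (positive, negative) pairs in which the positive observation receives the higher score.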

13/143

(14)

Some R Packages for ROC Curves



• ROCR [2005]: Visualizing the Performance of Scoring Classifiers
  • Sing T, Sander O, Beerenwinkel N, Lengauer T. (2005) ROCR: visualizing classifier performance in R. Bioinformatics 21(20):3940-1.
  • https://ipa-tys.github.io/ROCR/
• pROC [2010]: Display and Analyze ROC Curves [Multi-class AUC]
  • Robin, X., Turck, N., Hainard, A. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).
• PRROC [2014]*: Precision-Recall and ROC Curves for Weighted and Unweighted Data
• plotROC [2014]*: Generate Useful ROC Curve Charts for Print and Interactive Use
  • Example: https://mlr.mlr-org.com/articles/tutorial/roc_analysis.html
• precrec [2015]*: Calculate Accurate Precision-Recall and ROC (Receiver Operator Characteristics) Curves [Multiple models and multiple test sets]
• multiROC [2018]: Calculating and Visualizing ROC and PR Curves Across Multi-Class Classifications
• ROCit [2019]*: Performance Assessment of Binary Classifier with Visualization

* Vignettes

NOTE: ROC curves are typically used for binary classification rather than for multiclass classification problems.

For multi-class ROC/AUC:

• Fieldsend, Jonathan & Everson, Richard. (2005). Formulation and comparison of multi-class ROC surfaces.
• Hand, D.J., Till, R.J. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning 45, 171–186 (2001). [cited 1639 times]
• How to plot ROC curves in multiclass classification? https://stats.stackexchange.com/questions/2151/how-to-plot-roc-curves-in-multiclass-classification/2155#2155

14/143

(15)

ROCR : Visualizing the Performance of Scoring Classifiers

> library(ROCR) # ROCR supports only binary classification

> data(ROCR.simple) # n = 200

> ROCR.simple

$predictions

[1] 0.612547843 0.364270971 0.432136142 0.140291078 0.384895941 0.244415489 0.970641299 ...

[197] 0.858970526 0.383807972 0.606960209 0.138387070

$labels

[1] 1 1 0 0 0 1 1 1 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 ...

[173] 0 1 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 1 0 1 0

> pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)

> pred

An object of class "prediction"

Slot "predictions":

[[1]]

[1] 0.612547843 0.364270971 0.432136142 0.140291078 0.384895941 0.244415489 0.970641299 ...

[197] 0.858970526 0.383807972 0.606960209 0.138387070

Slot "labels":

[[1]]

[1] 1 1 0 0 0 1 1 1 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 ...

[173] 0 1 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 1 0 1 0

Levels: 0 < 1

15/143

(16)

ROCR : Visualizing the Performance of Scoring Classifiers

Slot "cutoffs":

[[1]]

[1] Inf 0.991096434 0.984667270 0.984599159 0.983494405 0.970641299 0.959417835 ...

Slot "fp": # a vector of the number of false positives induced by the cutoffs

[[1]]

[1] 0 0 0 0 1 1 2 3 3 3 3 3 3 3 4 4 4 4 4 ...

Slot "tp":

[[1]]

[1] 0 1 2 3 3 4 4 4 5 6 7 8 9 10 10 11 12 13 14 15 16 17 17 18 19 ...

Slot "tn":

[[1]]

[1] 107 107 107 107 106 106 105 104 104 104 104 104 104 104 103 103 103 103 103 ...

Slot "fn":

[[1]]

[1] 93 92 91 90 90 89 89 89 88 87 86 85 84 83 83 82 81 80 79 78 77 76 76 75 74 ...

Slot "n.pos": # contains the number of positive samples in the given x-validation run

[[1]]

[1] 93

Slot "n.neg":

[[1]]

[1] 107

Slot "n.pos.pred":

[[1]]

[1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...

Slot "n.neg.pred":

[[1]]

[1] 200 199 198 197 196 195 194 193 192 191 190 189 188 187 186 185 184 183 182 18 ...

For each cutoff, the prediction object stores the 2x2 contingency table consisting of tp, tn, fp, and fn, along with the marginal sums n.pos, n.neg, n.pos.pred, and n.neg.pred.

16/143

(17)

ROCR : Visualizing the Performance of Scoring Classifiers

> perf <- performance(pred, measure = "tpr", x.measure = "fpr") # predictor evaluation

> perf

An object of class "performance"

Slot "x.name":

[1] "False positive rate"

Slot "y.name":

[1] "True positive rate"

Slot "alpha.name":

[1] "Cutoff"

Slot "x.values":

[[1]]

[1] 0.000000000 0.000000000 0.000000000 0.000000000 0.009345794 0.009345794 0.018691589 ...

Slot "y.values":

[[1]]

[1] 0.00000000 0.01075269 0.02150538 0.03225806 0.03225806 0.04301075 0.04301075 ...

Slot "alpha.values":

[[1]]

[1] Inf 0.991096434 0.984667270 0.984599159 0.983494405 0.970641299 0.959417835 ...

17/143

(18)

ROCR : Visualizing the Performance of Scoring Classifiers

> perf <- performance(pred, measure="tpr", x.measure="fpr")
> plot(perf, colorize=TRUE, lwd=3, asp=1, main="ROC curve")
> abline(a=0, b=1)

NOTE: use add = TRUE to add more curves to the plot, e.g., plot(svm.perf, add = TRUE).

Plot ROC curves to compare multiple classifiers (using ROCR): https://rpubs.com/JanpuHou/359286

18/143

(19)

ROCR : Visualizing the Performance of Scoring Classifiers

> # more examples

> demo(ROCR)

> acc <- performance(pred, measure = "acc")

> str(acc)

Formal class 'performance' [package "ROCR"] with 6 slots

..@ x.name : chr "Cutoff"

..@ y.name : chr "Accuracy"

..@ alpha.name : chr "none"

..@ x.values :List of 1

.. ..$ : num [1:201] Inf 0.991 0.985 0.985 0.983 ...

..@ y.values :List of 1

.. ..$ : num [1:201] 0.67 0.675 0.68 0.685 0.68 ...

..@ alpha.values: list()

> acc@y.values[[1]]

[1] 0.670 0.675 0.680 0.685 0.680 0.685 0.680...

> auc <- performance(pred, measure = "auc")

> str(auc)

> auc@y.values[[1]]

[1] 0.6957259

19/143

(20)

Example: Evaluate svm classifier using ROCR

More examples: https://rpubs.com/JanpuHou/359286

# mlbench: Machine Learning Benchmark Problems
# install.packages("mlbench")
library(mlbench)
data(BreastCancer)          # Wisconsin Breast Cancer Database
dim(BreastCancer)           # 699 x 11
levels(BreastCancer$Class)  # "benign", "malignant"
head(BreastCancer)
?BreastCancer
summary(BreastCancer)

# keep columns 2 to 6 as numeric predictors (drops Id and columns 7 to 11)
x <- as.data.frame(lapply(BreastCancer[, -c(1, 7:11)], as.numeric))
dim(x)                      # 699 5
y <- BreastCancer$Class
n <- nrow(x)
p <- ncol(x)

id <- sample(1:n, floor(n * 0.8))  # 80% of the rows for training
length(id)
x.train <- x[id, ]
x.test <- x[-id, ]
y.train <- y[id]
y.test <- y[-id]

library(e1071)
model <- svm(x.train, y.train)
summary(model)
pred.1 <- predict(model, x.test)
table(y.test, pred.1)

# compute decision values and probabilities:
pred.2 <- predict(model, x.test, decision.values = TRUE)
str(pred.2)
pred.values <- attr(pred.2, "decision.values")
pred.values

library(ROCR)
# since Levels: benign < malignant
pred.svm <- prediction(-pred.values, y.test)
perf.svm <- performance(pred.svm, "tpr", "fpr")
plot(perf.svm, lwd=3, main="ROC Curve")
abline(a=0, b=1, col="grey")
legend(0.6, 0.4, "svm", lty=1, lwd=3)
auc <- performance(pred.svm, measure = "auc")
auc@y.values[[1]]

20/143

(21)

K-Nearest Neighbors

The number of nearest neighbors, k, is user-defined.

1. Find the k nearest samples (in Euclidean distance) in the training set to the new sample to be classified.

2. Determine the proportion of neighbor samples from each class and cast a "vote" for each class.

3. Calculate p-values for the likelihood of the observed representation of each class.

4. Compute the ratio between the p-value of the most highly represented class and the p-value of the next most highly represented class.

5. Allow a "no prediction" result if the p-value ratio is above the decision cutoff for the P-value ratio.
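Steps 1 and 2 can be sketched in a few lines of base R. The helper `knn_vote` is a hypothetical name introduced here for illustration; it implements only the distance and voting steps, not the p-value steps 3 to 5, and it breaks ties by taking the first class in level order.

```r
# Hypothetical helper: classify one new point by majority vote among its
# k nearest training samples (Euclidean distance).
knn_vote <- function(train, cl, newx, k = 3) {
  # Step 1: distance from newx to every training row, then the k nearest.
  d <- sqrt(rowSums(sweep(as.matrix(train), 2, unlist(newx))^2))
  nearest <- order(d)[1:k]
  # Step 2: proportion of neighbours per class; the largest share wins.
  votes <- table(cl[nearest])
  list(class = names(votes)[which.max(votes)], proportion = votes / k)
}

set.seed(1)
test.id <- sample(nrow(iris), 30)
train   <- iris[-test.id, 1:4]
cl      <- iris$Species[-test.id]

pred <- sapply(test.id, function(i) knn_vote(train, cl, iris[i, 1:4], k = 3)$class)
mean(pred == iris$Species[test.id])   # test-set accuracy
```

In practice `knn()` from the class package does the same job far more efficiently; the sketch is only meant to expose the mechanics.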

wikipedia

21/143

(22)

knn {class}

k-Nearest Neighbour Classification

class: Functions for Classification

Various functions for classification, including k-nearest neighbour, Learning Vector Quantization and Self-Organizing Maps.

> library(class)

> sel <- sample(1:50, 30)

> iris.train <- rbind(iris3[sel,,1], iris3[sel,,2], iris3[sel,,3])

> iris.test <- rbind(iris3[-sel,,1], iris3[-sel,,2], iris3[-sel,,3])

> y.train <- factor(c(rep("s", 30), rep("c", 30), rep("v", 30)))

> y.test.true <- factor(c(rep("s", 20), rep("c", 20), rep("v", 20)))

> y.test.pred <- knn(train=iris.train, test=iris.test, cl=y.train, k = 3)

> ct <- table(y.test.true, y.test.pred)

> ct

           y.test.pred
y.test.true  c  s  v
          c 18  0  2
          s  0 20  0
          v  1  0 19

> (accuracy <- sum(diag(ct))/sum(ct))
[1] 0.95

> iris3[1:3,,]

, , Setosa

     Sepal L. Sepal W. Petal L. Petal W.
[1,]      5.1      3.5      1.4      0.2
[2,]      4.9      3.0      1.4      0.2
[3,]      4.7      3.2      1.3      0.2

, , Versicolor

     Sepal L. Sepal W. Petal L. Petal W.
[1,]      7.0      3.2      4.7      1.4
[2,]      6.4      3.2      4.5      1.5
[3,]      6.9      3.1      4.9      1.5

, , Virginica

     Sepal L. Sepal W. Petal L. Petal W.
[1,]      6.3      3.3      6.0      2.5
[2,]      5.8      2.7      5.1      1.9
[3,]      7.1      3.0      5.9      2.1

22/143

(23)

Apply KNN to Microarray Data



• Samples: bone marrow
  • ALL (acute lymphoblastic leukemia): 27 patients
  • AML (acute myeloid leukemia): 11 patients
• Genes: 7070 genes

Golub, T.R. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537. [cited 12815 times]

Cancer Genomics Program at Whitehead Institute for Genome Research: http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi

23/143

(24)

Linear Discriminant Analysis (LDA)



• LDA (Fisher, 1936) finds the linear combinations $w^T x$ of $x = (x_1, \dots, x_p)$ with large ratios of between-groups to within-groups sums of squares.

• LDA is a supervised dimension-reduction method for classification problems.

• Given samples from two classes $C_1$ and $C_2$, we want to find the direction, defined by a vector $w$, such that when the data are projected onto $w$, the examples from the two classes are as well separated as possible.
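For two classes, the direction that maximizes the ratio of between-class to within-class scatter is known to be proportional to $S_W^{-1}(m_1 - m_2)$, where $m_1, m_2$ are the class means and $S_W$ is the within-class scatter matrix. The sketch below computes it in base R; the choice of two iris species and two predictors is an arbitrary assumption made for illustration.

```r
# Two classes and two predictors from iris (arbitrary illustrative choice).
vars <- c("Petal.Length", "Petal.Width")
X1 <- as.matrix(iris[iris$Species == "versicolor", vars])
X2 <- as.matrix(iris[iris$Species == "virginica",  vars])

m1 <- colMeans(X1)
m2 <- colMeans(X2)

# Within-class scatter: sum of the centred cross-product matrices.
SW <- crossprod(sweep(X1, 2, m1)) + crossprod(sweep(X2, 2, m2))

w <- solve(SW, m1 - m2)    # Fisher direction, defined up to a scale constant
w <- w / sqrt(sum(w^2))    # normalise: only the direction matters

z1 <- X1 %*% w             # one-dimensional projections of each class
z2 <- X2 %*% w
c(mean(z1), mean(z2))
```

Up to sign and scale, this direction should agree with the single discriminant returned in `lda(...)$scaling` from MASS when the same two classes and predictors are used.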

Source: https://rpubs.com/Nolan/298913

24/143

(25)

LDA: Methodology (1/3)

25/143

(26)

LDA: Methodology (2/3)

26/143

(27)

LDA: Methodology (3/3)

• Because it is the direction that is important for us and not the magnitude, set $c = 1$ and find $w$.

NOTE:

27/143

(28)

LDA: More Than 2 Classes

28/143

(29)

LDA: Classification

• Fisher's linear discriminant is optimal if the classes are normally distributed.

• After projection, for the two classes to be well separated, we would like the means to be as far apart as possible and the examples of each class to be scattered in as small a region as possible.

29/143

(30)

LDA Assumptions

• The predictors are multivariate normal within groups. This assumption implies that the predictors have linear relationships.
• Homogeneity of covariance (within groups).
• Independence.

Functions for checking the assumptions:

• aq.plot {mvoutlier}: Detect Outliers
• shapiro.test {stats}: Shapiro-Wilk Normality Test
• mshapiro.test {mvnormtest}: Normality test for multivariate variables
• bartlett.test {stats}: Test of Homogeneity of Variances

# LDA in R

• lda {MASS}
• partimat {klaR}
• train {caret}
• LDA {flipMultivariates}
• discrimin {ade4}

30/143

(31)

LDA in R:

lda {MASS}

> library(MASS)
> data <- iris[,1:4]
> class <- iris[,5]
> iris.lda <- lda(x=data, grouping=class)
> # same as iris.lda <- lda(Species ~ ., iris)
> iris.lda
Call:
lda(data, grouping = class)

Prior probabilities of groups:
    setosa versicolor  virginica
 0.3333333  0.3333333  0.3333333

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

Coefficients of linear discriminants:
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

Proportion of trace:
   LD1    LD2
0.9912 0.0088

> plot(iris.lda, col=as.integer(class)+1)
> fit <- lda(x=data, grouping=class, CV=TRUE)
> (ct <- table(class, fit$class))

class        setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          1        49
> diag(prop.table(ct, 1))
    setosa versicolor  virginica
      1.00       0.96       0.98
> # total percent correct
> sum(diag(prop.table(ct)))
[1] 0.98

> iris.lda.predict <- predict(iris.lda)

31/143

(32)

LDA in R:

lda {MASS}

> lda.dim1 <- as.matrix(data) %*% iris.lda$scaling[,1]
> lda.dim2 <- as.matrix(data) %*% iris.lda$scaling[,2]
> plot(lda.dim1, lda.dim2, col=class, asp=1)
>
> ## LDA for classification
> set.seed(123456)
> trainingIndex <- sample(1:150, 75)
> trainingSample <- iris[trainingIndex, ]
> testSample <- iris[-trainingIndex, ]
> table(iris$Species[trainingIndex])

    setosa versicolor  virginica
        24         28         23
>
> ldaRule <- lda(Species ~ ., iris, subset = trainingIndex)
> # plot(ldaRule)
> ldaRule.predict <- predict(ldaRule, testSample)
> names(ldaRule.predict)
[1] "class" "posterior" "x"
> ldahist(data = ldaRule.predict$x[,1], g=testSample$Species)
> # plot(ldaRule, dimen = 1, type = "b")
>
> table(testSample$Species, ldaRule.predict$class)

             setosa versicolor virginica
  setosa         26          0         0
  versicolor      0         21         1
  virginica       0          1        26

32/143

(33)

Dudoit S., J. Fridlyand, and T. P. Speed (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. JASA 97 (457), 77-87.

Exercise: Apply LDA to Microarray Data (select genes)

lymphoma_62x4026.txt

> lymphoma <- read.table("lymphoma_62x4026.txt", sep="\t", row.names=1)
> dim(lymphoma)
[1] 62 4027
> lymphoma[1:5, 1:4]
  V2      V3      V4      V5
1  0 -0.3780 -0.7255 -0.5349
2  0 -1.0103 -0.9069 -0.4071
3  0  0.2696  0.1540  0.2696
4  0 -0.6809 -0.9298 -0.6809
5  0 -0.7706 -0.9311 -1.0382

> group <- lymphoma[, 1]

> xdata.orig <- lymphoma[, 2:ncol(lymphoma)]

33/143

(34)

Heatmap for Lymphoma Microarray Data

> library(fields)
> gbr <- two.colors(start="green",
+                   middle="black",
+                   end="red")
> gcol <- c("red", "blue", "green")
> xdata <- xdata.orig
> range(xdata)
[1] -9.542 9.415
> xdata[xdata > 5] <- 5
> xdata[xdata < -5] <- -5
> heatmap(as.matrix(xdata), col = gbr,
+         Rowv=NULL,
+         RowSideColors = gcol[group+1],
+         margins = c(5,10),
+         xlab = "genes",
+         ylab = "subjects",
+         main = "lymphoma_62x4026")

34/143

(35)

Gene Selection

bw.ratio <- function(x, y) {
  # ratio of between-group to within-group sum of squares for one gene
  gm <- tapply(x, y, mean)      # group means
  repm <- gm[as.character(y)]   # group mean matched to each observation (no need for sorted rows)
  wss <- sum((x - repm)^2)
  bss <- sum((gm - mean(x))^2)
  bss / wss
}

> bw.values <- apply(xdata.orig, 2, bw.ratio, group)
> top <- 50
> selected.genes <- order(bw.values, decreasing = TRUE)[1:top]
> xdata.selected <- xdata.orig[, selected.genes]
> range(xdata.selected)
[1] -6.127 9.415
> xdata.selected[xdata.selected > 5] <- 5
> xdata.selected[xdata.selected < -5] <- -5
> heatmap(as.matrix(xdata.selected), col = gbr, Rowv=NULL,
+         RowSideColors = c("red", "blue", "green")[group+1],
+         margins = c(5,10),
+         xlab = "genes", ylab = "subjects", main = "lymphoma_62x50")

35/143

(36)

Exercise: Apply LDA to Microarray Data

36/143

(37)

Why Discriminant Analysis: Comparison with Logistic Regression

• When the classes of the response variable Y are well separated, the parameter estimates for the logistic regression model are surprisingly unstable. LDA & QDA do not suffer from this problem.

• If n is small and the distribution of the predictors X is approximately normal in each of the classes, the LDA & QDA models are again more stable than the logistic regression model.

• LDA & QDA are often preferred over logistic regression when we have more than two non-ordinal response classes.

• LDA & QDA have assumptions that are often more restrictive than those of logistic regression.

Source: http://uc-r.github.io/discriminant_analysis

37/143

(38)

The Assumptions of LDA & QDA

• Both LDA and QDA assume that the predictor variables X are drawn from a multivariate Gaussian distribution.

• LDA assumes equality of covariances among the predictor variables X across all levels of Y. This assumption is relaxed in the QDA model.

• LDA and QDA require the number of predictor variables (p) to be less than the sample size (n).
  • Performance declines severely as p approaches n.
  • A simple rule of thumb is to use LDA & QDA on data sets where n ≥ 5 × p.

Source: http://uc-r.github.io/discriminant_analysis

38/143

(39)

Compare LDA with QDA

• LDA is a much less flexible classifier than QDA, and so has substantially lower variance.

• If LDA's assumption that the predictor variables share a common covariance matrix across the Y response classes is badly off, then LDA can suffer from high bias.

• LDA tends to be a better bet than QDA if there are relatively few training observations, so that reducing variance is crucial.

• In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix is clearly untenable.

Source: http://uc-r.github.io/discriminant_analysis
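A quick way to see the trade-off on a concrete data set is to compare leave-one-out cross-validated accuracy for the two classifiers. The sketch below assumes the MASS package and uses iris purely as an illustration; which method wins on other data depends on sample size and on how different the class covariance matrices are.

```r
library(MASS)  # lda() and qda()

# CV = TRUE returns leave-one-out cross-validated class predictions.
fit.lda <- lda(Species ~ ., data = iris, CV = TRUE)
fit.qda <- qda(Species ~ ., data = iris, CV = TRUE)

acc <- c(LDA = mean(fit.lda$class == iris$Species),
         QDA = mean(fit.qda$class == iris$Species))
acc
```

Because both fits use the same leave-one-out scheme, the two accuracies are directly comparable.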

39/143

(40)

HDLSS (High-Dimension, Low-Sample-Size) Problems in LDA

• In the high-dimensional setting (p >> n), LDA is not appropriate for two reasons:
  • First, the standard estimate of the within-class covariance matrix is singular, and so the usual discriminant rule cannot be applied.
  • Second, when p is large, it is difficult to interpret the classification rule obtained from LDA, since the rule involves a linear combination of all p features.
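The singularity in the first point is easy to verify numerically: with n observations in K classes, the pooled within-class covariance estimate has rank at most n - K, which is far below p when p >> n. The sketch below uses simulated data; the sizes n = 20, p = 100, K = 2 are assumptions chosen for illustration.

```r
set.seed(1)
n <- 20; p <- 100; K <- 2
x <- matrix(rnorm(n * p), nrow = n)
y <- rep(1:K, each = n / K)

# Pooled within-class covariance: centre each class at its own mean, then pool.
xc <- x
for (k in 1:K) xc[y == k, ] <- scale(x[y == k, ], center = TRUE, scale = FALSE)
Sw <- crossprod(xc) / (n - K)

sw.rank <- qr(Sw)$rank   # numerical rank, bounded by n - K
min.eig <- min(eigen(Sw, symmetric = TRUE, only.values = TRUE)$values)
c(rank = sw.rank, p = p)
min.eig                  # numerically zero, so Sw cannot be inverted
```

This is why the methods cited on this slide replace the sample estimate with a regularized or diagonal one.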

• Friedman, J. H. (1989). Regularized discriminant analysis. J. Am. Stat. Assoc. 84, 165–175.

• Hastie, T., Buja, A. and Tibshirani, R. (1995) Penalized discriminant analysis. Ann. Statist., 23, 73–102.

• Guo, Y., Hastie, T. and Tibshirani, R. (2007) Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 8, 86–100.

• Witten, D. M. and Tibshirani, R. (2011) Penalized classification using Fisher's linear discriminant. J. R. Statist. Soc. B, 73(5), 753–772.

40/143

(41)

Classical LDA

41/143

(42)

Positive Definite Estimate

• A symmetric matrix is "positive definite" if all of its eigenvalues are positive.
• A sample covariance matrix is always positive semi-definite.
• A sample covariance matrix is positive definite only when it has full rank.
• If the input covariance or correlation matrix being analyzed is not positive definite:
  • generalized least squares (GLS) estimation cannot be used, because it requires the analyzed covariance or correlation matrix to be positive definite, and
  • maximum likelihood (ML) estimation will also perform poorly.

• Krzanowski, W. J., Jonathan, P., McCarthy, W. V. and Thomas, M. R. (1995) Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data. Appl. Statist., 44, 101-115.

• Xu, P., Brock, G. and Parrish, R. (2009) Modified linear discriminant analysis approaches for classification of high- dimensional microarray data. Computnl Statist. Data Anal., 53, 1674-1687.

42/143

(43)

General Form of Penalized Linear Discriminant Analysis

Lasso: least absolute shrinkage and selection operator

43/143

(44)

> library(penalizedLDA)
> library(fields)
>
> set.seed(12345)
> # an example modified from penalizedLDA package
> n <- 20
> p <- 100
> x.train <- matrix(rnorm(n*p), ncol=p)
> x.test <- matrix(rnorm(n*p), ncol=p)
> y <- c(rep(1, 5), rep(2, 5), rep(3, 10))
>
> x.train[y==1, 1:10] <- x.train[y==1, 1:10] + 2
> x.train[y==2, 11:20] <- x.train[y==2, 11:20] - 2
>
> x.test[y==1, 1:10] <- x.test[y==1, 1:10] + 2
> x.test[y==2, 11:20] <- x.test[y==2, 11:20] - 2
>
> heatmap(x.train, Rowv=NA, Colv=NA, RowSideColors=c("red", "blue", "green")[y],
+         col=two.colors(start="darkblue", middle="white", end="darkred"))

PenalizedLDA {penalizedLDA}

Usage

PenalizedLDA(x, y, xte=NULL, type = "standard", lambda, K = 2, chrom = NULL, lambda2 = NULL, standardized = FALSE, wcsd.x = NULL, ymat = NULL, maxiter = 20, trace=FALSE)

Solve Fisher's discriminant problem in high dimensions using (a) a diagonal estimate of the within-class covariance matrix, and (b) lasso (type="standard") or fused lasso (type="ordered") penalties on the discriminant vectors.

44/143

(45)

PenalizedLDA {penalizedLDA}

> fit.plda <- PenalizedLDA(x.train, y, x.test, lambda=0.14, K=2)

> print(fit.plda)

Number of discriminant vectors: 2

Number of nonzero features in discriminant vector 1 : 34
Number of nonzero features in discriminant vector 2 : 43
Total number of nonzero features: 54

Details:
Type: standard
Lambda: 0.14

> plot(fit.plda)
> str(fit.plda)
List of 12

$ ypred : int [1:20, 1:2] 1 1 1 1 1 2 3 2 2 2 ...

$ discrim: num [1:100, 1:2] -0.412 -0.012 -0.0738 -0.1294 ...

$ xproj : num [1:20, 1:2] -5.74 -4.86 -4.24 -5.71 -5.94 ...

$ xteproj: num [1:20, 1:2] -3.42 -3.67 -3.94 -3.32 -3.65 ...

$ K : num 2

$ crits :List of 2

..$ : num [1:13] -0.931 3.283 3.335 3.342 3.345 ...

..$ : num [1:7] -0.882 0.937 0.954 0.955 0.955 ...

$ type : chr "standard"

$ lambda : num 0.14

$ lambda2: NULL

$ wcsd.x : num [1:100] 0.733 1.099 1.252 1.102 0.974 ...

$ x : num [1:20, 1:100] 2.59 2.71 1.89 1.55 2.61 ...

$ y : num [1:20] 1 1 1 1 1 2 2 2 2 2 ...

- attr(*, "class")= chr "penlda"

45/143

(46)

PenalizedLDA {penalizedLDA}

> par(mfrow=c(1, 2))

> plot(fit.plda$xproj[,1:2], col=y+1, main="training data")

> plot(fit.plda$xteproj[,1:2], col=fit.plda$ypred+1, main="test data")

> pred.fit.plda <- predict(fit.plda, xte=x.test)

> pred.fit.plda

$ypred

      [,1] [,2]
 [1,]    1    1
 [2,]    1    1
 [3,]    1    1
 [4,]    1    1
 [5,]    1    1
 [6,]    2    2
 [7,]    3    2
...
[18,]    3    3
[19,]    3    3
[20,]    3    3

46/143

(47)

Decision Tree

• A decision tree is a classifier expressed as a recursive partition of the instance space.

• A decision tree consists of internal nodes that represent the decisions corresponding to the hyperplanes or split points (i.e., which half-space a given point lies in), and leaf nodes that represent regions or partitions of the data space, which are labeled with the majority class.

• To classify a new test point, we recursively evaluate which half-space it belongs to until we reach a leaf node in the decision tree, at which point we predict its class as the label of the leaf.

Image source: Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, 1st edition, Pearson (May 12, 2005).

47/143

(48)

Decision Tree

• A decision tree uses an axis-parallel hyperplane to split the data space $R$ into two resulting half-spaces or regions, say $R_1$ and $R_2$, which also induces a partition of the input points into $D_1$ and $D_2$, respectively.

• Each of these regions is recursively split via axis-parallel hyperplanes until the points within an induced partition are relatively pure in terms of their class labels.

• The resulting hierarchy of split decisions constitutes the decision tree model, with the leaf nodes labeled with the majority class among points in those regions.

Yan-yan SONG and Ying LU, Decision tree methods: applications for classification and prediction, Shanghai Arch Psychiatry. 2015 Apr 25; 27(2): 130–135.

48/143

(49)

Decision Rules



A tree can be read as a set of decision rules, with each rule's antecedent comprising the decisions on the internal nodes along a path to a leaf, and its consequent being the label of the leaf node.



Because the regions are all disjoint and cover the entire space, the set of rules can be interpreted as a set of alternatives or disjunctions.

Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, May 2014.

49/143

(50)

Representation of a Decision Tree

Rule representation

Conditional inference tree with 4 terminal nodes

Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 138

1) Petal.Length <= 1.9; criterion = 1, statistic = 128.431
  2)* weights = 44
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 63.498
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.638
      5)* weights = 43
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 43

Tree representation

50/143

(51)

Axis-Parallel Hyperplanes



• Training dataset $D = \{x_i, y_i\}$: consists of $n$ points in a $d$-dimensional space, where $x_i$ holds the numeric variables and $y_i$ is the class label.

• A hyperplane $h(x)$ is defined as the set of all points $x$ that satisfy
  $$h(x): w^T x + b = 0$$

• $w$ is a weight vector that is normal to the hyperplane, and $b$ is the offset of the hyperplane from the origin.

• A decision tree considers only axis-parallel hyperplanes, that is, the weight vector must be parallel to one of the original dimensions or axes $X_j$.

51/143

(52)

Split Points and Data Partition



• A hyperplane specifies a decision or split point because it splits the data space $R$ into two half-spaces. All points $x$ such that $h(x) \le 0$ are on the hyperplane or to one side of it, whereas all points such that $h(x) > 0$ are on the other side.

• The generic form of a split point for a numeric attribute $X_j$ is given as
  $$X_j \le v$$
  where $v = -b$ is some value in the domain of attribute $X_j$.

• The decision or split point $X_j \le v$ thus splits the input data space $R$ into two regions $R_Y$ and $R_N$, which denote the set of all possible points that satisfy the decision and those that do not.

• A split point of the form $X_j \le v$ induces the data partition
  $$D_Y = \{x \mid x \in D,\ x_j \le v\}, \qquad D_N = \{x \mid x \in D,\ x_j > v\}$$

52/143

(53)

Purity

 The purity of a region $R_j$ is defined in terms of the mixture of classes for points in the corresponding data partition $D_j$.

 Formally, purity is the fraction of points with the majority label in $D_j$, that is, $\mathrm{purity}(D_j) = \max_i \{ n_{ji} / n_j \}$, where $n_j = |D_j|$ is the total number of data points in the region $R_j$, and $n_{ji}$ is the number of points in $D_j$ with class label $c_i$.
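The definition of purity as the fraction of points carrying the majority label can be sketched in a few lines of R. The function name purity is an assumption made for illustration, and the input is assumed to be a non-empty vector of class labels.

```r
# Purity of a partition: share of points that carry the majority label.
# `labels` is assumed to be a non-empty vector (or factor) of class labels.
purity <- function(labels) {
  counts <- table(labels)        # n_ji: number of points per class
  max(counts) / length(labels)   # max_i n_ji / n_j
}

purity(c("a", "a", "a", "b"))    # three of four points share the majority label
purity(iris$Species)             # three equally frequent classes
```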

53/143

(54)

Purity Example

 Use a size threshold of 5 and a purity threshold of 0.95 in this example.

 A region will be split further only if the number of points is more than 5 and the purity is less than 0.95.

54/143

(55)

Decision Tree Algorithm
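A minimal sketch of a recursive partitioning procedure in R, assuming numeric attributes only, information gain as the split score, and a size threshold and purity threshold as stopping rules. The function and argument names (entropy, best_split, grow_tree, predict_one, eta, min_purity) are hypothetical, and the exhaustive search is written for clarity rather than speed.

```r
# Shannon entropy (base 2) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                      # drop empty classes: 0 * log2(0) := 0
  -sum(p * log2(p))
}

# Exhaustive search for the axis-parallel split X_j <= v with the highest
# information gain; candidate v are midpoints of successive distinct values.
best_split <- function(X, y) {
  best <- list(gain = 0, var = NULL, v = NULL)
  for (j in names(X)) {
    vals <- sort(unique(X[[j]]))
    if (length(vals) < 2) next
    for (v in (head(vals, -1) + tail(vals, -1)) / 2) {
      yes <- X[[j]] <= v
      gain <- entropy(y) -
        (mean(yes) * entropy(y[yes]) + mean(!yes) * entropy(y[!yes]))
      if (gain > best$gain) best <- list(gain = gain, var = j, v = v)
    }
  }
  best
}

# Recursive partitioning with a size threshold (eta) and a purity threshold
# (min_purity); both argument names are hypothetical.
grow_tree <- function(X, y, eta = 5, min_purity = 0.95) {
  counts <- table(y)
  leaf <- list(leaf = TRUE, label = names(counts)[which.max(counts)],
               n = length(y))
  if (length(y) <= eta || max(counts) / length(y) >= min_purity) return(leaf)
  s <- best_split(X, y)
  if (is.null(s$var)) return(leaf)   # no split reduces the entropy
  yes <- X[[s$var]] <= s$v
  list(leaf = FALSE, var = s$var, v = s$v,
       yes = grow_tree(X[yes, , drop = FALSE], y[yes], eta, min_purity),
       no  = grow_tree(X[!yes, , drop = FALSE], y[!yes], eta, min_purity))
}

# Follow the decisions from the root to a leaf for one observation.
predict_one <- function(node, x) {
  while (!node$leaf) node <- if (x[[node$var]] <= node$v) node$yes else node$no
  node$label
}

tree <- grow_tree(iris[, 1:4], iris$Species)
predict_one(tree, iris[1, ])
```

The defaults eta = 5 and min_purity = 0.95 mirror the thresholds of the purity example. A region becomes a leaf when it is small enough, pure enough, or when no candidate split lowers the entropy.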

55/143

(56)

Split Point Evaluation Measures: Entropy

 Entropy, $H(D)$, measures the amount of disorder or uncertainty in a system. For a dataset $D$ with $k$ classes, $H(D) = -\sum_{i=1}^{k} p(c_i \mid D) \log_2 p(c_i \mid D)$.

 In the classification setting, a partition has lower entropy (less disorder) if it is relatively pure (most points come from the same class).

 A partition has higher entropy (more disorder) if the class labels are mixed and there is no majority class as such.

 If a region is pure, then the entropy is zero.

 If the classes are all mixed up, and each appears with equal probability $p(c_i \mid D) = 1/k$, then the entropy has the highest value: $H(D) = \log_2 k$.
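The two extreme cases can be checked numerically. The sketch below assumes the usual Shannon entropy with base-2 logarithms, H(D) = -sum_i p(c_i | D) log2 p(c_i | D); the function name entropy is illustrative.

```r
# Shannon entropy (base 2) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions p(c_i | D)
  p <- p[p > 0]                        # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(rep("a", 4))        # pure partition: zero entropy
entropy(iris$Species)       # three equally likely classes: log2(3)
log2(3)
```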

56/143

(57)

Split Entropy and Information Gain

 Assume that a split point partitions $D$ into $D_Y$ and $D_N$. The split entropy is defined as the weighted entropy of the resulting partitions: $H(D_Y, D_N) = \frac{n_Y}{n} H(D_Y) + \frac{n_N}{n} H(D_N)$, where $n = |D|$, $n_Y = |D_Y|$, and $n_N = |D_N|$.

 To see if the split point results in a reduced overall entropy, define the information gain for a given split point as $\mathrm{Gain}(D, D_Y, D_N) = H(D) - H(D_Y, D_N)$.

 The higher the information gain, the greater the reduction in entropy, and the better the split point.

 Given split points and their corresponding partitions, we can score each split point and choose the one that gives the highest information gain.
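A small R sketch of the split entropy and information gain, assuming the split entropy weights each partition by its share of the points and the gain is the entropy of D minus the split entropy. The function names and the logical vector yes (TRUE for points in D_Y) are illustrative assumptions.

```r
# Shannon entropy (base 2) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Weighted entropy of the two partitions; `yes` marks the points in D_Y.
split_entropy <- function(labels, yes) {
  mean(yes) * entropy(labels[yes]) + mean(!yes) * entropy(labels[!yes])
}

# Reduction in entropy achieved by the split.
info_gain <- function(labels, yes) {
  entropy(labels) - split_entropy(labels, yes)
}

yes <- iris$Petal.Length <= 1.9
split_entropy(iris$Species, yes)
info_gain(iris$Species, yes)
```

For this split D_Y contains only setosa (entropy 0) and D_N holds the two remaining classes in equal numbers (entropy 1), so the split entropy is 2/3 and the gain is log2(3) - 2/3.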

57/143

(58)

Gini Index

 The Gini index is used to measure the purity of a split point: $G(D) = 1 - \sum_{i=1}^{k} p(c_i \mid D)^2$.

 If the partition is pure, then the probability of the majority class is 1 and the probability of all other classes is 0, and thus the Gini index is 0.

 When each class is equally represented, with probability $p(c_i \mid D) = 1/k$, then the Gini index has value $(k-1)/k$.

 Higher values of the Gini index indicate more disorder, and lower values indicate more order in terms of the class labels.
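The two boundary values of the Gini index can be verified with a short R sketch, assuming the common definition G(D) = 1 - sum_i p(c_i | D)^2; the function name gini is illustrative.

```r
# Gini index of a vector of class labels.
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions p(c_i | D)
  1 - sum(p^2)
}

gini(rep("a", 5))      # pure partition: 0
gini(iris$Species)     # k = 3 equally represented classes: (k - 1) / k
```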

58/143

(59)

Other Measures

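 The weighted Gini index of a split point is $G(D_Y, D_N) = \frac{n_Y}{n} G(D_Y) + \frac{n_N}{n} G(D_N)$. The lower the weighted Gini index value, the better the split point.

 The Classification And Regression Trees (CART) measure, $\mathrm{CART}(D_Y, D_N) = 2 \frac{n_Y}{n} \frac{n_N}{n} \sum_{i=1}^{k} |p(c_i \mid D_Y) - p(c_i \mid D_N)|$, prefers a split point that maximizes the difference between the class probability mass functions of the two partitions; the higher the CART measure, the better the split point.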
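A hedged R sketch of both measures. It assumes the weighted Gini index weights each partition by its share of the points, and that the CART measure is 2 (n_Y/n)(n_N/n) sum_i |p(c_i | D_Y) - p(c_i | D_N)|, as in Zaki and Meira. All function names are illustrative.

```r
# Class probability mass function of `labels` over a fixed set of classes.
class_pmf <- function(labels, classes) {
  as.numeric(table(factor(labels, levels = classes))) / length(labels)
}

# Gini index of a vector of class labels.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini index of a split; `yes` marks the points in D_Y.
weighted_gini <- function(labels, yes) {
  mean(yes) * gini(labels[yes]) + mean(!yes) * gini(labels[!yes])
}

# CART measure of a split: rewards balanced splits whose class PMFs differ.
cart_measure <- function(labels, yes) {
  classes <- unique(as.character(labels))
  diff <- abs(class_pmf(labels[yes], classes) - class_pmf(labels[!yes], classes))
  2 * mean(yes) * mean(!yes) * sum(diff)
}

yes <- iris$Petal.Length <= 1.9
weighted_gini(iris$Species, yes)   # lower is better
cart_measure(iris$Species, yes)    # higher is better
```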

59/143

(60)

The Class Probability Mass Function for a Partition

 If $X$ is a numeric attribute, we have to evaluate split points of the form $X \le v$.

 Consider only the midpoints between two successive distinct values for $X$ in the sample $D$.

 Let $\{v_1, \ldots, v_m\}$ denote the set of all such midpoints, such that $v_1 < v_2 < \cdots < v_m$.

 For each split point $X \le v$, we have to estimate the class PMFs: $\hat{p}(c_i \mid D_Y) = \hat{p}(c_i \mid X \le v)$ and $\hat{p}(c_i \mid D_N) = \hat{p}(c_i \mid X > v)$.

Review: Breslow, L. A. and Aha, D. W. (1997). Simplifying decision trees: A survey. Knowledge Engineering Review, 12(1): 1–40.

60/143

(61)

Estimate the Class PMFs

Using the Bayes theorem
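A sketch of the Bayes-theorem estimate, following the standard derivation in Zaki and Meira. The symbols $n_i$ (number of points in $D$ with class $c_i$) and $N_{vi}$ (number of points with class $c_i$ and $X \le v$) are notational assumptions introduced here.

```latex
\begin{align*}
\hat{p}(c_i \mid D_Y) &= \hat{p}(c_i \mid X \le v)
  = \frac{\hat{p}(X \le v \mid c_i)\,\hat{p}(c_i)}
         {\sum_{j=1}^{k} \hat{p}(X \le v \mid c_j)\,\hat{p}(c_j)},
\qquad \hat{p}(c_i) = \frac{n_i}{n}, \quad
       \hat{p}(X \le v \mid c_i) = \frac{N_{vi}}{n_i} \\
\Longrightarrow\quad
\hat{p}(c_i \mid D_Y) &= \frac{N_{vi}}{\sum_{j=1}^{k} N_{vj}},
\qquad
\hat{p}(c_i \mid D_N) = \hat{p}(c_i \mid X > v)
  = \frac{n_i - N_{vi}}{\sum_{j=1}^{k} \left(n_j - N_{vj}\right)}
\end{align*}
```

Both estimates therefore reduce to class counts on either side of the split point, so they can be updated incrementally while scanning the sorted values of $X$.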

61/143

(62)

Estimate the Class PMFs

62/143

(63)

Evaluate Numeric Attribute
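A minimal R sketch of evaluating one numeric attribute: it scores every midpoint between successive distinct values by information gain and returns the best one. This is a brute-force illustration under those assumptions, not the single-pass counting procedure. The function names are hypothetical.

```r
# Shannon entropy (base 2) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Score all candidate split points X <= v for one numeric attribute `x`
# and return the midpoint with the highest information gain.
evaluate_numeric_attribute <- function(x, labels) {
  vals <- sort(unique(x))
  if (length(vals) < 2) return(NULL)            # nothing to split on
  mids <- (head(vals, -1) + tail(vals, -1)) / 2 # midpoints v_1 < ... < v_m
  gains <- sapply(mids, function(v) {
    yes <- x <= v
    entropy(labels) -
      (mean(yes) * entropy(labels[yes]) + mean(!yes) * entropy(labels[!yes]))
  })
  best <- which.max(gains)
  list(v = mids[best], gain = gains[best])
}

res <- evaluate_numeric_attribute(iris$Petal.Length, iris$Species)
res$v
res$gain
```

On iris the best midpoint for Petal.Length lies between 1.9 and 3.0, the largest setosa value and the smallest value of the other two species.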

63/143

(64)

Characteristics of Decision Tree

 A decision tree (DT) simplifies complex relationships between input variables and target variables by dividing the original input variables into significant subgroups.

 Easy to understand and interpret.

 Non-parametric approach without distributional assumptions.

 Handles missing values easily, without needing to resort to imputation.

 Handles heavily skewed data easily, without needing to resort to data transformation.

 Robust to outliers (noise).

64/143

(65)

 Decision trees do not always deliver the best performance, and represent a trade-off between performance and simplicity of explanation.

 Finding an optimal decision tree is an NP-complete problem. Many DT algorithms therefore employ a heuristic-based approach to guide their search in the vast hypothesis space.

 Techniques developed for constructing DTs are computationally inexpensive.

 DTs provide an expressive representation for learning discrete-valued functions.

 The presence of redundant attributes does not adversely affect the accuracy of DTs.

 Data fragmentation problem: at the leaf nodes, the number of records may be too small to make a statistically significant decision about the class representation of the nodes.

65/143

(66)

 The choice of impurity measure has little effect on the performance of a decision tree. The strategy used to prune the tree has a greater impact on the final tree than the choice of impurity measure.

 The decision tree structure can represent both classification and regression models.

 The subtree replication problem: the same subtree can appear in several branches of a decision tree, which makes the tree more complex than necessary.

 The border between two neighboring regions of different classes is known as a decision boundary. A data set may not be classified effectively by a decision tree algorithm that uses test conditions involving only a single attribute at a time; this limitation can be overcome with test conditions that involve more than one attribute (oblique decision trees).

Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, 1st edition, Pearson, May 12, 2005.

66/143

(67)

Model Overfitting

 A good classification model must not only fit the training data well; it must also accurately classify records it has never seen before.

 Model overfitting: a model that fits the training data too well can have a higher generalization error than a model with a higher training error.

 Some potential causes of model overfitting:

 presence of noise.

 lack of representative samples.

 multiple comparison procedures.

 Surveys on decision tree pruning methods to avoid overfitting: Breslow and Aha (1997) and Esposito et al. (1997).

67/143

(68)

Popular Algorithms

 AID (Automatic Interaction Detection), CHAID (Chi-squared Automatic Interaction Detection)

 ID3 (Iterative Dichotomiser 3) (Quinlan, 1986), C4.5 (Quinlan, 1993), C5.0

 CART (Classification and Regression Trees) (Breiman et al., 1984).

 CART and C4.5 perform an exhaustive search over all possible splits, maximizing an information measure of node impurity and selecting the covariate showing the best split. This approach has two fundamental problems: overfitting and a selection bias towards covariates with many possible splits.

 QUEST (Quick, Unbiased, Efficient, Statistical Tree) (Loh and Shih, 1997)

 SAS algorithms: incorporate and extend most of the good ideas discussed for recursive partitioning with univariate splits.

Yan-yan Song and Ying Lu, Decision tree methods: applications for classification and prediction, Shanghai Arch Psychiatry. 2015 Apr 25; 27(2): 130–135.

68/143

(69)

R Package: party

 The party package (Hothorn, Hornik, and Zeileis 2006) aims at providing a recursive part(y)itioning laboratory assembling various high- and low-level tools for building tree-based regression and classification models.

 This includes conditional inference trees (ctree), conditional inference forests (cforest) and parametric model trees (mob).

 At the core of the package is ctree, an implementation of conditional inference trees which embed tree-structured regression models into a well-defined theory of conditional inference procedures.

 This non-parametric class of regression trees is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored as well as multivariate response variables and arbitrary measurement scales of the covariates.

69/143

(70)

R Package:

party

> library(party)

> str(iris)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> id <- sample(2, nrow(iris), replace = TRUE, prob = c(0.9, 0.1))

> train.data <- iris[id == 1, ]

> test.data <- iris[id == 2, ]

>

> # method 2

> #id <- sample(1:nrow(iris), nrow(iris)/10)

> #train.data <- iris[-id, ]

> #test.data <- iris[id, ]

>

> myModel <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

> iris.ctree <- ctree(myModel, data=train.data)

> table(predict(iris.ctree), train.data$Species)
             setosa versicolor virginica

Hothorn, T., Hornik, K., Strobl, C., and Zeileis, A. (2015). party: A Laboratory for Recursive Partytioning. http://cran.r-project.org/web/packages/party/. R package version 1.0-23.

70/143

(71)

R Package: party

> print(iris.ctree)

Conditional inference tree with 4 terminal nodes

Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 132

1) Petal.Length <= 1.9; criterion = 1, statistic = 123.873
  2)* weights = 45
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 58.5
    4) Petal.Length <= 4.7; criterion = 0.998, statistic = 12.429
      5)* weights = 40
    4) Petal.Length > 4.7
      6)* weights = 7
  3) Petal.Width > 1.7
    7)* weights = 40
>
> plot(iris.ctree)
> plot(iris.ctree, type="simple")
> testPred <- predict(iris.ctree, newdata = test.data)
> table(testPred, test.data$Species)
testPred     setosa versicolor virginica
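A confusion table like the one above can be summarized by its overall accuracy. The sketch below uses short made-up label vectors in place of testPred and test.data$Species, so the numbers are illustrative and are not results from the fitted tree.

```r
# Made-up labels standing in for test.data$Species and testPred.
actual <- factor(c("setosa", "setosa", "versicolor",
                   "versicolor", "virginica", "virginica"))
pred <- factor(c("setosa", "setosa", "versicolor",
                 "virginica", "virginica", "virginica"),
               levels = levels(actual))  # same level order keeps the table square

conf <- table(Predicted = pred, Actual = actual)
conf

# Overall accuracy: correctly classified cases lie on the diagonal.
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```

Giving both factors the same levels matters: if a class is never predicted, table() would otherwise drop its row and diag() would no longer line up with the correct cells.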

71/143

(72)

> plot(iris.ctree)

72/143

(73)

R Package: rpart

 Function rpart() is used to build a decision tree, and the tree with the minimum prediction error is selected.

 After that, it is applied to new data to make predictions with function predict().

Therneau, T., Atkinson, B., and Ripley, B. (2015). rpart: Recursive Partitioning and Regression Trees.

> install.packages("TH.data")
> data("bodyfat", package = "TH.data")
> ?bodyfat
> dim(bodyfat)
[1] 71 10
> head(bodyfat)
   age DEXfat waistcirc hipcirc elbowbreadth kneebreadth anthro3a anthro3b anthro3c anthro4
47  57  41.68     100.0   112.0          7.1         9.4     4.42     4.95     4.50    6.13
48  65  43.29      99.5   116.5          6.5         8.9     4.63     5.01     4.48    6.37
49  59  35.41      96.0   108.5          6.2         8.9     4.12     4.74     4.60    5.82
50  58  22.79      72.0    96.5          6.1         9.2     4.03     4.48     3.91    5.66
51  60  36.42      89.5   100.5          7.1        10.0     4.24     4.68     4.15    5.91
52  61  24.13      83.5    97.0          6.5         8.8     3.55     4.06     3.64    5.14

Example: bodyfat {TH.data} Prediction of Body Fat by Skinfold Thickness, Circumferences, and Bone Breadths

73/143

(74)

rpart

Example

Chapter 4: Decision Tree and Random Forest http://www.RDataMining.com

> set.seed(1234)

> id <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))

> bodyfat.train <- bodyfat[id==1,]

> bodyfat.test <- bodyfat[id==2,]

> # train a decision tree

> library(rpart)

> my.model <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth

> bodyfat.rpart <- rpart(my.model, data = bodyfat.train, control = rpart.control(minsplit = 10))

> bodyfat.rpart
n= 56

node), split, n, deviance, yval
      * denotes terminal node

 1) root 56 7265.0290000 30.94589
   2) waistcirc< 88.4 31 960.5381000 22.55645
     4) hipcirc< 96.25 14 222.2648000 18.41143
       8) age< 60.5 9 66.8809600 16.19222 *
       9) age>=60.5 5 31.2769200 22.40600 *
     5) hipcirc>=96.25 17 299.6470000 25.97000
      10) waistcirc< 77.75 6 30.7345500 22.32500 *
      11) waistcirc>=77.75 11 145.7148000 27.95818
        22) hipcirc< 99.5 3 0.2568667 23.74667 *
        23) hipcirc>=99.5 8 72.2933500 29.53750 *
   3) waistcirc>=88.4 25 1417.1140000 41.34880
     6) waistcirc< 104.75 18 330.5792000 38.09111
      12) hipcirc< 109.9 9 68.9996200 34.37556 *
      13) hipcirc>=109.9 9 13.0832000 41.80667 *
     7) waistcirc>=104.75 7 404.3004000 49.72571 *

74/143

(75)

rpart

Example

> attributes(bodyfat.rpart)

$names

[1] "frame" "where" "call" "terms" "cptable" "method"

[7] "parms" "control" "functions" "numresp" "splits" "variable.importance"

[13] "y" "ordered"

$xlevels

named list()

$class

[1] "rpart"

> str(bodyfat.rpart)

List of 14

$ frame :'data.frame': 15 obs. of 8 variables:

..$ var : Factor w/ 4 levels "<leaf>","age",..: 4 3 2 1 1 4 1 3 1 1 ...

..$ n : int [1:15] 56 31 14 9 5 17 6 11 3 8 ...

..$ wt : num [1:15] 56 31 14 9 5 17 6 11 3 8 ...

..$ dev : num [1:15] 7265 960.5 222.3 66.9 31.3 ...

..$ yval : num [1:15] 30.9 22.6 18.4 16.2 22.4 ...

..$ complexity: num [1:15] 0.6727 0.0604 0.0171 0.01 0.01 ...

..$ ncompete : int [1:15] 4 4 4 0 0 4 0 4 0 0 ...

..$ nsurrogate: int [1:15] 4 4 2 0 0 4 0 2 0 0 ...

$ where : Named int [1:56] 14 14 13 7 9 10 13 9 5 9 ...

..- attr(*, "names")= chr [1:56] "47" "48" "49" "50" ...

...

$ ordered : Named logi [1:5] FALSE FALSE FALSE FALSE FALSE

..- attr(*, "names")= chr [1:5] "age" "waistcirc" "hipcirc" "elbowbreadth" ...

- attr(*, "xlevels")= Named list()

- attr(*, "class")= chr "rpart"

75/143

(76)

rpart

Example

> # print the matrix of information on the optimal prunings based on a complexity parameter.

> print(bodyfat.rpart$cptable)

          CP nsplit  rel error    xerror       xstd
1 0.67272638      0 1.00000000 1.0194546 0.18724382
2 0.09390665      1 0.32727362 0.4415438 0.10853044
3 0.06037503      2 0.23336696 0.4271241 0.09362895
4 0.03420446      3 0.17299193 0.3842206 0.09030539
5 0.01708278      4 0.13878747 0.3038187 0.07295556
6 0.01695763      5 0.12170469 0.2739808 0.06599642
7 0.01007079      6 0.10474706 0.2693702 0.06613618
8 0.01000000      7 0.09467627 0.2695358 0.06620732

>

> plot(bodyfat.rpart)

> text(bodyfat.rpart, use.n=T)

76/143

(77)

rpart

Example

> #select the tree with the minimum prediction error

> opt <- which.min(bodyfat.rpart$cptable[,"xerror"])

> opt
7
7

> cp <- bodyfat.rpart$cptable[opt, "CP"]

> cp

[1] 0.01007079

> bodyfat.prune <- prune(bodyfat.rpart, cp = cp)

> bodyfat.prune
n= 56

node), split, n, deviance, yval
      * denotes terminal node

 1) root 56 7265.02900 30.94589
   2) waistcirc< 88.4 31 960.53810 22.55645
     4) hipcirc< 96.25 14 222.26480 18.41143
       8) age< 60.5 9 66.88096 16.19222 *
       9) age>=60.5 5 31.27692 22.40600 *
     5) hipcirc>=96.25 17 299.64700 25.97000
      10) waistcirc< 77.75 6 30.73455 22.32500 *
      11) waistcirc>=77.75 11 145.71480 27.95818 *
   3) waistcirc>=88.4 25 1417.11400 41.34880
     6) waistcirc< 104.75 18 330.57920 38.09111
      12) hipcirc< 109.9 9 68.99962 34.37556 *
      13) hipcirc>=109.9 9 13.08320 41.80667 *
     7) waistcirc>=104.75 7 404.30040 49.72571 *

> plot(bodyfat.prune)

> text(bodyfat.prune, use.n=T, cex=0.7)

77/143

(78)

rpart

Example

 After that, the selected tree is used to make predictions, and the predicted values are compared with the actual values.

 In the code below, function abline() draws a diagonal line.

 The predictions of a good model are expected to be equal or very close to their actual values, that is, most points should be on or close to the diagonal line.

> DEXfat.pred <- predict(bodyfat.prune, newdata=bodyfat.test)

> xlim <- range(bodyfat$DEXfat)

> plot(DEXfat.pred ~ DEXfat, data=bodyfat.test, xlab="Observed", ylab="Predicted", ylim=xlim, xlim=xlim)

> abline(a=0, b=1)
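Besides the visual check against the diagonal, closeness of predicted and observed values can be summarized numerically. The sketch below assumes root mean squared error and mean absolute error as summaries; the helper names rmse and mae and the toy numbers are illustrative and are not output from the bodyfat model.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Mean absolute error between observed and predicted values.
mae <- function(observed, predicted) mean(abs(observed - predicted))

# Toy values standing in for bodyfat.test$DEXfat and DEXfat.pred.
observed  <- c(20, 30, 40)
predicted <- c(22, 29, 37)

rmse(observed, predicted)
mae(observed, predicted)
```

With the objects from the example, the same helpers could be called as rmse(bodyfat.test$DEXfat, DEXfat.pred); points lying exactly on the diagonal would give an error of zero.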

Chapter 4: Decision Tree and Random Forest, http://www.RDataMining.com

