Random Forests - Parameter Selection

The Implementations

4.1 Parameter Selection

4.1.2 Random Forests

We consider the random forests implementation in Scikit-learn (Pedregosa et al., 2011), which provides its own implementation of the algorithm. Scikit-learn offers the following options for random forests models:

1. n_estimators: The number of trees in the random forests.

2. criterion: The function to measure the quality of a split.

3. max_features: The number of features to consider when looking for the best split.

4. max_depth: The maximum depth of the tree.

5. min_samples_split: The minimum number of samples required to split an internal node.

6. min_samples_leaf: The minimum number of samples required to be at a leaf node.

Because there are many parameters, it is not practical to check all combinations of parameter values. To save running time, we consider only a few candidates for each parameter. They are listed below.

n_estimators = {100},

criterion = {gini, entropy},

max_features = {√n, 0.001n, 0.002n, 0.005n, 0.01n, 0.02n, 0.05n, 0.1n, 0.2n, 0.5n},

max_depth = {unlimited, 200, 100},

min_samples_split = {2, 4},

min_samples_leaf = {1, 2}.
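The grid above can be enumerated mechanically, which also shows how many settings a full search has to train. The following is a minimal sketch using only the Python standard library. The encoding is an assumption chosen for illustration: None stands for an unlimited depth, "sqrt" for √n, and a float f for the fraction f of the n features. The helper name enumerate_grid is hypothetical and not part of any library.

```python
from itertools import product

# Candidate values copied from the grid above. "sqrt" stands for sqrt(n) and
# a float f stands for f * n, where n is the number of features (assumed encoding).
rf_grid = {
    "n_estimators": [100],
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5],
    "max_depth": [None, 200, 100],  # None encodes "unlimited"
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}


def enumerate_grid(grid):
    """Yield one dict per combination of candidate values (hypothetical helper)."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


rf_settings = list(enumerate_grid(rf_grid))
print(len(rf_settings))  # 1 * 2 * 10 * 3 * 2 * 2 = 240 settings
```

Even with these few candidates per parameter, the full grid contains 240 settings, each of which has to be trained and validated.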

Note that the parameters min_samples_split, min_samples_leaf and max_depth are related: each of them limits how far a tree can grow. Recall that the training time is related to the complexity O(N × d × p × l). When we have more instances, a smaller value of max_features (p) should be chosen; otherwise, the training time may be too long. Therefore, for the data sets url, kdd10b and kdd10b-raw, we limit max_depth to {10, 20, 50}, and reduce n_estimators and max_features to {20} and {0.0001n, 0.0002n, 0.0005n, 0.001n}, respectively. For url, we reduce max_features to {0.0001n, . . . , 0.01n}. For the data set kdd12, we can only run the parameters

{n_estimators, criterion, max_features, max_depth} = {20, gini, 0.0001n, 10}

in a reasonable time. Note that we also drop cross validation on these large data sets when training the random forests model, and use holdout validation with a split ratio of 4 : 1 (training : validation) instead. In addition, we fix min_samples_leaf and min_samples_split to {1} and {2}, respectively, because we observe that they do not affect the test accuracy very much.
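The 4 : 1 holdout validation mentioned above can be sketched as follows. This is only an illustration: the thesis does not state how the split was drawn, so the random shuffle, the fixed seed and the function name holdout_split are assumptions.

```python
import random


def holdout_split(num_instances, train_parts=4, valid_parts=1, seed=0):
    """Return (train_idx, valid_idx) for a train_parts : valid_parts holdout split."""
    indices = list(range(num_instances))
    # Shuffle with a fixed seed so that the split is reproducible (assumption).
    random.Random(seed).shuffle(indices)
    cut = num_instances * train_parts // (train_parts + valid_parts)
    return indices[:cut], indices[cut:]


train_idx, valid_idx = holdout_split(1000)
print(len(train_idx), len(valid_idx))  # 800 200
```

Compared with k-fold cross validation, each parameter setting is trained only once, which is what makes the search affordable on the largest data sets.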

4.1.3 GBDT

We consider the GBDT implementations in XGBoost (Chen and Guestrin, 2016) and LightGBM (Meng et al., 2016). XGBoost offers the following options for GBDT models:

1. rounds: The number of trees in the GBDT.

2. eta: The learning rate.

3. max_depth: The maximum depth of the tree.

4. min_child_weight: The minimum sum of instance Hessians (h_ik) needed in a child.

5. gamma: The regularization on the number of leaves.

6. lambda: The regularization on the values in the leaves (s^m_k).

LightGBM offers the following options:

1. num_trees: Same as rounds in XGBoost.

2. learning_rate: Same as eta in XGBoost.

3. max_depth: Same as max_depth in XGBoost.

4. min_sum_hessian_in_leaf: Same as min_child_weight in XGBoost.

5. min_gain_to_split: Same as gamma in XGBoost.

6. lambda_l2: Same as lambda in XGBoost.

7. num_leaves: The maximum number of leaves.

8. min_data_in_leaf: The minimum number of samples required to be at a leaf.

9. max_bins: The maximum number of bins.

Both packages also support subsampling the features to decide p and subsampling the instances to reduce the instance size l. In our experiments, we do not use the subsampling options on features or instances. Note that we do not discuss the XGBoost regularization parameters lambda and gamma in this thesis; the details can be found in Chen and Guestrin (2016).

As with random forests, we consider only a few candidates for each parameter. For XGBoost,

rounds = {100},

eta = {0.1, 0.2, 0.3, 0.4, 0.5},

max_depth = {4, 5, 6, 7, 8, 9},

min_child_weight = {0, 1, 2},

gamma = {0, 0.1, 1},

lambda = {0, 1, 10}.

For LightGBM,

num_trees = {100},

learning_rate = {0.1, 0.2, 0.3, 0.4, 0.5},

max_depth = {unlimited},

min_sum_hessian_in_leaf = {0, 1, 2},

min_gain_to_split = {0},

lambda_l2 = {0, 1, 10},

num_leaves = {15, 31, 63, 127, 255, 511},

min_data_in_leaf = {10, 50, 100},

max_bins = {255}.
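To compare the search effort for the two packages, the grids above can be written down as dictionaries and their sizes counted. This is a minimal sketch under stated assumptions: None is an illustrative encoding of an unlimited depth, grid_size is a hypothetical helper, and the key min_data_in_leaf follows the name used in the LightGBM option list.

```python
from math import prod

# Candidate values copied from the XGBoost grid above.
xgb_grid = {
    "rounds": [100],
    "eta": [0.1, 0.2, 0.3, 0.4, 0.5],
    "max_depth": [4, 5, 6, 7, 8, 9],
    "min_child_weight": [0, 1, 2],
    "gamma": [0, 0.1, 1],
    "lambda": [0, 1, 10],
}

# Candidate values copied from the LightGBM grid above.
lgb_grid = {
    "num_trees": [100],
    "learning_rate": [0.1, 0.2, 0.3, 0.4, 0.5],
    "max_depth": [None],  # None encodes "unlimited" (illustrative encoding)
    "min_sum_hessian_in_leaf": [0, 1, 2],
    "min_gain_to_split": [0],
    "lambda_l2": [0, 1, 10],
    "num_leaves": [15, 31, 63, 127, 255, 511],
    "min_data_in_leaf": [10, 50, 100],
    "max_bins": [255],
}


def grid_size(grid):
    """Number of parameter combinations in a grid (hypothetical helper)."""
    return prod(len(candidates) for candidates in grid.values())


print(grid_size(xgb_grid), grid_size(lgb_grid))  # 810 810

# The num_leaves candidates equal 2^d - 1 for the XGBoost depths d = 4, ..., 9.
print([2 ** d - 1 for d in xgb_grid["max_depth"]] == lgb_grid["num_leaves"])  # True
```

Both grids contain 810 settings, so the two packages are searched with the same number of candidate models.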

Because LightGBM has more parameters than XGBoost, we set max_bins, min_gain_to_split and max_depth to their default settings. Note that we do not use the histogram technique or unbalanced trees in XGBoost, but we do use them in LightGBM.

Furthermore, we consider not only the GBDT model, but also the linear boosting model in XGBoost. Its parameters are as follows.

1. rounds: The number of boosting rounds.

2. eta: The learning rate.

3. lambda: The regularization on the linear regression weights w.

For the linear model, we can consider more candidates for each parameter:

rounds = {100},

eta = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},

lambda = {2⁻¹⁰, 2⁻⁹, . . . , 2¹⁰}.
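The candidate lists for the linear model can be generated rather than typed out. The sketch below uses only the standard library; the variable names are illustrative and not taken from XGBoost.

```python
from itertools import product

# eta = 0.1, 0.2, ..., 1.0 (rounded to avoid floating-point noise)
eta_candidates = [round(0.1 * k, 1) for k in range(1, 11)]
# lambda = 2^-10, 2^-9, ..., 2^10
lambda_candidates = [2.0 ** e for e in range(-10, 11)]

linear_settings = [
    {"rounds": 100, "eta": eta, "lambda": lam}
    for eta, lam in product(eta_candidates, lambda_candidates)
]
print(len(linear_settings))  # 10 * 21 = 210 settings
```

The linear grid has 210 settings over two tuned parameters, a much finer resolution per parameter than the tree-based grids.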

Note that when we use XGBoost to train on the data sets kdd10b-raw and url, we use holdout validation with the ratio 4 : 1. We also reduce the search region for kdd10b-raw by leaving gamma, lambda and min_child_weight at their default values.

4.2 Result

We run the experiments on a machine with a 10-core i7-6950X CPU and 128 GB of RAM. LIBLINEAR and LIBSVM are run with a single thread; for the other packages, we use ten threads. We compare the following methods.

SVM_LR : SVM with linear kernel and LR loss.

SVM_RBF : SVM with Gaussian kernel.

RF : Random forests.

XGB_linear : XGBoost using linear regression.

XGB : XGBoost using regression trees.

LightGBM : LightGBM.

| Data set | SVM_LR | SVM_RBF | RF | XGB_linear | XGB | LightGBM |
|---|---|---|---|---|---|---|
| astro-physic | 96.81 | 97.31 | 96.67 | 96.99 | 96.30 | 96.38 |
| cod-RNA | 95.11 | 96.67 | 96.68 | 94.48 | 96.79 | 96.80 |
| covtype | 75.77 | 96.11 | 97.65 | 75.32 | 95.73 | 97.39 |
| ijcnn1 | 91.96 | 98.69 | 98.21 | 92.01 | 98.07 | 98.32 |
| kdd10b-raw | 89.05 | - | 88.77 | 88.07 | 89.05 | 89.43 |
| kdd10b | 88.83 | - | 86.09 | - | - | 87.96 |
| kdd12 | 95.56 | - | 95.55 | - | - | 96.58 |
| MNIST38 | 96.82 | 99.70 | 99.09 | 96.93 | 99.29 | 99.40 |
| news20 | 97.10 | 96.90 | 89.67 | 97.50 | 93.87 | 95.12 |
| rcv1 | 97.57 | - | 97.67 | 97.67 | 97.58 | 98.11 |
| real-sim | 97.60 | 97.82 | 96.17 | 97.87 | 98.43 | 95.43 |
| url | 97.80 | - | 98.97 | 99.47 | 99.35 | 99.64 |
| webspam | 92.35 | 99.26 | 99.03 | 92.58 | 99.23 | 99.27 |
| yahoojp | 92.97 | 93.31 | 92.24 | 93.08 | 93.40 | 92.22 |

Table 4.2: Test accuracy (%)

The results presented in Table 4.2 are test accuracies in percent. Because some data sets are too large for some methods, we use the notation '-' to denote that a result is not available (the training time exceeds 30,000 seconds, or the method runs out of memory). Table 4.3 presents the training time (in seconds) with the best parameters found by the grid search.

From Table 4.2, we group the results into the following cases:

1. linear ≈ kernel ≈ RF ≈ GBDT: astro-physic, real-sim and yahoojp.

2. kernel ≈ RF ≈ GBDT > linear: cod-RNA, ijcnn1, MNIST38 and webspam.

3. linear > kernel ≈ RF ≈ GBDT: kdd10b.

4. linear ≈ RF ≈ GBDT > kernel: kdd10b-raw and rcv1.

5. RF ≈ GBDT > kernel > linear: covtype.

6. linear ≈ kernel > GBDT > RF: news20.

7. GBDT > RF ≈ linear > kernel: kdd12.

8. GBDT > RF > linear > kernel: url.

Here, linear includes SVM_LR and XGB_linear, and GBDT includes XGB and LightGBM.
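The grouping above can be reproduced approximately from the table. The sketch below ranks the four families for two rows of Table 4.2. Representing each family by its best member is an assumption made here for illustration, since the thesis does not state how the families were compared, and the sketch does not model when two accuracies count as ≈. The name rank_families is hypothetical.

```python
# Two rows of Table 4.2 (test accuracy in percent), copied from the table.
accuracy = {
    "covtype": {"SVM_LR": 75.77, "SVM_RBF": 96.11, "RF": 97.65,
                "XGB_linear": 75.32, "XGB": 95.73, "LightGBM": 97.39},
    "news20": {"SVM_LR": 97.10, "SVM_RBF": 96.90, "RF": 89.67,
               "XGB_linear": 97.50, "XGB": 93.87, "LightGBM": 95.12},
}

# Family membership as stated in the text.
families = {
    "linear": ["SVM_LR", "XGB_linear"],
    "kernel": ["SVM_RBF"],
    "RF": ["RF"],
    "GBDT": ["XGB", "LightGBM"],
}


def rank_families(row):
    """Order the families by the accuracy of their best member (assumption)."""
    best = {name: max(row[m] for m in members) for name, members in families.items()}
    return sorted(best, key=best.get, reverse=True)


for data_set, row in accuracy.items():
    print(data_set, " > ".join(rank_families(row)))
# covtype RF > GBDT > kernel > linear
# news20 linear > kernel > GBDT > RF
```

Both orderings agree with cases 5 and 6 of the list, up to the distinction between > and ≈.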

| Data set | SVM_LR | SVM_RBF | RF | XGB_linear | XGB | LightGBM |
|---|---|---|---|---|---|---|
| astro-physic | 2.20 | 2076.63 | 38.74 | 3.50 | 12.22 | 5.57 |
| cod-RNA | 0.30 | 411.17 | 6.33 | 0.76 | 1.56 | 1.61 |
| covtype | 2.28 | 26796.28 | 131.87 | 2.80 | 13.98 | 3.99 |
| ijcnn1 | 0.29 | 45.52 | 7.65 | 0.52 | 1.58 | 0.92 |
| kdd10b-raw | 102.63 | - | 142.08 | 306.73 | 19498.48 | 467.48 |
| kdd10b | 278.19 | - | 16254.93 | - | - | 1306.15 |
| kdd12 | 1110.45 | - | 4054.71 | - | - | 2081.70 |
| MNIST38 | 0.66 | 13.95 | 1.80 | 0.98 | 1.18 | 0.81 |
| news20 | 4.72 | 935.79 | 31.59 | 3.86 | 45.15 | 28.90 |
| rcv1 | 32.27 | - | 1497.72 | 10.75 | 141.55 | 185.86 |
| real-sim | 1.54 | 1370.17 | 12.11 | 1.95 | 8.94 | 4.01 |
| url | 81.48 | - | 2421.92 | 209.98 | 564.11 | 353.44 |
| webspam | 9.40 | 6519.51 | 390.64 | 10.35 | 75.05 | 7.92 |
| yahoojp | 15.39 | 10026.14 | 1911.26 | 13.24 | 48.15 | 20.91 |

Table 4.3: Training time (seconds)

For the training time in Table 4.3, on the other hand, we observe the following cases:

1. kernel ≫ RF > GBDT > linear: astro-physic, cod-RNA, covtype, ijcnn1 and rcv1.

2. kernel ≫ RF ≈ GBDT > linear: news20 and real-sim.

3. kernel ≫ RF > GBDT > XGB_linear > SVM_LR: url.

4. kernel ≫ RF ≈ GBDT ≈ linear: MNIST38.

5. kernel ≫ XGB ≫ LightGBM > XGB_linear > RF ≈ SVM_LR: kdd10b-raw.

6. kernel ≈ XGB ≈ XGB_linear ≫ RF > LightGBM > SVM_LR: kdd10b and kdd12.

7. kernel ≫ RF > XGB > LightGBM ≈ linear: webspam and yahoojp.

We observe that the kernel method (RBF kernel) requires too much training time and is therefore not suitable for large-scale data sets. Among the tree-based methods, LightGBM trains quickly, which indicates that the histogram technique is very useful for GBDT. Random forests are at a disadvantage because of the implementation: we do not know how they would perform if the histogram technique were used with them. Linear models train very fast, which is important for analyzing large-scale data sets, but their accuracy may not exceed that of tree-based models or the kernel method.

Chapter 5 Conclusion

After analyzing the models of random forests and GBDT, we find that the two are fundamentally different. Random forests rely on the law of large numbers, so that the prediction of the model approaches its expectation. GBDT, on the other hand, adds each submodel as a descent direction of its loss function. This difference may be one reason why GBDT performs better than random forests on some sparse data sets. For dense data, their performance is similar. Furthermore, according to our experiments, GBDT with the histogram technique is fast enough for large-scale data sets and achieves good test accuracy. One disadvantage of tree-based methods is parameter selection: the many parameters to tune make the total training time longer.

Bibliography

B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, August 1996.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

L. Breiman, J. Friedman, C. J. Stone, and R. Olshen. Classification and Regression Trees. Chapman and Hall/CRC, first edition, 1984.

C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. URL http://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf.

B.-Y. Chu, C.-H. Ho, C.-H. Tsai, C.-Y. Lin, and C.-J. Lin. Warm start for parameter selection of linear classifiers. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2015. URL http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/warm-start/warm-start.pdf.

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.

M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014. URL http://jmlr.org/papers/v15/delgado14a.html.

J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Annals of Statistics, 28(2):337–407, 2000. doi: 10.1214/aos/1016218223. URL http://dx.doi.org/10.1214/aos/1016218223.

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.

J. H. Friedman and J. J. Meulman. Multiple additive regression trees with application in epidemiology. Statistics in Medicine, 22(9):1365–1381, 2003. ISSN 1097-0258. doi: 10.1002/sim.1501. URL http://dx.doi.org/10.1002/sim.1501.

R. Jin and G. Agrawal. Communication and memory efficient parallel decision tree construction. In Proceedings of the 2003 SIAM International Conference on Data Mining, pages 119–129. SIAM, 2003.

S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003. doi: 10.1162/089976603321891855. URL http://dx.doi.org/10.1162/089976603321891855.

P. Li. Robust logitboost and adaptive base class (ABC) logitboost. CoRR, abs/1203.3491, 2012. URL http://arxiv.org/abs/1203.3491.

Q. Meng, G. Ke, T. Wang, W. Chen, Q. Ye, Z.-M. Ma, and T. Liu. A communication-efficient parallel algorithm for decision tree. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1279–1287. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6381-a-communication-efficient-parallel-algorithm-for-decision-tree.pdf.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, third edition, 2011.
