
Chapter 3 Practical Application

3.6 Predictive Performance

To assess the predictive performance of the BMA method, we use the data partition described in Section 3.3. The training set includes 58,376 trade records, from which 14,360 samples (7,180 of them positive events) are selected via the under-sampling technique of Section 3.3. The testing set contains 14,594 records, 1,795 of which are bad-purchase records. Based on the cut-point obtained by the cross-validation process in Section 3.4, we fit the BMA, stepwise, and complete (full) models on the training set and compare their predictive performance.
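The balanced under-sampling step described above can be sketched as follows. This is an illustrative sketch, not the code used in the thesis; the function name and the record layout (a list of `(features, label)` pairs with label 1 for a bad purchase) are assumptions.

```python
import random

def undersample(records, n_per_class, seed=1):
    """Balanced under-sampling: draw equal numbers of positive
    (bad-purchase) and negative records, as in Section 3.3.
    Each record is a (features, label) pair with label 1 = bad purchase."""
    rng = random.Random(seed)
    pos = [r for r in records if r[1] == 1]
    neg = [r for r in records if r[1] == 0]
    sample = rng.sample(pos, n_per_class) + rng.sample(neg, n_per_class)
    rng.shuffle(sample)
    return sample

# Toy illustration: 20 negatives and 8 positives -> balanced set of 5 + 5
data = [((i,), 0) for i in range(20)] + [((i,), 1) for i in range(8)]
balanced = undersample(data, n_per_class=5)
print(len(balanced))  # 10
```

In the thesis's setting, `n_per_class` would be 7,180, yielding the 14,360-record balanced training sample.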

The training set was used to estimate the relationship between the predictors and bad-purchase events with each of the three methods. We then compared the predictive performance of the three models by computing the F0.5, F1, and F2 scores on the testing set; the precision and recall at the chosen cut-point were also computed on the testing set. Tables 6 to 8 display these measurements for five runs of each method.
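The F-measures used here combine precision and recall through the general F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta > 1 weights recall more heavily and beta < 1 weights precision. A minimal sketch, using the averaged BMA precision and recall from Table 6 as input:

```python
def f_beta(precision, recall, beta):
    """General F-measure: F_beta = (1 + b^2) P R / (b^2 P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Averaged BMA precision/recall at the F1-oriented cut-point (Table 6)
p, r = 0.4561, 0.3541
print(round(f_beta(p, r, 1.0), 4))  # 0.3987
```

Note that the F1 of the averaged precision and recall (about 0.3987) differs slightly from the average of the per-run F1 scores reported in Table 6 (0.3982), since averaging and the F-measure do not commute.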

Because the samples drawn for the training and testing sets differ across runs, it is reasonable to repeat the whole procedure many times and average the results.

Running this procedure repeatedly is computationally intensive; Table 9 presents the averaged results of one hundred simulations.

As Table 9 suggests, BMA performs slightly better whether measured by F0.5, F1, or F2. Because BMA has outstanding precision, its mean F0.5 exceeds those of the other models, and the same holds for the F1 and F2 measures. Although BMA attains the highest means, its standard deviations are not uniformly the smallest: they are smallest for F0.5 and F1, while the full model's F2 standard deviation is marginally smaller.

AUC, which measures a model's discriminative ability, is at a good level for every model; all AUC values fall between 0.70 and 0.75. The stepwise model performs best, followed by the BMA model and then the complete one.
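AUC's interpretation as discriminative ability can be made concrete: it equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one (the Mann-Whitney form; cf. Bradley, 1997). A brute-force sketch with hypothetical toy scores:

```python
def auc(scores_pos, scores_neg):
    """AUC as the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs where the positive record scores
    higher, counting ties as one half."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy scores: a model that mostly ranks positives above negatives
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))  # 8/9, about 0.889
```

The quadratic pairwise loop is fine for a sketch; for the 14,594-record test set, a rank-based computation would be used in practice.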

Often there is a trade-off between precision and recall. For example, if model A yields fewer positive test outcomes than model B at the same cut-point, then model A will have fewer false positives and more false negatives. From the definitions of precision and recall, precision is then likely to increase (not always, since it also depends on the number of true positives) and recall is likely to decrease. Accordingly, we can observe something interesting in our analysis. For this car


data, with the F1-oriented cut-point, the BMA method is best in precision, followed by the complete model and then the stepwise approach.

Conversely, in recall the stepwise approach performs best, followed by the complete model and then BMA. The variances of precision and recall in the complete model are larger than in the BMA and stepwise models. A possible explanation is the following: with similar F1-oriented cut-points (both around 0.6), the BMA model produced fewer positive test outcomes than the stepwise model; that is, fewer test records were judged as kicked vehicles by BMA than by the stepwise model (shown in Table 10).
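The confusion-matrix counts in Table 10 make this concrete; a quick check of the reported precisions (the helper function is illustrative):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Counts from Table 10 (one test run, F1-oriented cut-points)
bma_p, bma_r = precision_recall(tp=647, fp=795, fn=1148)
step_p, step_r = precision_recall(tp=720, fp=1194, fn=1075)
print(round(bma_p, 4), round(step_p, 4))  # about 0.449 and 0.3762, as in Table 10
print(647 + 795, 720 + 1194)              # records flagged positive: 1442 vs 1914
```

BMA flags 1,442 records as kicked vehicles against 1,914 for the stepwise model, which is exactly the mechanism behind its higher precision and lower recall.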

Overall, the predictive performance does not meet our expectation because of the low precision and recall. Possible reasons include the limitations of logistic regression (such as the linearity assumption), the omission of predictor interactions, and our limited car-domain knowledge. BMA, an ensemble-style method, shows better F-measure performance. This is consistent with the statement that "averaging over all the models provides better average predictive ability," as noted by Madigan and Raftery (1994).

Table 6
The Predictive Performance of the BMA Model in Five Runs

                 Run 1   Run 2   Run 3   Run 4   Run 5   Avg
F0.5             0.5050  0.5099  0.5113  0.5055  0.5077  0.5079
F1               0.3949  0.3974  0.4034  0.4007  0.3947  0.3982
F2               0.4695  0.4697  0.4743  0.4745  0.4761  0.4728
AUC              0.7417  0.7361  0.7431  0.7394  0.7429  0.7406
Precision (F1)   0.4382  0.4763  0.4635  0.4340  0.4686  0.4561
Recall (F1)      0.3593  0.3409  0.3571  0.3721  0.3409  0.3541

Table 7
The Predictive Performance of the Complete GLM Model in Five Runs

                 Run 1   Run 2   Run 3   Run 4   Run 5   Avg
F0.5             0.5055  0.4980  0.5050  0.5101  0.5162  0.5070
F1               0.3955  0.3880  0.3919  0.3788  0.3916  0.3892
F2               0.4751  0.4718  0.4728  0.4682  0.4716  0.4719
AUC              0.7428  0.7387  0.7396  0.7362  0.7409  0.7396


Table 8
The Predictive Performance of the Stepwise GLM Model in Five Runs

                 Run 1   Run 2   Run 3   Run 4   Run 5   Avg
F0.5             0.5106  0.5041  0.5067  0.4946  0.4980  0.5028
F1               0.3882  0.3885  0.3874  0.3946  0.3875  0.3892
F2               0.4698  0.4784  0.4745  0.4755  0.4713  0.4739
AUC              0.7359  0.7445  0.7410  0.7445  0.7401  0.7412
Precision (F1)   0.3762  0.3638  0.3575  0.3936  0.3592  0.3701
Recall (F1)      0.4011  0.4167  0.4228  0.3955  0.4206  0.4114

Table 9
The Predictive Performance over 100 Simulations

                 GLM (stepwise)     GLM (full)         BMA
                 Mean     SD        Mean     SD        Mean     SD
F0.5             0.5044   0.0062    0.5052   0.0053    0.5056   0.0052
F1               0.3908   0.0056    0.3904   0.0049    0.3966   0.0039
F2               0.4738   0.0028    0.4735   0.0026    0.4740   0.0027
AUC              0.7415   0.0035    0.7388   0.0027    0.7411   0.0030
Precision (F1)   0.3698   0.0156    0.3993   0.0487    0.4533   0.0202
Recall (F1)      0.4117   0.0130    0.3868   0.0334    0.3574   0.0128

Table 10
Confusion Matrices at the F1-Oriented Cut-Points (One Test Run)

BMA model
Test outcome      Actual bad   Actual good
Positive          647          795
Negative          1148         12004
Precision         0.4486

Stepwise model
Test outcome      Actual bad   Actual good
Positive          720          1194
Negative          1075         11605
Precision         0.3762

4 CONCLUDING REMARKS

For data analysis, Breiman (2001) describes two main cultures: one focused on prediction and the other on explanation. In this thesis we focus on the predictive goal in a binary classification problem. Logistic regression, a common and simple model, was used and discussed.

In traditional logistic regression, analysts may confront issues such as the following. A model-selection procedure like the stepwise approach leads to a single 'optimal' model, and inferences are then made from this final model. However, this practice ignores model uncertainty. Besides model uncertainty, near-identical selection criteria are another issue. A handy example demonstrating this dilemma is the primary biliary cirrhosis example noted in Raftery's paper: the BIC values of the models are similar, yet their performances are very different.

To apply the Bayesian model averaging method to logistic regression, we set equal prior probabilities for all candidate models, reflecting the absence of expert information. Additionally, Occam's window and the Laplace method were employed to make the computation efficient. In this thesis we applied the BMA model, the traditional stepwise model, and the full model to the practical CARVANA data. Since we sought to enhance the precision of prediction, the under-sampling technique and the cut-point simulation were also used with each model. As the predictive results suggest, whichever F-measure we choose, BMA slightly outperforms the stepwise method and the full model on the CARVANA data. Aside from the F-measures, with the cut-point found by the F1 measure, BMA yields the better precision, whereas the stepwise method yields the better recall. After examining the test results over several simulations, the main reason for this is that fewer test records were judged as kicked vehicles by BMA than by the stepwise model on the CARVANA data.
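The weighting scheme just described can be sketched as follows: under equal model priors, the Laplace/BIC approximation gives posterior model probabilities proportional to exp(-BIC/2), and Occam's window discards models whose posterior odds against the best model are too long. The BIC values below are hypothetical, and this is a sketch of the idea, not the "BMA" package's implementation.

```python
import math

def posterior_weights(bics, occam_c=20.0):
    """Approximate posterior model probabilities under equal model priors,
    using the Laplace/BIC approximation p(M_k | D) proportional to
    exp(-BIC_k / 2).  Occam's window drops models whose posterior odds
    against the best model exceed occam_c (a common choice is 20)."""
    best = min(bics)
    raw = [math.exp(-(b - best) / 2.0) for b in bics]        # shifted for stability
    kept = [w if w >= 1.0 / occam_c else 0.0 for w in raw]   # Occam's window
    total = sum(kept)
    return [w / total for w in kept]

# Hypothetical BIC values for four candidate models (not from the thesis)
weights = posterior_weights([1500.2, 1501.0, 1504.8, 1520.3])
print([round(w, 3) for w in weights])  # [0.565, 0.379, 0.057, 0.0]
```

A BMA prediction then averages the per-model predicted probabilities with these weights: p(y | D) = sum over k of w_k * p(y | M_k, D).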

In the CARVANA example, besides the small improvement in predictive performance from the BMA method, most of the significant point estimates in the stepwise model are also contained in the BMA model. This result speaks to a possible advantage of BMA, since it takes the plausible models into consideration.

As these models suggest, the vehicle's year and odometer reading have significant positive effects on the occurrence of a bad purchase. The acquisition average price and the cost paid at time of purchase are the only significant


predictors among the acquisition-price indexes, and both have negative effects. The other predictors are only partially supported: most are partially significant in the BMA model yet show strong evidence in the stepwise one. This matches the claim that p-values arguably overstate the evidence for an effect even when there is no model uncertainty (Berger and Delampady, 1987). In addition to the predictive aim, BMA serves as an appropriate exploratory tool for the explanatory variables. Input variables that remain only partially in the BMA model suggest a weak association with the response, whereas the stepwise model shows no evidence for them. Another clinical example demonstrating this claim can be found in Wang, Zhang and Bakhai (2004).

Consequently, with the approaches presented in this thesis, such as the cut-point simulation, an analyst can implement a BMA model straightforwardly. There is no need to choose a single model; it is possible to average the predictions from several models. Therefore, for practical classification problems, the BMA method can achieve better prediction accuracy than a single model and dilute the concern about model uncertainty.

There are some issues we may consider and modify in the future. First, without expert knowledge of used vehicles, the precision in our example (only about 0.4) does not meet our expectation. If one is eager to enhance the precision of prediction, feature extraction guided by domain knowledge will be a critical step.

Second, the setting of the prior distribution may influence the predictive performance. Next, the computation in the "BMA" package approximates posterior probabilities by the Laplace method for simplicity; accurate posterior model probabilities can be computed by MCMC, and this approximation may affect performance. Another concern is that we analyzed only one practical data set; more data sets should be examined to confirm the improvement from the BMA method.

Monteith et al. (2011) present Bayesian model combination (BMC), in which the model weights are drawn from a Dirichlet distribution; BMC can be viewed as a correction to BMA. Although BMC is somewhat more computationally expensive than BMA, it is worth trying and comparing in the future.
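A minimal sketch of the BMC idea under stated assumptions (toy predicted probabilities, likelihood-weighted averaging over sampled Dirichlet weight vectors); this is an illustration of the concept, not Monteith et al.'s exact algorithm:

```python
import math
import random

def dirichlet(rng, alpha, k):
    """Draw a symmetric Dirichlet(alpha) weight vector of length k."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def bmc_predict(model_probs, labels, n_draws=500, alpha=1.0, seed=7):
    """Toy Bayesian model combination: sample weight vectors from a
    Dirichlet prior, score each weighted combination by its likelihood
    on labelled data, and average the combinations' predictions using
    those likelihoods as weights.  model_probs[k][i] is model k's
    predicted P(bad purchase) for record i; probabilities must lie
    strictly between 0 and 1."""
    rng = random.Random(seed)
    k, n = len(model_probs), len(labels)
    total, combined = 0.0, [0.0] * n
    for _ in range(n_draws):
        w = dirichlet(rng, alpha, k)
        p = [sum(w[m] * model_probs[m][i] for m in range(k)) for i in range(n)]
        lik = math.exp(sum(math.log(pi if y == 1 else 1.0 - pi)
                           for pi, y in zip(p, labels)))
        total += lik
        combined = [c + lik * pi for c, pi in zip(combined, p)]
    return [c / total for c in combined]

# Two hypothetical models' predicted probabilities for three records
avg = bmc_predict([[0.8, 0.3, 0.6], [0.6, 0.4, 0.7]], [1, 0, 1])
print([round(p, 3) for p in avg])
```

Unlike BMA, which fixes the weights at their posterior values, BMC averages over many candidate weightings, at the cost of the extra sampling loop noted above.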

Last but not least, computing efficiency may be another concern for analysts. It took about 20 seconds to fit one BMA model and 8 minutes for the whole BMA comparison program; one stepwise model took 50 seconds, with 13 minutes for the whole stepwise comparison; and the full model took only 1 second per fit and 90 seconds for the whole process.

REFERENCES

Berger, J. O., and Delampady, M. (1987). Testing precise hypotheses. Statistical Science 2, 317-335.

Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30, 1145-1159.

Breiman, L. (2001). Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statistical Science 16, 199-231.

Chen, C., Liaw, A., and Breiman, L. (2004). Using random forest to learn imbalanced data. Technical report, University of California, Berkeley.

Cortes, C., and Mohri, M. (2004). Confidence intervals for the area under the ROC curve. In Advances in Neural Information Processing Systems (NIPS).

Dickinson, J. P. (1973). Some statistical results in the combination of forecasts. Journal of the Operational Research Society 24, 253-260.

Dietterich, T. G. (2000). Ensemble methods in machine learning. In Multiple Classifier Systems, 1-15. Springer.

Draper, D. (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society, Series B (Methodological) 57, 45-97.

Fernandez, C., Ley, E., and Steel, M. F. J. (2001). Benchmark priors for Bayesian model averaging. Journal of Econometrics 100, 381-427.

Fleming, T. R., and Harrington, D. P. (1991). Counting Processes and Survival Analysis. John Wiley & Sons, Hoboken, NJ.

Grambsch, P. M., Dickson, E. R., Kaplan, M., Lesage, G., Fleming, T. R., and Langworthy, A. L. (1989). Extramural cross-validation of the Mayo primary biliary cirrhosis survival model establishes its generalizability. Hepatology 10, 846-850.

Hoeting, J., Raftery, A. E., and Madigan, D. (1996). A method for simultaneous variable selection and outlier identification in linear regression. Computational Statistics & Data Analysis 22, 251-270.

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: a tutorial. Statistical Science 14, 382-401.

Kass, R. E., and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association 90, 773-795.


Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1137-1145.

Leamer, E. E. (1978). Specification Searches: Ad Hoc Inference with Nonexperimental Data. Wiley, New York.

Lemeshow, S., and Hosmer, D. (2000). Applied Logistic Regression (Wiley Series in Probability and Statistics), 2nd edition. Wiley-Interscience.

Madigan, D., and Raftery, A. E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association 89, 1535-1546.

Madigan, D., York, J., and Allard, D. (1995). Bayesian graphical models for discrete data. International Statistical Review 63, 215-232.

Markus, B. H., Dickson, E. R., Grambsch, P. M., et al. (1989). Efficacy of liver transplantation in patients with primary biliary cirrhosis. New England Journal of Medicine 320, 1709-1713.

Clyde, M. A. (1999). Bayesian model averaging and model search strategies. Bayesian Statistics 6, 157.

Miller, A. (2002). Subset Selection in Regression. CRC Press.

Monteith, K., Carroll, J. L., Seppi, K., and Martinez, T. (2011). Turning Bayesian model averaging into Bayesian model combination. In The 2011 International Joint Conference on Neural Networks (IJCNN), 2657-2663. IEEE.

Posada, D., and Buckley, T. R. (2004). Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Systematic Biology 53, 793-808.

Raftery, A. E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83, 251-266.

Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. (2005). Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review 133, 1155-1174.

Raftery, A. E., Madigan, D., and Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association 92, 179-191.

Regal, R. R., and Hook, E. B. (1991). The effects of model selection on confidence intervals for the size of a closed population. Statistics in Medicine 10, 717-721.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological) 36, 111-147.

Tierney, L., and Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association 81, 82-86.

Tsao, A. C. (2014). A statistical introduction to ensemble learning methods. Journal of the Chinese Statistical Association (中國統計學報) 52, 115-132.

Volinsky, C. T., Madigan, D., Raftery, A. E., and Kronmal, R. A. (1997). Bayesian model averaging in proportional hazard models: assessing the risk of a stroke. Journal of the Royal Statistical Society, Series C (Applied Statistics) 46, 433-448.

Wang, D., Zhang, W., and Bakhai, A. (2004). Comparison of Bayesian model averaging and stepwise methods for model selection in logistic regression. Statistics in Medicine 23, 3451-3467.

Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology 44, 92-107.

Zhu, M. (2004). Recall, precision and average precision. Technical report, Department of Statistics and Actuarial Science, University of Waterloo.


