
The primary purpose of this thesis is to make forecasts for the future, so how to evaluate prediction performance becomes an issue. In this case, a bad purchase record serves as a "positive" and a good purchase record as a "negative".

In classification problems, two terms are used frequently: PRECISION and RECALL.

The precision of a model is the number of true positives divided by the total number of units labeled as belonging to the positive class. The recall, on the other hand, is the number of true positives divided by the number of units that actually belong to the positive class (see Figure 1: Precision and Recall (Sensitivity)). Finally, the accuracy is the sum of true positives and true negatives divided by the total population:

precision = Σ True positive / (Σ True positive + Σ False positive)

recall = Σ True positive / (Σ True positive + Σ False negative)

accuracy = (Σ True positive + Σ True negative) / Σ Total population

(12)

Aside from these fundamental assessment measures, the F-measure is also applied to classification problems, since it is difficult to choose between high precision and good recall. In short, the F-measure is a harmonic mean of precision and recall. It not only considers precision and recall together but also adjusts the weight according to the loss ratio of false negatives to false positives. Setting different weights β for the F-measure can change how the models' performance compares. The function is

F_β = (1 + β²) ⋅ precision ⋅ recall / (β² ⋅ precision + recall)    (13)

F₂ (β = 2) is used when recall is emphasized rather than precision.
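The measures above can be sketched in code. The thesis's computations were done in R; the following is only an illustrative Python sketch, with function names of my own choosing:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive = bad purchase)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def precision_recall_accuracy(y_true, y_pred):
    """Equation (12): precision, recall, and accuracy from the counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy

def f_beta(y_true, y_pred, beta=1.0):
    """Equation (13): F-beta, the weighted harmonic mean of precision and recall."""
    p, r, _ = precision_recall_accuracy(y_true, y_pred)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

For example, with y_true = [1, 1, 1, 0, 0, 0] and y_pred = [1, 1, 0, 1, 0, 0], precision and recall are both 2/3 and accuracy is 4/6.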

The final measure used to compare the models is the AUC, defined as the area under the Receiver Operating Characteristic (ROC) curve. Consider a binary classification task with m positive examples and n negative ones, and let x_1, …, x_m be the outcomes of the model on the positive examples and y_1, …, y_n its outcomes on the negative examples. The AUC is given by

AUC = (1 / mn) Σ_i Σ_j 1(x_i > y_j)    (14)

which reflects the quality of classification. It serves as a measure based on pairwise comparisons between the classifications of the two groups, and can be read as an estimate of the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative one. With a perfect ranking, the AUC equals 1, since all positive cases are ranked higher than the negative ones. The expected AUC for a random ranking model is 0.5. The performance standard suggested by Hosmer and Lemeshow (2000) is shown in Table 2.
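Equation (14) can be computed directly from the two groups of model outcomes. A minimal Python sketch (the half-credit for ties is a common convention, not something stated in the text):

```python
def auc_pairwise(pos_scores, neg_scores):
    """AUC per equation (14): the fraction of (positive, negative) pairs
    in which the positive example receives the higher score."""
    m, n = len(pos_scores), len(neg_scores)
    wins = 0.0
    for x in pos_scores:          # outcomes on the m positive examples
        for y in neg_scores:      # outcomes on the n negative examples
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5       # ties get half credit (convention)
    return wins / (m * n)
```

A perfect ranking gives 1; for instance, auc_pairwise([0.9, 0.8], [0.7, 0.1]) returns 1.0.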


3.1 THE APPLICATION OF CARVANA CASE


There are many data-analysis competitions, and big-data analytics has become prevalent recently. One competition website, Kaggle (http://www.kaggle.com), has long experience hosting them. For a binary classification problem, we searched Kaggle's previous competitions and found the CARVANA data interesting, with a sufficiently large number of observations. CARVANA (http://www.carvana.com/), touted as a 100% online American car-buying company that lets potential buyers browse, finance, and purchase a car online, offers a data set of used-car auction records. Its patented 360-degree photo system, which captures the interior and exterior of each car, is what impresses customers on the CARVANA website.

Buyers of used cars often confront a risk of fraud. The community calls these unexpected bad purchases "kicks". With the data provided by CARVANA, we sought to construct a model that helps prevent cars with issues from being sold.

These car records constitute class-imbalanced data, meaning that the sample size of one class dominates the others. In this case, the binary response variable

TABLE 2
The distinguishing ability of the model via the AUC measure

AUC	Distinguishing ability of the model
AUC < 0.6	Poor
0.6 < AUC < 0.7	Normal
0.7 < AUC < 0.8	Good
0.8 < AUC < 0.9	Great
AUC > 0.9	Amazing


named IsBadBuy is highly unbalanced: the majority class (good purchases) accounts for 64,007 records, while the minority class (bad purchases) has 8,976 samples.

3.2 DATA PREPROCESSING

The raw data contained 72,983 trade records with 34 variables. Following the complete-data rule, observations with too many null values were eliminated, leaving 72,970 trade records for the following analysis. Before feature extraction, the binary response variable and 33 predictive features were collected from the CARVANA data.

After observing the data inputs and exploring their influence on the response, we found that not all predictive features are useful. Features such as RefID, PurchDate, Auction, VehYear, Model, Trim, Color, Make, PRIMEUNIT, AUCGUART, BYRNO, and VNST were eliminated because they contain too many missing values, duplicate information carried by other features, or a single identical value throughout. As a result, we selected 17 predictive features for the final data set (shown in Table 3).

Finally, mistyped values in some features were re-calibrated; for example, "Covers" was sometimes written as "COVER". Continuous variables were standardized to enhance the quality of prediction, and all categorical variables were re-coded as factors (dummy variables) to meet the software's requirements.

In this process, 72,970 trade records were used to construct the CARVANA model via 17 predictive features, comprising both categorical and continuous data.
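The preprocessing steps above (re-calibrating mistyped categories, standardizing continuous variables, and dummy-coding categorical ones) might look as follows. The thesis used R; this is an illustrative Python/pandas sketch, and the tiny frame merely stands in for the 72,970-record data set:

```python
import pandas as pd

# Toy stand-in for the CARVANA records (column names follow Table 3).
df = pd.DataFrame({
    "WheelType": ["Alloy", "COVER", "Covers", "Special"],
    "VehBCost": [7100.0, 7600.0, 4900.0, 5400.0],
})

# 1. Re-calibrate mistyped category values, e.g. "COVER" -> "Covers".
df["WheelType"] = df["WheelType"].replace({"COVER": "Covers"})

# 2. Standardize continuous variables (zero mean, unit variance).
df["VehBCost"] = (df["VehBCost"] - df["VehBCost"].mean()) / df["VehBCost"].std()

# 3. Re-code categorical variables as dummy variables (factors).
df = pd.get_dummies(df, columns=["WheelType"])
```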


TABLE 3
Variable Description

Field Name	Definition
IsBadBuy	Identifies if the car is the kicked vehicle
VehYear	The manufacturer's year of the vehicle
Transmission	Vehicle's transmission type (Automatic, Manual)
WheelTypeID	The type id of the vehicle wheel
WheelType	The vehicle wheel type description (Alloy, Covers, Null, Special)
Veh0do	The vehicle's odometer reading
Nationality	The manufacturer's country (American, Other, Other Asian, Top Line Asia)
MMRAcquisitionAuctionAveragePrice	Acquisition price for this vehicle in average condition at time of purchase
MMRAcquisitionAuctionCleanPrice	Acquisition price for this vehicle in above-average condition at time of purchase
MMRAcquisitionRetailAveragePrice	Acquisition price for this vehicle in the retail market in average condition at time of purchase
MMRAcquisitonRetailCleanPrice	Acquisition price for this vehicle in the retail market in above-average condition at time of purchase
MMRCurrentAuctionAveragePrice	Acquisition price for this vehicle in average condition as of current day
MMRCurrentAuctionCleanPrice	Acquisition price for this vehicle in above-average condition as of current day
MMRCurrentRetailAveragePrice	Acquisition price for this vehicle in the retail market in average condition as of current day
MMRCurrentRetailCleanPrice	Acquisition price for this vehicle in the retail market in above-average condition as of current day
VehBCost	Acquisition cost paid for the vehicle at time of purchase
IsOnlineSale	Identifies if the vehicle was originally purchased online
WarrantyCost	Warranty price (term = 36 months and mileage = 36K)

3.3 DATA PARTITION AND SAMPLING

As Figure 2 reveals, we randomly split the data set into a training set and a test set so that we could more easily evaluate performance and the over-fitting issue. Following the 80-20 rule, the training set contained 58,376 trade records and the test set 14,594. Additionally, since positive responses (bad-purchase records) account for only 13% of all records, we kept a similar positive ratio in the training and test sets.
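The stratified 80-20 split described above can be sketched as follows (an illustrative Python sketch; the function name is my own):

```python
import random

def stratified_split(labels, test_frac=0.2, seed=1):
    """Randomly split indices 80/20 while keeping a similar positive
    ratio in the training and test sets, as described above."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    rng.shuffle(pos)
    rng.shuffle(neg)
    test = set(pos[:int(len(pos) * test_frac)] + neg[:int(len(neg) * test_frac)])
    train = [i for i in range(len(labels)) if i not in test]
    return train, sorted(test)
```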

After the data-partition step, the training set was split into five parts (shown in Figure 2). In this case, we applied 5-fold cross-validation to find the decision-making cut point; each part of the training set therefore had 11,675 records.

As described in 3.1, the CARVANA data is highly class-imbalanced: the number of observations in the majority group is quite different from that in the minority group. This imbalance would influence some predictive performance measures. An under-sampling approach was therefore employed to enhance the precision of the model's predictions and to avoid degenerate results (all data classified as "good car"). We removed part of the majority samples to strike a rough balance between positive and negative cases; in other words, the ratio of positives to negatives became 1:1.
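The under-sampling step above can be sketched as (illustrative Python; the thesis's computation was done in R):

```python
import random

def undersample_majority(labels, seed=7):
    """Keep every minority (positive) index and randomly keep only as
    many majority (negative) indices, giving the 1:1 ratio described above."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    return sorted(pos + rng.sample(neg, len(pos)))
```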


Figure 2 : Data Partition and Cross-Validation (the data set is split into a training set (80%) and a test set (20%); the training set is further divided into five cross-validation folds of 1/5 each)


3.4 DECISION-MAKING CUT POINT

As mentioned in Chapter 2, the cut point can be modified according to the requirement; here we discuss how to choose it. After we constructed the models and derived the probabilities of positive outcomes, each case was classified as a bad purchase or a good deal. In general, classification decisions are made with a probability threshold of 0.5. However, this approach might tip the balance (all cases judged to belong to the majority group) or fail to reach the best performance.

As Stone (1974) advocates, the cross-validation technique can be used to obtain better predictive accuracy. Applied to the five-fold cross-validation data set, a simulation technique can be implemented to find the most appropriate cut point for these data, rather than using 0.5. The steps of our procedure are as follows:

To begin with, we split the training data into five parts and constructed five models, each from a different subset. For example, the first part was tested by the model based on the second to fifth parts of the training data; the second part was tested by the model constructed on the 1st, 3rd, 4th and 5th parts; and the 3rd, 4th and 5th parts were handled in the same way. In this step, one model was derived for each part.

Next, the threshold probability was varied from 0.01 to 0.99. For each threshold, a classification matrix was built by comparing the predicted probabilities with the threshold, and the five classification matrices were combined into one complete training-set classification matrix.

Finally, the F-measure was calculated from the complete classification matrix. From this simulation, the cut point yielding the highest F-measure could be chosen.
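The three steps above amount to a grid search over thresholds: pool the cross-validated predictions of the five folds into one classification matrix per threshold and keep the threshold with the highest F-measure. An illustrative Python sketch:

```python
def best_cut_point(fold_probs, fold_labels, beta=1.0):
    """Grid-search thresholds 0.01..0.99; for each, pool every fold's
    predictions into one classification matrix, compute F-beta, and
    return the threshold with the highest F-beta."""
    best_t, best_f = 0.5, -1.0
    for k in range(1, 100):
        t = k / 100.0
        tp = fp = fn = 0
        for probs, labels in zip(fold_probs, fold_labels):  # one pair per fold
            for p, y in zip(probs, labels):
                pred = 1 if p >= t else 0
                if pred == 1 and y == 1:
                    tp += 1
                elif pred == 1:
                    fp += 1
                elif y == 1:
                    fn += 1
        if tp == 0:
            continue  # F-measure undefined without true positives
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        f = (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```

For a single toy fold with probabilities [0.2, 0.4, 0.6, 0.8] and labels [0, 0, 1, 1], any threshold in (0.4, 0.6] separates the classes perfectly, so the search returns the first such threshold, 0.41, with an F-measure of 1.0.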

In this thesis, we applied three models to the CARVANA data: the BMA model, the full generalized linear model, and the stepwise generalized linear model.

For the BMA model, as Figures 3 to 5 demonstrate, the cut points were set to 0.64, 0.82 and 0.42, respectively. For the full generalized linear model, as Figures 6 to 8 suggest, the cut points were set to 0.64, 0.82 and 0.41. For the stepwise generalized linear model, as Figures 9 to 11 suggest, the cut points were set to 0.60, 0.83 and 0.40,


respectively. Based on these cut points, we employed them to assess predictive performance under the different measures.

With simulated cut points for each model, we could improve predictive performance and make better decisions than with the traditional 0.5 threshold.


Figure 3. Simulation of cut point in BMA model (F1-measure)

Figure 4. Simulation of cut point in BMA model (F0.5-measure)



Figure 5. Simulation of cut point in BMA model (F2-measure)

Figure 6. Simulation of cut point in full GLM model (F1-measure)



Figure 7. Simulation of cut point in full GLM model (F0.5-measure)

Figure 8. Simulation of cut point in full GLM model (F2-measure)



Figure 9. Simulation of cut point in Stepwise GLM model (F1-measure)

Figure 10. Simulation of cut point in Stepwise GLM model (F0.5-measure)





Figure 11. Simulation of cut point in Stepwise GLM model (F2-measure)



3.5 BAYESIAN MODEL AVERAGING ANALYSIS

After the sampling and cut-point simulation steps, we applied the BMA method to bad-car forecasting. Within the data set, 17 predictors are available, and the number of input variables is 21 including dummy variables. Interactions are ignored here, since computing all the Bayesian combinations would take too much time. Via the bic.glm command in the "BMA" package in R, 4,194,304 (2^22) model combinations were taken into consideration for the BMA computation. With the Occam's window rule mentioned before, the 9 models with the highest posterior model probabilities (cumulative posterior probability 0.999) were selected for the final BMA model set; among them, the highest posterior probability is 0.476. Each BMA sub-model contains 5 to 8 predictive variables. Besides the BMA model, the stepwise model contains 12 input variables, and the complete model without selection includes 21 predictive variables. The means of the input variables and the selection results are shown in Table 4.

As Table 5 reveals, P(βj ≠ 0 | D) denotes the posterior probability that the parameter's effect is nonzero given the data. It can be derived by computing the proportion of posterior model probability accounted for by the models containing that predictive variable. The posterior means and posterior standard deviations of the coefficients are also attached for the BMA model. The effect of a particular predictive variable on the likelihood of the event occurring can be expressed as follows:

P(βj ≠ 0 | D) = Σ_{Mk∈A} 1(βj ∈ Mk) ⋅ P(Mk | D)    (15)

where A is the set of models selected by Occam's window and 1(βj ∈ Mk) is the indicator function, which equals 1 when βj is in model Mk and 0 otherwise.

The interpretation of the posterior probability P(βj ≠ 0 | D) is recommended to follow the rules provided by Kass and Raftery (1995): a posterior probability below 50% is evidence against the effect, 50-75% counts as weak evidence, 75-95% as positive evidence, 95-99% as strong evidence, and above 99% as very strong evidence.

Next, the Bayesian point estimate (posterior mean) and standard error (posterior standard deviation) of βj are given by

E(βj | D) = Σ_{Mk∈A} β̂j(Mk) ⋅ P(Mk | D)    (16)

SD(βj | D)² = Σ_{Mk∈A} [Var(βj | D, Mk) + β̂j(Mk)²] ⋅ P(Mk | D) − E(βj | D)²

where β̂j(Mk) denotes the estimate of βj under model Mk.
The difference between the BMA method and the stepwise method is that the BMA point estimates combine all possible effects, weighted by the posterior model probabilities.
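Equations (15) and (16) amount to simple weighted sums over the models in Occam's window. A minimal Python sketch of the inclusion probability and the model-averaged mean (the toy model set and values are hypothetical):

```python
def bma_summary(models, var):
    """models: list of (posterior_model_prob, {variable: coefficient})
    pairs from Occam's window. Returns P(beta_var != 0 | D), i.e. the
    summed probability of the models containing the variable, and the
    model-averaged posterior mean of its coefficient."""
    incl_prob = sum(p for p, coefs in models if var in coefs)
    post_mean = sum(p * coefs.get(var, 0.0) for p, coefs in models)
    return incl_prob, post_mean
```

For instance, with three hypothetical sub-models weighted 0.5, 0.3 and 0.2, where the third omits VehYear, the inclusion probability is 0.8 and the averaged coefficient is the probability-weighted sum of the per-model estimates.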

There are fifteen coefficient estimates in the final BMA model: the intercept, VehYear, WheelTypeID, WheelTypeDUMMY1, WheelTypeDUMMY2, WheelTypeDUMMY3, Veh0do, NationalityDUMMY2, MMRAcquisitionAuctionAveragePrice, MMRAcquisitionAuctionCleanPrice, MMRAcquisitonRetailCleanPrice, MMRCurrentRetailCleanPrice, VehBCost, IsOnlineSale, and WarrantyCost.

Aside from the BMA model, we applied the step command in R to build the stepwise logistic model, in which the final model is selected by the Akaike information criterion (AIC) in a stepwise procedure. There are thirteen coefficient estimates in the stepwise model, with attached P-values all less than 0.1. In this command, AIC is defined as −2 log L + k ⋅ edf for large n, where L is the likelihood, edf is the number of parameters in the model, and k is the penalty term,


Table 4
Models in different model selection strategies

(columns: the nine Bayesian Model Average sub-models, the stepwise model, and the complete model; T marks a selected variable)

Variable	Mean	Selected
WheelTypeID	2.8022	T T T
WheelType
IsOnlineSale	0.0224	T T T
WarrantyCost	0.0489	T T T
No. of variables	6 5 7 7 8 7 7 7 7	12	21
Posterior model probability	0.476 0.147 0.125 0.066 0.046 0.044 0.035 0.033 0.027

Table 5
Comparison of parameter estimates for the variables in different models

Variable	BMA posterior mean	BMA posterior SD	P(βj ≠ 0 | D) %	Stepwise coefficient	Pr(>|z|)	Complete coefficient	Pr(>|z|)
Intercept	-0.6234	0.7850	100	-2.17921	<0.0001	-1.8795	0.0178
VehYear	0.4408	0.0233	100	0.42351	<0.0001	0.4403	<0.0001
Transmission	0			 	 	-0.0394	0.6863
WheelTypeID	0.1459	0.3868	12.5	1.07055	<0.0001	0.9094	0.0199
WheelTypeDUMMY1	-0.2665	0.0571	85.3	-1.22247	<0.0001	-1.0227	0.0092
WheelTypeDUMMY2	3.0679	0.1152	87.5	 	 	0.7321	0.5313
WheelTypeDUMMY3	-0.2582	0.6873	12.5	-1.92850	<0.0001	-1.8612	0.0199
Veh0do	0.1158	0.0209	100	0.11503	<0.0001	0.1055	<0.0001
NationalityDUMMY1	0			0.81518	0.0154	0.5536	0.0893
NationalityDUMMY2	0.0060	0.0305	4.4	 	 	0.0922	0.1444
NationalityDUMMY3	0			 	 	0.0579	0.5152
MMRAcquisitionAuctionAveragePrice	-0.1186	0.0220	100	-0.12155	<0.0001	-0.0697	0.0268
MMRAcquisitionAuctionCleanPrice	-0.0015	0.0090	3.3	 	 	-0.0333	0.2427
MMRAcquisitionRetailAveragePrice	0			 	 	-0.0117	0.6693
MMRAcquisitonRetailCleanPrice	-0.0057	0.0228	7.3	-0.06157	0.0269	-0.0903	0.0015
MMRCurrentAuctionCleanPrice	0.0042	0.0202	4.6	0.05629	0.0465	0.0878	0.0023
VehBCost	-0.0985	0.0214	100	-0.12334	<0.0001	-0.1087	<0.0001
IsOnlineSale	-0.0069	0.0468	6.6	-0.29550	0.0204	-0.3054	0.0192

(blank cells: variable not included in that model)

which equals 2, corresponding to the traditional AIC. Table 5 also gives the maximum likelihood parameter estimates of the single model selected by the stepwise procedure: thirteen coefficient estimates in all.
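The criterion used by R's step procedure, quoted above, is a one-liner; a sketch with the penalty k left as a parameter:

```python
def aic(log_likelihood, edf, k=2.0):
    """AIC = -2 * log(L) + k * edf; with k = 2 this is the traditional
    AIC used by the stepwise selection described above."""
    return -2.0 * log_likelihood + k * edf
```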

In these two models, four of the 21 input variables (VehYear, Veh0do, MMRAcquisitionAuctionAveragePrice and VehBCost) are significantly related to the response variable. The Bayesian posterior probabilities and the P-values of the stepwise logistic regression model match and present strong evidence (P(βj ≠ 0 | D) > 0.99 and P-values < 0.0001, respectively). Six predictors (Transmission, NationalityDUMMY3, MMRAcquisitionRetailAveragePrice, MMRCurrentAuctionAveragePrice, MMRCurrentAuctionCleanPrice and MMRCurrentRetailAveragePrice) show no relationship with the occurrence of bad purchases in either the BMA or the stepwise method.

Seven predictive variables (WheelTypeID, WheelTypeDUMMY1, WheelTypeDUMMY3, MMRAcquisitonRetailCleanPrice, MMRCurrentRetailCleanPrice, IsOnlineSale, WarrantyCost) show a weaker relationship with the response variable via Bayesian posterior probabilities than via P-values. Three predictive variables (WheelTypeDUMMY2, NationalityDUMMY2, MMRAcquisitionAuctionCleanPrice) partially support an effect on the occurrence of bad purchases in the BMA approach. Only one input variable (NationalityDUMMY1) appears partially in the stepwise model.

In logistic regression, the coefficients of the predictors reveal their importance and effects. For example, our results suggest that, considering all possible BMA sub-models, the Bayesian posterior mean for VehYear is 0.4408, meaning that a one-standardized-unit increase in vehicle year raises the odds of a bad-car purchase by about 55 percent (= exp(0.4408) − 1). The main predictors selected by both the BMA and stepwise approaches show similar parameter estimates, and these close point estimates convey similar exploratory information about the predictors.
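The 55-percent figure follows directly from exponentiating the posterior mean:

```python
import math

beta_vehyear = 0.4408                      # BMA posterior mean for VehYear (Table 5)
odds_ratio = math.exp(beta_vehyear)        # multiplicative change in the odds
pct_increase = (odds_ratio - 1.0) * 100.0  # ~55% increase in the odds of a bad purchase
```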

Additionally, the variables that appear only partially in the BMA method speak to a possible advantage of BMA: even when the relationship between a predictive variable and the response is weak, the BMA method takes it into account.

3.6 PREDICTIVE PERFORMANCE

To assess the predictive performance of the BMA method, we employed the data partition described in Section 3.3. The training set includes 58,376 trade records, from which 14,360 samples (7,180 of them positive events) were selected via the under-sampling technique of Section 3.3. The test set contains 14,594 records, 1,795 of which are bad-purchase records. Using the cut points simulated by the cross-validation process of Section 3.4, we applied the training data to construct the BMA, stepwise and complete models and compared their predictive performance.

The training data were used to identify the relationships between the predictors and bad-purchase events with these three analysis methods. We then compared the predictive performance of the three models by calculating the F0.5, F1 and F2 scores on the test set. The precision and recall for each model on the test data are also reported in Tables 6 to 8, which display the detailed measurements for five runs of each model.

Because the samples drawn for the training and test sets differ from run to run, it is reasonable to repeat these modeling procedures many times and average the results. We therefore ran the computation repeatedly; Table 9 presents the averaged results of one hundred computer simulations.

As Table 9 suggests, whether by the F0.5, F1 or F2 measure, the BMA model has slightly better predictive performance. Since BMA has outstanding precision, its mean F0.5 is better than the others', and the same holds for the F1 and F2 measures. Although BMA presents higher mean assessments for this case, the variances of its F-measures are not smaller than the others'.

The AUC, which measures the discriminative ability of a model, is at a good level for every model: all the AUC values fall between 0.70 and 0.75. The best model is the stepwise one, the second is the BMA model, and the third is the complete one.

Often there is a trade-off between precision and recall. For example, if model A produces fewer positive test outcomes than model B at the same cut point, model A will have fewer false positives and more false negatives. From the definitions of precision and recall, its precision is then likely to increase (though not always, since it also depends on the number of true positives) and its recall is likely to decrease. With this in mind, we can discover something interesting in our analysis. For this car


data, equipped with the F1-oriented cut point, the BMA method is best in precision, the complete model second, and the stepwise approach third. Conversely, in recall the stepwise approach performs best, the complete model second, and BMA third. The variances of precision and recall in the complete model are larger than in the BMA and stepwise ones. One possible explanation is as follows: with similar F1-oriented cut points (both around 0.6), the BMA model produced fewer positive test outcomes than the stepwise one; that is, fewer cases were judged to be kicked vehicles by BMA than by the stepwise model (shown in Table 10).

Overall, the predictive performance does not meet our expectations, owing to low precision and recall. Possible reasons include the limitations of logistic regression (such as the linearity assumption), ignoring interactions among predictors, and our limited car-domain knowledge. BMA, which belongs to the ensemble methods, shows better

