Chapter 3 The Proposed Hybrid Model Approach
3.2 Procedure of Constructing Hybrid Credit Scoring Model
The proposed procedure of phase 1 is shown in Fig. 7. Each step in phase 1 is described as follows:
Figure 7. Flowchart of the proposed hybrid model
Step1: Data collection and cleaning
Loan applicants in this study are mid-sized companies whose financial statements are less credible than those of publicly offered companies. Financial statements are therefore only one of the factors considered in this study. Loan companies usually use financial variables (quantitative factors) and non-financial variables (qualitative factors) together to increase model accuracy and reliability.
This study collected loan data from a loan company in Taiwan for the period 2000 to 2003 as sample data and divided the dataset into two categories: “bad loan” and “good loan”. When the proposed credit scoring model classifies a loan applicant into the “bad loan” category, it predicts that the loan will default and become a bad debt. In contrast, “good loan” means that the loan applicant can repay its debt on time.
Step2: Perform CART
The procedure of constructing CART can be described as follows:
Step 1. Choose an impurity function.
Step 2. Grow the tree by maximizing the decrease in tree impurity until the tree becomes as large as possible.
Step 3. Prune the tree structure.
Step 4. Use a proper estimation method to obtain an honest estimate of the tree classifier's performance. The default setting is 10-fold cross validation.
Step 5. Interpret the results.
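As a rough illustration of Steps 1 and 2, the sketch below computes the Gini impurity and the decrease in impurity obtained by one candidate split, then scans thresholds on a single variable for the largest decrease. It is a minimal sketch, not the CART 5.0 implementation: the function names and the toy data are hypothetical, and pruning and cross validation are omitted.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, decrease)."""
    best = (None, 0.0)
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        decrease = impurity_decrease(labels, left, right)
        if decrease > best[1]:
            best = (t, decrease)
    return best

# Toy data (hypothetical): one ratio-like variable and the loan category
ratio = [0.2, 0.3, 0.4, 0.7, 0.8, 0.9]
label = ["bad", "bad", "bad", "good", "good", "good"]
threshold, decrease = best_threshold(ratio, label)
```

Growing a full tree repeats this search over every variable at every node; the sketch only shows the criterion being maximized.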
Step3: Record CART's split variables and predictive outcomes
In Step3, the split variables of the CART model are regarded as the influential variables and are recorded for the following steps. Similarly, CART's predictive outcome and CART's predictive categorical probabilities are regarded as important compressed information derived from the CART model and are retained as well.
As a result, even though the number of input variables of the hybrid model decreases, the model accuracy can still be retained by using CART's predictive outcomes and CART's predictive categorical probabilities as input variables.
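The recording step can be pictured as building one reduced and augmented input vector per loan case. The sketch below is only an illustration under stated assumptions: the function name, the dictionary layout, the 0/1 coding of the predictive outcome, and all numeric values are hypothetical and are not taken from the study.

```python
def build_hybrid_inputs(record, split_variables, cart_probs, cart_outcome):
    """Keep only CART's split variables and append CART's three recorded outputs."""
    selected = [record[name] for name in split_variables]
    # Assumed coding of the categorical outcome: 1 for "bad", 0 for "good"
    outcome_code = 1 if cart_outcome == "bad" else 0
    return selected + [cart_probs["bad"], cart_probs["good"], outcome_code]

# Hypothetical loan case; K83 is not a split variable here, so it is dropped
record = {"N4": 3, "N6": 2, "K83": 0.30, "K85": 0.61, "K97": 1.4, "Net_Value": 1200}
split_variables = ["N4", "N6", "K85", "K97", "Net_Value"]
x = build_hybrid_inputs(record, split_variables, {"bad": 0.7, "good": 0.3}, "bad")
```

The resulting vector is shorter than the full variable list but carries CART's own judgement of the case alongside the selected variables.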
Step4: Use recorded variables and predictive outcomes as input variables of the following model
CART has selected the significant variables in Step3; therefore most of the relevant information is retained in the following three variables: “CART's predictive categorical probability of bad loan”, “CART's predictive categorical probability of good loan” and “CART's predictive outcome”. These variables can be used as augmented input variables of the subsequent model to enhance the accuracy of the hybrid model. Fig. 8 displays an example of CART's recorded variables used as input variables of the following BPN model. Similarly, these three recorded variables can be introduced to other algorithms such as LDA and LR. This study also adopted several other data mining algorithms in place of BPN to examine the effectiveness of the proposed hybrid model. The cases below describe the credit scoring model constructed with each algorithm.
Case 1. Linear Discriminant Analysis (LDA):
Specify appropriate prior probabilities for each category and use LDA to obtain results. LDA is performed using SAS 8.1, and the classification result is evaluated through N-fold cross validation.
Case 2. Quadratic Discriminant Analysis (QDA):
Specify appropriate prior probabilities for each category and use QDA to obtain results. QDA is performed using SAS 8.1, and the classification result is evaluated through N-fold cross validation.
Case 3. Logistic Regression (LR):
Specify an appropriate probability threshold and use LR to obtain results. LR is performed using SAS 8.1, and the classification result is evaluated through the hold-out method: 80% of the data are chosen randomly to construct the LR model, and the remaining 20% are used to validate its accuracy.
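The hold-out evaluation and the probability threshold can be sketched as follows. This is a minimal illustration rather than the SAS procedure: the function names and the random seed are hypothetical, and the rule that a predicted bad-loan probability at or above the threshold is labelled “bad” is an assumption about how the threshold is applied.

```python
import random

def holdout_split(cases, train_fraction=0.8, seed=0):
    """Shuffle the cases and split them into training and validation parts."""
    rng = random.Random(seed)   # fixed seed (hypothetical) so the split is repeatable
    shuffled = cases[:]         # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def classify_by_threshold(prob_bad, threshold):
    """Assumed rule: label a case "bad" when its predicted probability reaches the threshold."""
    return "bad" if prob_bad >= threshold else "good"

cases = list(range(2080))       # stand-in identifiers for the loan cases
train, test = holdout_split(cases)
```

A lower threshold labels more applicants as “bad”, which raises bad-loan accuracy at the expense of good-loan accuracy; this trade-off is why the threshold has to be specified deliberately.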
Case 4. Back Propagation Neural network (BPN):
The BPN [10] architecture is a three-layer network with fully interconnected neurons. The number of hidden nodes is decided by a cascade learning rule: hidden nodes are added gradually until the prediction accuracy for “testing bad loan” no longer increases. For the learning rate, momentum, and learning epochs, this study uses a small learning rate and a large number of epochs to avoid the disturbance of overfitting. Testing accuracy is also taken into account when setting the number of epochs. The detailed network parameter settings adhere to these principles. BPN is performed using Neural Shell2 (NeuralWare), and the classification result is evaluated through the hold-out method: 80% of the data are chosen randomly to train the BPN model, and the remaining 20% are used to validate its accuracy.
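The cascade learning rule described above amounts to a simple stopping loop. The sketch below assumes a caller-supplied `evaluate` function that trains a network with a given number of hidden nodes and returns the testing bad-loan accuracy; that function, the node limits, and the toy accuracy curve are all hypothetical stand-ins, not the Neural Shell2 procedure.

```python
def cascade_select(evaluate, start=1, max_nodes=50):
    """Add hidden nodes one at a time until the accuracy stops increasing."""
    best_nodes = start
    best_acc = evaluate(start)
    for nodes in range(start + 1, max_nodes + 1):
        acc = evaluate(nodes)
        if acc <= best_acc:     # no further increase: stop growing the network
            break
        best_nodes, best_acc = nodes, acc
    return best_nodes, best_acc

def toy_accuracy(nodes):
    """Made-up accuracy curve that peaks at 15 hidden nodes (illustration only)."""
    return 80 - (nodes - 15) ** 2

nodes, accuracy = cascade_select(toy_accuracy)
```

Because the loop stops at the first non-improving step, it can halt at a local peak; a real accuracy curve is noisier than this toy one.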
Case 5. Probabilistic Neural network (PNN):
The architecture of PNN [10] is determined directly by the observations in the dataset. The only parameter that must be set manually is the smoothing parameter. This study adopts cascade learning to decide the best smoothing parameter. PNN is performed using Neural Shell2 (NeuralWare), and the classification result is evaluated through the hold-out method: 80% of the data are chosen randomly to train the PNN model, and the remaining 20% are used to validate its accuracy.
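To show why the smoothing parameter is the only setting a PNN needs, the sketch below classifies a case by the average Gaussian kernel value of the training cases in each class. It is a textbook-style sketch under the assumptions of equal priors and equal misclassification costs, not the Neural Shell2 implementation; the function name and the toy data are hypothetical.

```python
import math

def pnn_classify(x, train_x, train_y, sigma):
    """Return the class with the largest average Gaussian kernel value, plus all scores."""
    sums, counts = {}, {}
    for xi, yi in zip(train_x, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        kernel = math.exp(-sq_dist / (2.0 * sigma ** 2))
        sums[yi] = sums.get(yi, 0.0) + kernel
        counts[yi] = counts.get(yi, 0) + 1
    scores = {c: sums[c] / counts[c] for c in sums}
    return max(scores, key=scores.get), scores

# Toy training cases (hypothetical), two input variables each
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["good", "good", "bad", "bad"]
label_a, _ = pnn_classify([0.2, 0.4], train_x, train_y, sigma=0.355)
label_b, _ = pnn_classify([5.5, 5.0], train_x, train_y, sigma=0.355)
```

A small `sigma` makes each training case influence only its immediate neighbourhood, while a large one blurs the classes together, so the parameter is tuned against testing accuracy.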
Case 6. General Regression Neural network (GRNN):
As with PNN, the architecture of GRNN is determined directly by the observations in the dataset. The only parameter that must be set manually is the smoothing parameter, and this study again adopts cascade learning to decide its best value. GRNN is performed using Neural Shell2 (NeuralWare), and the classification result is evaluated through the hold-out method: 80% of the data are chosen randomly to train the GRNN model, and the remaining 20% are used to validate its accuracy.
Case 7. Group Method of Data Handling (GMDH):
GMDH is performed using Neural Shell2 (NeuralWare), and the classification result is evaluated through the hold-out method: 80% of the data are chosen randomly to train the GMDH model, and the remaining 20% are used to validate its accuracy.
Case 8. K-Nearest Neighbor (KNN):
Two parameters need to be set when training KNN [10]: the value of “K”, which is the number of nearest neighbors, and the measure of distance. This study uses Euclidean distance as the measure of distance, and the best value of K is decided by a rule of thumb (trial and error). KNN is performed using Matlab6.5 (MathWorks Inc.), and the classification result is evaluated through the hold-out method: 80% of the data are chosen randomly to train the KNN model, and the remaining 20% are used to validate its accuracy.
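A minimal sketch of KNN with Euclidean distance, together with a trial-and-error search over K scored on the hold-out data, is given below. The function names, the candidate values of K, and the toy data are hypothetical, and the code is not the Matlab implementation used in the study.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two input vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(x, train_x, train_y, k):
    """Majority vote among the k training cases nearest to x."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(x, train_x[i]))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def holdout_accuracy(k, train_x, train_y, test_x, test_y):
    """Share of hold-out cases classified correctly for a given k."""
    hits = sum(knn_predict(x, train_x, train_y, k) == y for x, y in zip(test_x, test_y))
    return hits / len(test_x)

# Toy data (hypothetical)
train_x = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
train_y = ["good", "good", "good", "bad", "bad", "bad"]
test_x, test_y = [[0.5, 0.5], [5.5, 5.5]], ["good", "bad"]

# Trial and error: keep the candidate k with the highest hold-out accuracy
candidates = [1, 3, 5]
best_k = max(candidates, key=lambda k: holdout_accuracy(k, train_x, train_y, test_x, test_y))
```

With unbalanced classes, overall accuracy can hide poor bad-loan accuracy, so the same search can be scored on per-class accuracy instead.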
Case 9. Learning Vector Quantization (LVQ):
Three main parameters need to be set when training LVQ [10]: the number of prototypes, the learning rate, and the measure of distance. The best number of prototypes is decided by a rule of thumb (trial and error), and the initial prototypes are selected randomly from the training samples. Preliminary experiments indicated that the learning rate has no significant impact on the LVQ results, so this study sets the learning rate to 0.1. As with KNN, this study uses Euclidean distance as the measure of distance.
The number of learning epochs is not a critical factor in training LVQ because LVQ converges very fast; it is therefore set to 15. LVQ is performed using Matlab6.5 (MathWorks), and the classification result is evaluated through the hold-out method: 80% of the data are chosen randomly to train the LVQ model, and the remaining 20% are used to validate its accuracy.
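The settings above (random initial prototypes, learning rate 0.1, 15 epochs, Euclidean distance) can be illustrated with the basic LVQ1 update rule. This is a sketch under the assumption that plain LVQ1 is used; the source does not state which LVQ variant was run in Matlab, and the function names, seed, and toy data are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance (same nearest neighbour as the Euclidean distance)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def lvq1_update(proto, x, same_class, learning_rate):
    """Move the prototype toward x if the classes match, away from x otherwise."""
    sign = 1.0 if same_class else -1.0
    return [w + sign * learning_rate * (a - w) for w, a in zip(proto, x)]

def train_lvq1(train_x, train_y, n_prototypes, learning_rate=0.1, epochs=15, seed=0):
    rng = random.Random(seed)
    picks = rng.sample(range(len(train_x)), n_prototypes)  # random initial prototypes
    protos = [list(train_x[i]) for i in picks]
    labels = [train_y[i] for i in picks]
    for _ in range(epochs):
        for x, y in zip(train_x, train_y):
            j = min(range(len(protos)), key=lambda m: sq_dist(x, protos[m]))
            protos[j] = lvq1_update(protos[j], x, labels[j] == y, learning_rate)
    return protos, labels

def lvq_predict(x, protos, labels):
    """Label of the nearest prototype."""
    j = min(range(len(protos)), key=lambda m: sq_dist(x, protos[m]))
    return labels[j]

# Toy data (hypothetical): two well-separated groups
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
train_y = ["good", "good", "good", "bad", "bad", "bad"]
protos, labels = train_lvq1(train_x, train_y, n_prototypes=4)
```

Because the prototypes are drawn at random, a small prototype count can leave a class without any prototype, which is one reason the number of prototypes has to be tuned.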
Figure 8. Hybrid BPN credit scoring model
Step5: Compare the accuracy of the hybrid credit scoring models and select the best one as the final model.
Nine hybrid credit scoring models are constructed in Step4, one for each case. The model that performs best according to the model evaluation criterion is selected as the final hybrid credit scoring model.
3.3 Establish Prediction Model of Default Period
Phase 2: Establish Prediction Model of Default Period
For bad loaners, the time between the start of the loan and the loan default is defined as the “default period”, that is, the period during which the loaner still repays the debt regularly. A longer default period means a smaller potential loss of profit for the loan company; conversely, a shorter default period represents a greater default risk, and loan companies are often unable to react in time to loan applicants with a short default period.
Therefore, when a loan applicant is classified into the “bad loaner” category in phase 1, loan companies can reexamine the predicted default period in order to take precautions and respond appropriately to possible default cases. Fig. 9 describes the proposed procedure of phase 2.
Figure 9. Flowchart of default period prediction model
Each step in phase 2 is described as follows:
Step 1: Data collection and cleaning
The term “default period” is defined only for bad loaners, so phase 2 uses only the bad loan data as sample data. A prediction model of the default period can therefore be established from the bad loan cases. In addition, casewise deletion is adopted in this step.
Step 2: Model construction
This study employs three data mining algorithms to predict the default period, and the results of the three models are compared with those of a linear regression model. The three data mining algorithms are given below.
Case 1. Back Propagation Neural network (BPN):
The parameter settings and network architecture are determined as described in phase 1.
Case 2. General Regression Neural network (GRNN):
The parameter settings and the GRNN network architecture are determined as described in phase 1.
Case 3. Group Method of Data Handling (GMDH):
The parameter settings and the GMDH architecture are determined as described in phase 1.
Step 3: Model comparison
The criterion for model comparison is the mean square error (MSE); the smaller the MSE, the better. A small MSE represents a small difference between the predicted output and the target. The model with the minimum testing MSE is therefore selected as the final model of the default period. The MSE of linear regression is treated as the benchmark in this step.
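The selection rule in this step can be sketched in a few lines. The model names and every number below are made-up placeholders for illustration only; they are not results from this study.

```python
def mse(predicted, target):
    """Mean square error between predicted and target default periods."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def select_by_testing_mse(predictions, target):
    """Return the model name with the smallest testing MSE, plus all scores."""
    scores = {name: mse(pred, target) for name, pred in predictions.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical testing targets and predictions (illustration only)
target = [6, 12, 18]
predictions = {
    "benchmark": [8, 10, 20],
    "candidate_a": [7, 12, 17],
    "candidate_b": [6, 13, 18],
}
best, scores = select_by_testing_mse(predictions, target)
```

Comparing each candidate's score with the benchmark's score shows whether the data mining model improves on linear regression at all, not only which candidate ranks first.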
Chapter 4 Illustrative Examples
4.1 Description of Sample Data
The illustrative examples in this study consist of 2080 commercial bank loaners, of which 1709 are good loan cases and 371 are bad loan cases. These data were obtained from a well-known financial loan company in Taiwan for the period 2000 to 2003.
Each loan case includes 31 variables of interest, which were predetermined by the financial loan company. Detailed descriptions of the variables are summarized in Table 3. There are 14 financial variables and 17 non-financial variables: the financial variables were measured directly from the financial statements, whereas the non-financial variables were measured indirectly through the analysts' subjective judgment. From a practical point of view, both financial and non-financial variables were used to construct the credit scoring model in this study.
Table 3. Variable Description
Group | Variable Code | Rating Items | Variable Code | Rating Items
Financial structure (N6) | K83 | Own capital rate | N1 | History
Financial structure (N6) | K85 | Debit ratio | N2 | Employee's Loyalty
Financial structure (N6) | K87 | Fix ratio | N3a | Background
Liquidity Capability (N7) | K93 | Current ratio | N3b | Capability
Liquidity Capability (N7) | K95 | Rapid ratio | N4 | Company Wealth
Liquidity Capability (N7) | K97 | DSR | N5 | Credibility of Financial statement
Management Capability (N8) | K100 | Turnover days of Net value | N11 | Legal Policy
Management Capability (N8) | K102 | Turnover days of account receivable | N12 | Economic Factor
Management Capability (N8) | K104 | Inventory Turnover | N13 | Industry Trend
 | K107 | Gross profit rate | N14 | Production capability
 | K109 | Net profit rate | N15 | Marketing & Sales capability
 | | sales volume | Net_Value | Net Value
 | Default Period | Default Period | SCORE | Subtotal scores
 | Capital | Capital of company | |
4.2 Perform CART
This study used CART 5.0 from Salford Systems to perform CART.
With the minimum complexity α set to zero and the “favor even splits” setting set to 1, preliminary experiments indicated that appropriate CART models can be obtained by adjusting the prior probabilities shown in Table 2.
In addition, this study repeats the proposed procedure of phase 1 six times to generate six different CART candidate models and then uses them to construct the hybrid models. This practice is intended to verify the effectiveness of the hybrid models produced from different CART candidate models. In other words, if the hybrid model performs well whichever CART candidate model is selected, the hybrid model approach can be deemed an effective methodology. Table 4 displays the six CART candidate models. The detailed models of the six CART candidates can be found in the Appendix.
Table 4. CART candidate models
CART model | Impurity function | Number of split variables | Testing Accuracy: Good loan | Testing Accuracy: Bad loan | Abbreviation of the model
Candidate1 | GINI | 12 | 57.109 | 71.429 | Cart_1
Candidate2 | GINI | 14 | 54.535 | 72.507 | Cart_2
Candidate3 | GINI | 11 | 50.673 | 73.315 | Cart_3
Candidate4 | GINI | 10 | 51.668 | 73.046 | Cart_4
Candidate5 | GINI | 15 | 55.12 | 72.237 | Cart_5
Candidate6 | GINI | 9 | 51.551 | 73.315 | Cart_6
The split variables of each CART model are listed in Table 5. Compared with the 31 original variables, Table 5 shows a significant reduction in the number of input variables. Furthermore, these split variables can be regarded as influential variables and used to construct the hybrid model.
Table 5. CART's split variables
CART model | Number of split variables | Split variables
Candidate1 | 12 | N4 N5 N6 N7 N14 K95 K97 K104 K107 K109 Capital Net_Value
Candidate2 | 14 | N3 N4 N5 N6 N7 N9 N14 N15 K104 K107 K109 K116 Capital Net_Value
Candidate3 | 11 | N4 N6 N9 N14 N15 N17 K85 K97 K109 Capital Net_Value
Candidate4 | 10 | N4 N6 N9 N14 N15 K85 K97 K109 Capital Net_Value
Candidate5 | 15 | N3 N4 N6 N7 N9 N14 N15 K87 K95 K97 K104 K107 K109 K116 Net_Value
Candidate6 | 9 | N4 N6 N7 N14 N15 N17 K85 K97 Net_Value
Apparently, the original CART does not provide satisfactory results under any of the six candidate models.
Other original credit scoring models were also established as benchmarks. This study adopted an extensive trial and error method to find the best parameter setting for each model. The best parameter setting and the testing accuracy of each original model, obtained after many preliminary experiments, are shown in Table 6.
Table 6. Comparison of original credit scoring models
Model | Abbreviation | Testing Accuracy (%): Bad Loan | Testing Accuracy (%): Good Loan | Notes
Linear Discriminant Analysis | LDA | 79.51 | 50.46 | Priors: 0.63:0.37
Quadratic Discriminant Analysis | QDA | 76.01 | 51.61 | Priors: 0.66:0.34
Logistic Regression | LR | 79.2 | 51 | Probability level: 0.12
Classification & Regression Tree | CART | 73.04 | 51.66 | Priors: 0.59:0.41
Probabilistic Neural Network | PNN | 52.05 | 77.26 | Smoothing factor 0.355
Backpropagation Neural Network | BPN | 82.19 | 51.31 | Hidden node: 15
General Regression Neural Network | GRNN | 81.03 | 62.56 | Smoothing factor 0.6583
Group Method of Data Handling | GMDH | 82.27 | 50.74 | Criterion value 0.150836
K-Nearest Neighbor | KNN | 25.28 | 86.93 | K=1
Learning Vector Quantization | LVQ | 29.88 | 93.27 | Prototypes: 400
4.3 Record CART's Split Variables and Predictive Outcomes
The split variables, predictive categorical probabilities and predictive result of CART were recorded and used as input variables for the subsequent hybrid models.
4.4 Use recorded variables and predictive outcomes as input variables of following model
The three kinds of recorded variables (split variables, predictive categorical probabilities and predictive result) were used as input variables in LDA, QDA, BPN, etc. The input variables of the following hybrid models are summarized in Table 7.
Table 7. Input Variables of following hybrid models
CART model | Split variables
Candidate1 | N4 N5 N6 N7 N14 K95 K97 K104 K107 K109 Capital Net_Value
Candidate2 | N3 N4 N5 N6 N7 N9 N14 N15 K104 K107 K109 K116 Capital Net_Value
Candidate3 | N4 N6 N9 N14 N15 N17 K85 K97 K109 Capital Net_Value
Candidate4 | N4 N6 N9 N14 N15 K85 K97 K109 Capital Net_Value
Candidate5 | N3 N4 N6 N7 N9 N14 N15 K87 K95 K97 K104 K107 K109 K116 Net_Value
Candidate6 | N4 N6 N7 N14 N15 N17 K85 K97 Net_Value
Augmented variables from CART model: Predictive probability of bad loan of CART; Predictive probability of good loan of CART; Predictive outcome of CART.
The various hybrid models were constructed following the principles described in Step4 of Section 3.2. This study used SAS 8.1 to perform the LDA, QDA and LR analyses. For each CART candidate model, a corresponding hybrid model was built, as shown in Table 8.
Case 1. Linear Discriminant Analysis (LDA):
The performance of the hybrid LDA model and the original LDA model was compared, and the results are listed in Table 8 and Table 9. The testing bad loan accuracy of the hybrid LDA model was significantly higher than that of the original LDA model, by 5%, no matter which CART candidate model was used.
Table 8. Hybrid LDA Performance (N-fold CV)
Table 9. Original LDA Performance (N-fold CV)
Case 2. Quadratic Discriminant Analysis (QDA):
Table 10 and Table 11 also indicated that the hybrid QDA model had better prediction accuracy than the original QDA model. The testing bad loan accuracy of the hybrid QDA model increased by at least 7% compared with the original QDA model, no matter which CART candidate model was used.
Table 10. Hybrid QDA Performance (N-fold CV)
Table 11. Original QDA Performance (N-fold CV)
Case 3. Logistic Regression (LR):
Results similar to those of Case 1 and Case 2 can be observed in Table 12 and Table 13. They indicate that the hybrid LR model also performed significantly better than the original LR model according to the specified model evaluation criterion.
Table 12. Hybrid LR Performance (Testing)
Table 13. Original LR Performance (Testing)
Case 4. Back Propagation Neural network (BPN):
The BPN procedure is as follows. A very small learning rate of 0.001, a momentum of 0.85, and 2000 learning epochs are used during BPN training to avoid overfitting and fluctuation of the predictive accuracy. The number of hidden nodes is decided by the cascade learning rule, in which hidden nodes are added gradually until the accuracy of testing bad loan stops increasing. The results of the cascade learning procedure are plotted in Fig.10 and Fig.11.
Fig.10 and Fig.11 also indicate that the prediction accuracy of the hybrid BPN model produced by Cart_1 increased by up to 10% compared with the original BPN model. The other hybrid BPN models show the same improvement in bad loan accuracy.
Fig.10. Original BPN model accuracy (good loan accuracy around 50%~55%; bad loan accuracy around 75%~80%)
Fig.11. Hybrid BPN produced by Cart_1 (good loan accuracy around 55%~60%; bad loan accuracy around 87%~90%)
Fig.12. Hybrid BPN produced by Cart_2 (good loan accuracy around 55%~60%; bad loan accuracy around 87%~90%)
Fig.13. Hybrid BPN produced by Cart_3 (good loan accuracy around 55%~58%; bad loan accuracy around 85%~90%)
Fig.14. Hybrid BPN produced by Cart_4 (good loan accuracy around 50%~55%; bad loan accuracy around 87%~90%)
Fig.15. Hybrid BPN produced by Cart_5 (good loan accuracy around 55%~60%; bad loan accuracy around 90%~95%)
Fig.16. Hybrid BPN produced by Cart_6 (good loan accuracy around 50%~55%; bad loan accuracy around 85%~90%)
Fig.11-Fig.16 indicate that the hybrid model approach is clearly effective. Compared with the original BPN model, the prediction accuracy of bad loan increases by roughly 10%~15%, and the accuracy of good loan also increases by about 5%. The performance of the proposed hybrid BPN exceeds our expectations according to the specified model evaluation criterion.
Case 5. Probabilistic Neural network (PNN):
Even when the Probabilistic Neural network (PNN) is adopted, the same improvement in prediction accuracy can be observed in Table 14 and Table 15. The results in these tables also indicate that the hybrid PNN model performed significantly better than the original PNN model.
Table 14. Hybrid PNN Performance (Testing)
Table 15. Original PNN Performance (Testing)
Case 6. General Regression Neural network (GRNN):
Among all the original credit scoring models, GRNN performed best. The hybrid GRNN model nevertheless still performed well: an accuracy improvement of almost 5% to 10% was obtained when the hybrid GRNN model was employed. Table 16 and Table 17 indicate that the hybrid GRNN model performed significantly better than the original model.
Table 16. Hybrid GRNN Performance
Table 17. Original GRNN Performance