Making the World More Productive™
Software Course and Case Practice
Introduction to Credit Risk Data
Cheyu HUNG / 洪哲裕
StatSoft Holdings, Inc., Taiwan Branch
© Copyright StatSoft, Inc., 1984-2013. StatSoft, StatSoft logo, and STATISTICA are trademarks of StatSoft, Inc.
Credit Risk Data
■ The application for this data
■ Variables in the Credit Risk data
Applications
■ Practically all data need some preparation work.
■ Handling missing data and outliers.
■ Selecting important variables.
■ Sampling
■ Classification tasks have many uses
■ Classify a variable with 2 or more groups
■ Find probability of a predicted classification
Application for Credit Risk Data
■ A financial institution needs a way to decide if and how much credit to extend to customers who apply. This is our business need.
■ The goals of the data mining project include:
■ Determining the variables that are best predictors of credit risk,
■ Finding a high-performance predictive model that classifies customers,
■ Deploying that model to make decisions on credit applications.
Next Steps for the Data Mining Project
■ The project goals are expressed and data are available.
■ The next step is to understand the data. We will do this by reviewing the data graphically.
■ Later, the data need to be prepared using data cleaning tools.
Introduction to Credit Risk Data
■ In this session, we discussed the business need and application for the data.
■ We reviewed the variables
Next Steps
■ With an understanding of the application, we are ready to start working with the data.
■ The next session will look at query and import options to bring the data into STATISTICA from a database or external format.
Initial Graphical Review
■ Review the data graphically to reveal issues that will need to be addressed in the data cleaning phase.
Data Cleaning for Outliers
■ Detecting outliers
■ Graphically
■ Statistical tests
■ Handling outliers
■ What caused the outlier?
■ What can be done to clean the data?
Detecting Outliers Graphically
■ A box plot can show outliers in continuous data.
■ A histogram (frequency plot) can show rare or unexpected categories in categorical data.
Detecting Outliers with Statistical Tests
■ Grubbs test
■ Normal distribution
■ Percentiles
■ Tukey
Descriptive Statistics
Variable    Grubbs Test Statistic    p-value
Age         3.458622                 0.546786
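The Tukey and Grubbs checks listed above can be sketched in a few lines. This is a minimal illustration, not STATISTICA's implementation: the `ages` values are hypothetical, the function names are invented for this example, and the p-value of the Grubbs test (which needs the t distribution) is not computed.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def grubbs_statistic(values):
    """Largest absolute deviation from the mean, in sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return max(abs(v - mean) for v in values) / sd

# Hypothetical ages, not taken from the course data set.
ages = [23, 25, 31, 35, 38, 41, 44, 47, 52, 95]

low, high = tukey_fences(ages)
outliers = [a for a in ages if a < low or a > high]
g = grubbs_statistic(ages)

print("Tukey fences:", low, high)
print("Flagged by Tukey:", outliers)
print("Grubbs statistic:", round(g, 3))
```

A flagged value is only a candidate outlier; whether to remove or keep it is a separate decision.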
Handling Outliers
■ Remove the outliers
■ Are the outliers data entry errors?
■ Are the outliers due to entries in the data set that don’t belong there?
■ Keep the outliers
■ Are the outliers legitimate points that simply have extreme values?
C&RT for Classification
■ What is C&RT
■ Misclassification cost
■ Stopping conditions
■ Cross Validation
■ Surrogates
We assume the terms above are already known.
C&RT
■ Classification and Regression Trees
■ A nonparametric data mining algorithm for generating decision trees.
■ Splits are made by variables that best differentiate the target variable.
■ Each node can be split into two child nodes
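As a rough sketch of how one binary split might be chosen, the example below scores every candidate threshold of a single predictor by weighted Gini impurity and keeps the best one. Gini is only one common goodness-of-split measure, and the `income` and `rating` values and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Try each midpoint between sorted distinct values.

    Returns (threshold, weighted impurity of the two child nodes).
    The threshold is None if no split improves on the parent node.
    """
    distinct = sorted(set(values))
    best = (None, gini(labels))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical incomes (in thousands) and credit ratings.
income = [18, 22, 25, 40, 48, 60]
rating = ["bad", "bad", "bad", "good", "good", "good"]

threshold, impurity = best_binary_split(income, rating)
print(threshold, impurity)
```

A full tree repeats this search over every predictor at every node until a stopping condition is met.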
Misclassification cost
■ Misclassification is inevitable since no model is perfect.
■ Some misclassifications are worse than others, so STATISTICA allows you to account for this with misclassification costs.
                        Predicted Good Credit                Predicted Bad Credit
Observed Good Credit    Correct % × margin lost (by unit)    Incorrect % × margin lost (by unit)
Observed
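To see why misclassification costs matter, the sketch below compares two hypothetical models that make the same number of errors but different kinds of errors. All counts and cost weights are invented for illustration; in practice the weights would come from the margin lost per unit.

```python
# Keys are (observed, predicted). Costs are hypothetical: approving a bad
# customer is assumed to be five times as costly as rejecting a good one.
cost = {
    ("good", "good"): 0.0, ("good", "bad"): 1.0,
    ("bad", "good"): 5.0, ("bad", "bad"): 0.0,
}

# Hypothetical confusion counts for two models on the same 1000 cases.
model_a = {("good", "good"): 700, ("good", "bad"): 50,
           ("bad", "good"): 80, ("bad", "bad"): 170}
model_b = {("good", "good"): 670, ("good", "bad"): 80,
           ("bad", "good"): 50, ("bad", "bad"): 200}

def error_rate(counts):
    """Share of cases where observed and predicted classes differ."""
    wrong = sum(n for (obs, pred), n in counts.items() if obs != pred)
    return wrong / sum(counts.values())

def total_cost(counts, cost):
    """Sum of count times cost over all cells of the confusion table."""
    return sum(n * cost[cell] for cell, n in counts.items())

for name, counts in (("A", model_a), ("B", model_b)):
    print(name, error_rate(counts), total_cost(counts, cost))
```

Both models misclassify 13% of cases, yet model B is cheaper under these assumed costs because it makes fewer of the expensive errors.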
Stopping Conditions
Several options are available; choosing one is enough.
■ Decision Tree Pruning
■ Misclassification error or Deviance
■ Select a minimum number of cases for a node to be considered for splitting.
■ Select a maximum number of total nodes.
■ FACT direct stopping
Cross Validation
■ Cross validation is a method for guarding against overfitting, where a model fits the training data closely but fails to generalize to new data.
■ V-Fold Cross Validation
■ Good for smaller data sets, when holding out a test sample is not feasible.
■ Repeats the analysis on V different random samples taken from the data and compares the resulting trees.
■ Train – Test Sample Cross Validation
■ Test sample data is used to determine if the right size tree was found based on how well the tree performs on the test data.
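The V-fold idea can be sketched as follows: the cases are shuffled once and cut into V folds, and each fold serves once as the test sample while the rest are used for training. This is a generic sketch with invented names, and it does not claim to match how STATISTICA draws its samples.

```python
import random

def v_fold_indices(n_cases, v, seed=0):
    """Yield (train, test) index lists, one pair per fold."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::v] for i in range(v)]
    for i in range(v):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(v_fold_indices(n_cases=10, v=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

A train-test split is the special case of holding out a single test sample once.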
Surrogates
■ In the deployment stage, a surrogate split is used in place of the actual split variable when that variable's value is missing.
■ The surrogate is the variable whose split most closely reproduces the split made by the actual split variable.
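One way to picture a surrogate is to measure how often a candidate split sends cases to the same side as the primary split, then fall back on the best candidate when the primary variable is missing. The sketch below assumes that agreement measure; the variables, thresholds, and cases are hypothetical.

```python
def goes_left(value, threshold):
    """A case goes to the left child when its value is at or below the threshold."""
    return value <= threshold

def agreement(primary, candidate, cases):
    """Fraction of cases that both splits send to the same side."""
    p_var, p_thr = primary
    c_var, c_thr = candidate
    same = sum(
        goes_left(c[p_var], p_thr) == goes_left(c[c_var], c_thr) for c in cases
    )
    return same / len(cases)

def route(case, primary, surrogate):
    """Use the primary split unless its variable is missing (None)."""
    var, thr = primary
    if case[var] is None:
        var, thr = surrogate
    return "left" if goes_left(case[var], thr) else "right"

# Hypothetical training cases with complete data.
training = [
    {"income": 18, "age": 22}, {"income": 22, "age": 25},
    {"income": 25, "age": 41}, {"income": 40, "age": 38},
    {"income": 48, "age": 45}, {"income": 60, "age": 52},
]
primary = ("income", 32.5)
candidates = [("age", 30), ("age", 40)]

surrogate = max(candidates, key=lambda c: agreement(primary, c, training))
new_case = {"income": None, "age": 27}  # income is missing at deployment
side = route(new_case, primary, surrogate)
print(surrogate, side)
```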
CHAID Trees
■ What is CHAID
■ Analysis options
■ Exhaustive CHAID
CHAID
■ Chi-squared Automatic Interaction Detection
■ Performs multi-level splits where C&RT uses binary splits.
■ Well suited for large data sets.
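CHAID relies on chi-squared tests between a predictor and the target to decide which categories to merge and where to split. The sketch below computes only the Pearson chi-squared statistic for one contingency table; the counts are hypothetical, and the merging logic and p-value adjustments are left out.

```python
def chi_square(table):
    """Pearson chi-squared statistic for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are three categories of one predictor,
# columns are good credit and bad credit.
table = [
    [30, 10],
    [20, 20],
    [10, 30],
]
stat = chi_square(table)
dof = (len(table) - 1) * (len(table[0]) - 1)
print(stat, dof)
```

A multi-level split keeps all three categories as separate child nodes when the test says they differ, which a binary split cannot do in one step.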
CHAID Analysis Options
■ ANOVA type design
■ Misclassification costs
■ Cross Validation
■ Bonferroni adjustment
Risk estimates (credit scoring for model building.sta)
Dependent variable: Credit Rating
Options: Categorical response
Risk estimate    Standard error
CHAID Analysis Options
■ Stopping parameters
■ Minimum cases for a node to be split
■ Maximum number of total nodes
■ Probability for merging and splitting
Exhaustive CHAID
Optional topic
■ More computationally intensive; large data sets may require extended computation time.
■ Performs more thorough merging and testing of predictor variables to find the best split candidate.
Boosted Trees
■ What are boosted trees?
■ Analysis options
■ Stopping parameters
Boosted Trees
■ The idea of boosted trees is to build a sequence of simple trees, weighting each tree inversely to its misclassification rate.
■ The final classification for deployment is based on voting from these simple trees.
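The sketch below illustrates the idea with AdaBoost on one-split trees (stumps), a classic boosting scheme chosen here only for illustration; it is not necessarily the algorithm STATISTICA implements. Trees with lower weighted error get a larger vote, and the data are a small hypothetical set with classes coded as +1 and -1.

```python
import math

def stump_predict(x, threshold, sign):
    """A one-split tree: predict sign on the left side, -sign on the right."""
    return sign if x <= threshold else -sign

def fit_stump(xs, ys, weights):
    """Return (weighted error, threshold, sign) of the best stump."""
    distinct = sorted(set(xs))
    best = None
    for threshold in [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]:
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

def boost(xs, ys, rounds):
    """Build a sequence of weighted stumps."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, threshold, sign = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # lower error, larger vote
        model.append((alpha, threshold, sign))
        # Raise the weight of misclassified cases for the next tree.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in model)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can classify perfectly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = boost(xs, ys, rounds=3)
predictions = [predict(model, x) for x in xs]
print(predictions == ys)
```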
Boosted Trees Analysis Options
■ Learning Rate
■ Number of additive terms
■ Random test data proportions
■ Subsample proportions
Boosted Trees Stopping Parameters
■ Minimum number of cases
■ Maximum number of levels
■ Minimum number in child nodes
Random Forests For Classification
■ What is Random Forests classification?
■ Analysis options
Random Forests
■ Random Forests builds a series of trees.
■ Each tree predicts a classification.
Random Forest Analysis Options
■ Number of predictors: a suggested setting is log2(M + 1), where M is the total number of predictors
■ Number of trees
■ Sampling proportions
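A small sketch of how these settings might be used when growing one tree. The rounding of log2(M + 1), sampling cases with replacement, and all names below are assumptions made for illustration, not STATISTICA's documented behavior.

```python
import math
import random

def predictors_per_split(m):
    """log2(M + 1), truncated to an integer and never below 1 (rounding assumed)."""
    return max(1, int(math.log2(m + 1)))

def sample_cases(n_cases, proportion, rng):
    """Draw the cases for one tree, here with replacement (an assumption)."""
    size = max(1, int(n_cases * proportion))
    return [rng.randrange(n_cases) for _ in range(size)]

def candidate_predictors(predictors, rng):
    """Draw the random subset of predictors considered at one split."""
    return rng.sample(predictors, predictors_per_split(len(predictors)))

rng = random.Random(0)
predictors = [f"x{i}" for i in range(1, 16)]  # 15 hypothetical predictors

rows = sample_cases(n_cases=1000, proportion=0.5, rng=rng)
columns = candidate_predictors(predictors, rng)
print(len(rows), columns)
```

Each tree is grown on its own sample of cases, and the forest's classification comes from combining the predictions of all trees.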
Comparing Performance across Models
■ Generating deployment code
■ Rapid Deployment results
Generating Deployment Code
Optional topic
■ Many Data Mining and Statistics tools in STATISTICA have the ability to generate deployment code.
■ STATISTICA Visual Basic
■ C/C++ language
■ PMML Script
■ Deployment to STATISTICA Enterprise
Rapid Deployment
Optional topic
■ Load multiple data mining models
■ Make predictions on new data
■ Generate lift and gains charts comparing models
■ Write predictions back to data file
Lift and Gains Charts
Optional topic
■ Lift chart
■ Shows the effectiveness of the model compared to no model.
■ Gains chart
■ Shows the percentage of observations correctly classified for the given category (in this case, bad).
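The numbers behind both charts can be sketched as follows: cases are ranked by the model's predicted probability of "bad", and for each top share of the ranking we record how many of the bad cases were captured (gains) and how that compares with picking cases at random (lift). The scores and outcomes are hypothetical.

```python
def gains_and_lift(scores, actual, target="bad", bins=5):
    """Return (share of cases, cumulative gain, lift) for each bin."""
    ranked = [a for _, a in sorted(zip(scores, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    total_targets = ranked.count(target)
    rows = []
    for b in range(1, bins + 1):
        cutoff = round(n * b / bins)
        captured = ranked[:cutoff].count(target)
        share = cutoff / n
        gain = captured / total_targets  # share of all target cases found so far
        rows.append((share, gain, gain / share))
    return rows

# Hypothetical predicted probabilities of "bad" and the observed classes.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
actual = ["bad", "bad", "bad", "good", "bad",
          "good", "good", "good", "good", "good"]

rows = gains_and_lift(scores, actual)
for share, gain, lift in rows:
    print(f"top {share:.0%}: gain {gain:.0%}, lift {lift:.2f}")
```

A lift of 1 means the model does no better than no model at that depth of the ranking.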
Voting Across Models
■ What is voting or bagging?
■ Instability of results in small data sets
■ Reviewing results
What is Voting or Bagging?
■ Data Mining offers a variety of model building tools, so a large number of models can be created in a given project.
■ None of these models will fully capture the underlying relationship of the data.
■ Using an ensemble of models together to determine the final prediction is called voting or bagging.
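A minimal sketch of voting across models: each model predicts a class for every case, and the class predicted most often becomes the final prediction. The predictions below are hypothetical, and with an even number of models a tie falls to the class that appears first, which is a choice made only for this example.

```python
from collections import Counter

def vote(predictions_by_model):
    """Majority vote per case across models.

    predictions_by_model maps a model name to its list of predicted
    classes, one entry per case, in the same case order for all models.
    """
    per_case = zip(*predictions_by_model.values())
    return [Counter(case).most_common(1)[0][0] for case in per_case]

# Hypothetical predictions for three credit applicants.
predictions = {
    "C&RT":          ["good", "bad", "good"],
    "CHAID":         ["good", "bad", "bad"],
    "Random Forest": ["bad",  "bad", "good"],
    "Boosted Trees": ["good", "good", "good"],
}

final = vote(predictions)
print(final)
```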
[Diagram: Input Data → C&RT, CHAID, Random Forest, Boosted → Vote → Final Prediction]
Instability of Results in Small Data Sets
■ Data Mining can model complex relationships between variables.
■ Without ample data, instability can be an issue.
■ Using multiple models with voting combats the problem of instability in results. The ensemble of models typically outperforms individuals.
■ Rapid Deployment gives
predictions based on voting. That output can generate other
Q/A
■ You are welcome to email us!
■ service@statsoft.com.tw