Introduction to Credit Risk Data_R1

(1)

Software Course and Case Practice

Introduction to Credit Risk Data

Cheyu HUNG / 洪哲裕

StatSoft Holdings, Inc., Taiwan Branch

(2)

© Copyright StatSoft, Inc., 1984-2013. StatSoft, StatSoft logo, and STATISTICA are trademarks of StatSoft, Inc.

Credit Risk Data

■ The application for this data

■ Variables in the Credit Risk data

(3)

Applications

■ Practically all data need some preparation work.

  ■ Handling missing data and outliers.

  ■ Selecting important variables.

  ■ Sampling.

■ Classification tasks have many uses

  ■ Classify a variable with 2 or more groups.

  ■ Find the probability of a predicted classification.

(4)

Application for Credit Risk Data

■ A financial institution needs a way to decide whether, and how much, credit to extend to customers who apply. This is our business need.

■ The goals of the data mining project include:

  ■ Determining the variables that are the best predictors of credit risk.

  ■ Finding a high-performance predictive model that classifies customers.

  ■ Deploying that model to make decisions on credit applications.

(5)

Next Steps for the Data Mining Project

■ The project goals are expressed and data are available.

■ The next step is to understand the data. We will do this by reviewing the data graphically.

■ Later, the data will need to be prepared using data cleaning tools.

(6)

Introduction to Credit Risk Data

■ In this session, we discussed the business need and application for the data.

■ We reviewed the variables.

(7)

Next Steps

■ With an understanding of the application, we are ready to start working with the data.

■ The next session will look at query and import options to bring the data into STATISTICA from a database or external format.

(8)

Initial Graphical Review

■ Review the data graphically to reveal issues that will need to be addressed in the data cleaning phase.

(9)

Data Cleaning for Outliers

■ Detecting outliers

  ■ Graphically

  ■ Statistical tests

■ Handling outliers

  ■ What caused the outlier?

  ■ What can be done to clean the data?

(10)

Detecting Outliers Graphically

■ A box plot can show outliers in continuous data.

■ A histogram can show outliers in categorical data.
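
The box plot rule can be sketched in a few lines of Python. This is a minimal illustration, not STATISTICA's implementation: it assumes the common convention that points lying more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers. The function names and the ages below are made up for the example.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) fences used by a conventional box plot."""
    # method="inclusive" treats the data as the whole population of interest.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical ages, not taken from the Credit Risk data.
ages = [23, 25, 31, 35, 38, 41, 44, 47, 52, 95]
print(flag_outliers(ages))
```

Changing `k` widens or narrows the fences; 1.5 is only the usual default.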

(11)

Detecting Outliers with Statistical Tests

■ Grubbs test

■ Normal distribution

■ Percentiles

■ Tukey

Descriptive Statistics

Variable    Grubbs Test Statistic    p-value

Age         3.458622                 0.546786
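
As a rough sketch of what the Grubbs test statistic measures, the snippet below computes the largest absolute deviation from the mean in units of the sample standard deviation. It assumes the two-sided form of the statistic and does not compute the p-value, which requires the t distribution. The function name and the sample values are illustrative only.

```python
from statistics import mean, stdev


def grubbs_statistic(values):
    """Largest absolute deviation from the mean, in sample standard deviations."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return max(abs(v - m) for v in values) / s


# Made-up values; the last one sits far from the rest.
print(grubbs_statistic([1, 2, 3, 4, 100]))
```

A large statistic by itself does not prove an outlier; the test compares it with a critical value that depends on the sample size.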

(12)

Handling Outliers

■ Remove the outliers

  ■ Are the outliers data entry errors?

  ■ Are the outliers due to entries in the data set that don’t belong there?

■ Keep the outliers

  ■ Are the outliers legitimate points that simply have extreme values?

(13)

C&RT for Classification

■ What is C&RT

■ Misclassification cost

■ Stopping conditions

■ Cross Validation

■ Surrogates

The terms above are assumed to be already familiar.

(14)

C&RT

■ Classification and Regression Trees

■ A nonparametric data mining algorithm for generating decision trees.

■ Splits are made by variables that best differentiate the target variable.

■ Each node can be split into two child nodes.
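
To make the idea of a "best" binary split concrete, here is a minimal sketch that scores candidate thresholds on one continuous predictor. It assumes Gini impurity as the goodness-of-split measure, which is one common choice for classification trees; the function names and the data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_binary_split(xs, ys):
    """Return (threshold, weighted_gini) for the best rule 'x <= threshold'.

    Returns None when the predictor has a single distinct value.
    """
    best = None
    n = len(xs)
    for t in sorted(set(xs))[:-1]:  # the largest value cannot split the node
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Made-up ages and credit ratings.
ages = [20, 25, 30, 45, 50, 60]
rating = ["bad", "bad", "bad", "good", "good", "good"]
print(best_binary_split(ages, rating))
```

A full tree repeats this search over every predictor at every node and then recurses into the two child nodes.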

(15)

Misclassification cost

■ Misclassification is inevitable since no model is perfect.

■ Some misclassifications are worse than others, so STATISTICA allows you to account for this with misclassification costs.

[Table: misclassification cost matrix, Observed (rows) by Predicted (columns), for Good Credit and Bad Credit. Recoverable cell text: "Correct"; "% X margin lost (by unit)"; "Incorrect"; "% X margin lost (by unit)".]
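
One way to see how misclassification costs change the comparison of models is to weight each cell of the observed-by-predicted table by its cost. The sketch below is illustrative only: the counts and the cost values are invented, and the function name is hypothetical.

```python
def average_cost(confusion, costs):
    """Average misclassification cost per case.

    confusion[observed][predicted] holds case counts;
    costs[observed][predicted] holds the cost of that outcome.
    """
    total_cases = 0
    total_cost = 0.0
    for observed, row in confusion.items():
        for predicted, count in row.items():
            total_cases += count
            total_cost += count * costs[observed][predicted]
    return total_cost / total_cases


# Invented counts and costs: approving a bad credit is assumed
# to be five times as costly as rejecting a good one.
confusion = {
    "good": {"good": 80, "bad": 20},
    "bad": {"good": 10, "bad": 90},
}
costs = {
    "good": {"good": 0, "bad": 1},
    "bad": {"good": 5, "bad": 0},
}
print(average_cost(confusion, costs))
```

With equal costs this reduces to the plain misclassification rate, so unequal costs are what make one type of error count more than the other.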

(16)

Stopping Conditions

Several options are available; choosing one is enough.

■ Decision Tree Pruning

■ Misclassification error or Deviance

  ■ Select a minimum number of cases for a node to be considered for splitting.

  ■ Select a maximum number of total nodes.

■ FACT direct stopping

(17)

Cross Validation

■ Cross Validation is a method to guard against overfitting, where a model fits the training data well but fails to generalize to new data.

■ V-Fold Cross Validation

  ■ Good for smaller data sets, when holding out a test sample is not feasible.

  ■ Repeats the analysis on V different random samples taken from the data and compares the resulting trees.

■ Train-Test Sample Cross Validation

  ■ Test sample data is used to determine if the right size tree was found, based on how well the tree performs on the test data.
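
The V-fold idea can be sketched as a way of partitioning case indices. This is a generic illustration, not STATISTICA's procedure: it assumes each case is held out exactly once, and the function name and seed are arbitrary.

```python
import random


def v_fold_indices(n_cases, v, seed=0):
    """Yield (train, test) index lists for V-fold cross validation."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # random assignment of cases to folds
    folds = [indices[i::v] for i in range(v)]
    for i in range(v):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


for train, test in v_fold_indices(n_cases=10, v=5):
    print(sorted(test), "held out;", len(train), "cases used for training")
```

The model is fitted V times, each time on the training indices only, and its error is averaged over the held-out folds.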

(18)

Surrogates

■ In the deployment stage, surrogate splits are used in place of the actual split variable when its value is missing.

■ The surrogate is the next best split variable.
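
A minimal sketch of how a surrogate might be used when a case is scored: if the primary split variable is missing, the rule for the surrogate variable decides instead. The variable names, thresholds, and the "value <= threshold goes left" rule are assumptions made for the example.

```python
def route_case(case, primary, surrogate):
    """Send a case 'left' or 'right' at one node.

    primary and surrogate are (variable_name, threshold) pairs.
    The first variable with a non-missing value decides.
    """
    for variable, threshold in (primary, surrogate):
        value = case.get(variable)
        if value is not None:
            return "left" if value <= threshold else "right"
    raise ValueError("both split variables are missing for this case")


# Hypothetical node: primary split on income, surrogate split on age.
primary = ("income", 40000)
surrogate = ("age", 35)
print(route_case({"income": None, "age": 50}, primary, surrogate))
```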

(19)

CHAID Trees

■ What is CHAID

■ Analysis options

■ Exhaustive CHAID

(20)

CHAID

■ Chi Square Automatic Interaction Detection

■ Performs multi-level splits where C&RT uses binary splits.

■ Well suited for large data sets.

(21)

CHAID Analysis Options

■ ANOVA type design

■ Misclassification costs

■ Cross Validation

■ Bonferroni adjustment

Risk estimates (credit scoring for model building.sta)

Dependent variable: Credit Rating

Options: Categorical response

Risk estimate    Standard error

(22)

CHAID Analysis Options

■ Stopping parameters

  ■ Minimum cases for a node to be split

  ■ Maximum number of total nodes

  ■ Probability for merging and splitting

(23)

Exhaustive CHAID

Optional topic

■ More computationally intensive; large data sets may require extended computation time.

■ Performs more thorough merging and testing of predictor variables to find the best split candidate.

(24)

Boosted Trees

■ What are boosted trees?

■ Stopping parameters

(25)

Boosted Trees

■ The idea of Boosted trees is to build a sequence of simple trees, weighting them inversely by misclassification.

■ The final classification for deployment is based on voting from these simple trees.
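
One possible reading of "weighting inversely by misclassification" is the AdaBoost-style rule sketched below, where a tree with a lower error rate gets a larger say in the vote. This is an assumption made for illustration; it is not confirmed as the weighting STATISTICA uses, and the function names and error rates are invented.

```python
import math


def tree_weight(error_rate):
    """AdaBoost-style weight: lower error gives a larger weight.

    Assumes 0 < error_rate < 1.
    """
    return math.log((1.0 - error_rate) / error_rate)


def weighted_vote(predictions, error_rates):
    """Combine one prediction per tree into a single class label."""
    totals = {}
    for label, err in zip(predictions, error_rates):
        totals[label] = totals.get(label, 0.0) + tree_weight(err)
    return max(totals, key=totals.get)


# Three simple trees: one accurate tree outvotes two weak ones.
print(weighted_vote(["good", "bad", "bad"], [0.10, 0.40, 0.45]))
```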

(26)

Boosted Trees Analysis Options

■ Learning Rate

■ Number of additive terms

■ Random test data proportions

■ Subsample proportions

(27)

Boosted Trees Stopping Parameters

■ Minimum number of cases

■ Maximum number of levels

■ Minimum number in child nodes

(28)

Random Forests For Classification

■ What is Random Forests classification?

(29)

Random Forests

■ Random Forests builds a series of trees.

■ Each tree predicts a classification.

(30)

Random Forest Analysis Options

■ Number of predictors: the suggested setting is log2(M+1), where M is the number of predictor variables

■ Number of trees

■ Sampling proportions
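
The log2(M+1) setting can be worked out directly. The sketch below assumes M is the total number of predictor variables and that the result is rounded to a whole number of at least 1; the rounding rule, the function names, and the variable names are assumptions for illustration.

```python
import math
import random


def predictors_per_split(m):
    """Number of predictors to consider at each split: log2(M + 1), rounded."""
    return max(1, round(math.log2(m + 1)))


def sample_predictors(names, seed=None):
    """Draw a random subset of predictor names of the suggested size."""
    k = predictors_per_split(len(names))
    return random.Random(seed).sample(names, k)


# Seven hypothetical predictors give log2(8) = 3 candidates per split.
names = ["age", "income", "job", "housing", "savings", "duration", "purpose"]
print(predictors_per_split(len(names)), sample_predictors(names, seed=1))
```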

(31)

Comparing Performance across Models

■ Generating deployment code

■ Rapid Deployment results

(32)

Generating Deployment Code

Optional topic

■ Many Data Mining and Statistics tools in STATISTICA have the ability to generate deployment code.

  ■ STATISTICA Visual Basic

  ■ C/C++ language

  ■ PMML Script

  ■ Deployment to STATISTICA Enterprise

(33)

Rapid Deployment

Optional topic

■ Load multiple data mining models

■ Make predictions on new data

■ Generate lift and gains charts comparing models

■ Write predictions back to data file

(34)

Lift and Gains Charts

Optional topic

■ Lift chart

  ■ Shows the effectiveness of the model compared to no model.

■ Gains chart

  ■ Shows the percentage of observations correctly classified for the given category, in this case, bad.
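
A single point on the gains and lift charts can be computed as sketched below. The sketch assumes each case has a model score for the target category (here "bad") and that cases are ranked from the highest score down; the function name, scores, and labels are invented for the example.

```python
def gain_and_lift(scores, labels, target, fraction):
    """Gain and lift when the top `fraction` of cases by score is selected.

    gain: share of all target cases captured in the selected group.
    lift: gain divided by the share of cases selected (1.0 means no better
          than picking cases at random, that is, no model).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    hits = sum(1 for _, label in ranked[:n_top] if label == target)
    total = sum(1 for label in labels if label == target)
    gain = hits / total
    lift = gain / (n_top / len(ranked))
    return gain, lift


# Invented scores for the "bad" category and the observed ratings.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = ["bad", "bad", "good", "bad", "good",
          "good", "good", "good", "good", "good"]
print(gain_and_lift(scores, labels, target="bad", fraction=0.2))
```

Repeating the calculation for fractions from 0.1 to 1.0 gives the points of the two curves.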

(35)

Voting Across Models

■ What is voting or bagging?

■ Instability of results in small data sets

■ Reviewing results

(36)

What is Voting or Bagging?

■ Data Mining offers a variety of model building tools, so a large number of models can be created in a given project.

■ None of these models will fully capture the underlying relationship of the data.

■ Using an ensemble of models together to determine the final prediction is called voting or bagging.
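
The voting step can be sketched as a simple majority vote over the class predicted by each model. This is a generic illustration with invented predictions; ties are resolved in favor of the class that appears first, which is an arbitrary choice made for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the most models.

    predictions maps a model name to its predicted class for one case.
    """
    counts = Counter(predictions.values())
    return counts.most_common(1)[0][0]


# Invented predictions for one applicant.
predictions = {
    "C&RT": "bad",
    "CHAID": "good",
    "Random Forest": "bad",
    "Boosted Trees": "bad",
}
print(majority_vote(predictions))
```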

(37)

What is Voting or Bagging?

[Diagram: Input Data feeds C&RT, CHAID, Random Forest, and Boosted Trees; their predictions go to a Vote, which gives the Final Prediction.]

(38)

Instability of Results in Small Data Sets

■ Data Mining can model complex relationships between variables.

■ Without ample data, instability can be an issue.

■ Using multiple models with voting combats the problem of instability in results. The ensemble of models typically outperforms the individual models.

(39)

■ Rapid Deployment gives predictions based on voting. That output can generate other

(40)

Q/A

■ You are welcome to email us!

■ service@statsoft.com.tw