Introduction to Credit Risk Data_R1

Academic year: 2021


Making the World More Productive™

Software Course and the Case Practice

Introduction to Credit Risk Data

Cheyu HUNG / 洪哲裕

StatSoft Holdings, Inc., Taiwan Branch

© Copyright StatSoft, Inc., 1984-2013. StatSoft, StatSoft logo, and STATISTICA are trademarks of StatSoft, Inc.

Credit Risk Data

■ The application for this data

■ Variables in the Credit Risk data

Applications

■ Practically all data need some preparation work.

Handling missing data and outliers.

Selecting important variables.

Sampling.

■ Classification tasks have many uses

Classify a variable with 2 or more groups.

Find the probability of a predicted classification.

Application for Credit Risk Data

■ A financial institution needs a way to decide whether, and how much, credit to extend to customers who apply. This is our business need.

■ The goals of the data mining project include:

Determining the variables that are the best predictors of credit risk.

Finding a high-performance predictive model that classifies customers.

Deploying that model to make decisions on credit applications.

Next Steps for the Data Mining Project

■ The project goals are expressed and data are available.

■ The next step is to understand the data. We will do this by reviewing the data graphically.

■ Later, the data will need to be prepared using data cleaning tools.

Introduction to Credit Risk Data

■ In this session, we discussed the business need and application for the data.

■ We reviewed the variables.

Next Steps

■ With an understanding of the application, we are ready to start working with the data.

■ The next session will look at query and import options to bring the data into STATISTICA from a database or external format.

Initial Graphical Review

■ Review the data graphically to reveal issues that will need to be addressed in the data cleaning phase.

Data Cleaning for Outliers

■ Detecting outliers

Graphically

Statistical tests

■ Handling outliers

What caused the outlier?

What can be done to clean the data?

Detecting Outliers Graphically

■ A box plot can show outliers in continuous data.

■ A histogram can show outliers in categorical data.

Detecting Outliers with Statistical Tests

Grubbs test

Normal distribution

Percentiles

Tukey

Descriptive Statistics

Variable | Grubbs Test Statistic | p-value
Age      | 3.458622              | 0.546786
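The Tukey and Grubbs checks listed above can be sketched in a few lines of Python. This is a minimal illustration, not STATISTICA's implementation: the `ages` values are made up, the fence multiplier `k=1.5` is the conventional default, and only the Grubbs statistic is computed (the p-value, which needs the t distribution, is left out).

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def grubbs_statistic(values):
    """Largest absolute deviation from the mean, in sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return max(abs(v - mean) for v in values) / sd

# Hypothetical ages; the last value is deliberately extreme.
ages = [23, 25, 31, 35, 38, 41, 44, 47, 52, 95]
print(tukey_outliers(ages))
print(round(grubbs_statistic(ages), 3))
```

A large Grubbs statistic only flags the most extreme point as a candidate; whether to remove it is still a judgment call about what caused it.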

Handling Outliers

Remove the outliers

Are the outliers data entry errors?

Are the outliers due to entries in the data set that don’t belong there?

Keep the outliers

Are the outliers legitimate points that simply have extreme values?

C&RT for Classification

■ What is C&RT

■ Misclassification cost

■ Stopping conditions

■ Cross Validation

■ Surrogates

We assume the terms above are already familiar.

C&RT

■ Classification and Regression Trees

■ A nonparametric data mining algorithm for generating decision trees.

■ Splits are made by variables that best differentiate the target variable.

■ Each node can be split into two child nodes.
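To make the idea of a binary split concrete, here is a small sketch that searches one numeric predictor for the threshold that best separates the target. It assumes Gini impurity as the split criterion; the slides do not say which criterion is used, and the `income` and `rating` values are invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    n = len(values)
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: income (in thousands) and observed credit rating.
income = [18, 22, 25, 40, 52, 60]
rating = ["bad", "bad", "bad", "good", "good", "good"]
print(best_binary_split(income, rating))
```

A full tree repeats this search over every predictor at every node and then recurses into the two child nodes.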

Misclassification cost

■ Misclassification is inevitable since no model is perfect.

■ Some misclassifications are worse than others, so STATISTICA allows you to account for this with misclassification costs.

                     | Predicted Good Credit             | Predicted Bad Credit
Observed Good Credit | Correct % × margin lost (by unit) | Incorrect % × margin lost (by unit)
Observed
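One way to see how misclassification costs change the picture is to weight each cell of a confusion matrix by its cost. The sketch below is an illustration only: the counts and the cost values (a missed bad credit costing five times a rejected good credit) are assumptions, not figures from the course data.

```python
def expected_cost(confusion, costs):
    """Average cost per case: sum of count * cost over all cells, divided by total cases."""
    total = sum(sum(row.values()) for row in confusion.values())
    weighted = sum(
        confusion[observed][predicted] * costs[observed][predicted]
        for observed in confusion
        for predicted in confusion[observed]
    )
    return weighted / total

# Hypothetical counts, indexed as confusion[observed][predicted].
confusion = {
    "good": {"good": 80, "bad": 10},
    "bad": {"good": 5, "bad": 5},
}
# Hypothetical costs: correct predictions cost nothing.
costs = {
    "good": {"good": 0, "bad": 1},
    "bad": {"good": 5, "bad": 0},
}
print(expected_cost(confusion, costs))
```

With equal costs the two error types would count the same; unequal costs push the model toward avoiding the more expensive mistake.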

Stopping Conditions

Several options are available, but choosing one is enough.

■ Decision Tree Pruning

■ Misclassification error or Deviance

Select a minimum number of cases for a node to be considered for splitting.

Select a maximum number of total nodes.

■ FACT direct stopping

Cross Validation

■ Cross validation is a method to guard against overfitting the data and failing to generalize to new data.

■ V-Fold Cross Validation

Good for smaller data sets, when holding out a test sample is not feasible.

Repeats the analysis on V different random samples taken from the data and compares the resulting trees.

■ Train-Test Sample Cross Validation

Test sample data is used to determine whether the right size tree was found, based on how well the tree performs on the test data.
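The V-fold idea can be sketched as a generator of train and test index sets. This is the textbook partition scheme (each case is held out exactly once), which may differ in detail from how STATISTICA draws its samples; the function name and the `seed` parameter are hypothetical.

```python
import random

def v_fold_indices(n_cases, v, seed=0):
    """Yield (train, test) index lists; every case appears in exactly one test fold."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::v] for i in range(v)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in v_fold_indices(n_cases=10, v=5):
    # Fit the tree on `train`, then measure its error on `test`.
    print(sorted(test))
```

Averaging the test error over the V folds gives an estimate of how a tree of a given size would perform on new data.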

Surrogates

■ In the deployment stage, surrogate splits are used in place of the actual split variable when its value is missing.

■ The surrogate is the next best split variable.
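A rough sketch of how a surrogate is used when scoring a case: try the primary split variable first, then fall back to the next best variable if the value is missing. The variable names, thresholds, and the default of sending fully missing cases to the left child are all assumptions made for illustration.

```python
def route_case(case, primary, surrogates):
    """Return 'left' or 'right' using the primary split, falling back to surrogates.

    Each split is a (variable, threshold) pair meaning 'value <= threshold goes left'.
    """
    for variable, threshold in [primary] + surrogates:
        value = case.get(variable)
        if value is not None:
            return "left" if value <= threshold else "right"
    # Assumption: with no usable value at all, default to the left child.
    return "left"

primary = ("income", 35)
surrogates = [("age", 40)]
print(route_case({"income": 30, "age": 50}, primary, surrogates))
print(route_case({"income": None, "age": 50}, primary, surrogates))
```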

CHAID Trees

■ What is CHAID

■ Analysis options

■ Exhaustive CHAID

CHAID

■ Chi-squared Automatic Interaction Detection

■ Performs multi-level splits where C&RT uses binary splits.

■ Well suited for large data sets.
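CHAID chooses splits by testing how strongly each predictor's categories are associated with the target. The core quantity is the Pearson chi-square statistic of a contingency table, sketched below; the table of counts is invented, and the category merging and Bonferroni adjustment that CHAID performs are not shown.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows are three predictor categories,
# columns are (good credit, bad credit).
table = [[30, 10], [20, 20], [10, 30]]
print(chi_square(table))
```

A larger statistic means the categories differ more in their mix of good and bad credit, which makes the predictor a stronger split candidate.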

CHAID Analysis Options

■ ANOVA type design

■ Misclassification costs

■ Cross Validation

■ Bonferroni adjustment

[Table: Risk estimates (credit scoring for model building.sta). Dependent variable: Credit Rating. Options: Categorical response. Columns: Risk estimate, Standard error.]

CHAID Analysis Options

■ Stopping parameters

Minimum cases for a node to be split

Maximum number of total nodes

Probability for merging and splitting

Exhaustive CHAID

This step is optional.

■ More computationally intensive; large data sets may require extended computation time.

■ Performs more thorough merging and testing of predictor variables to find the best split candidate.

Boosted Trees

■ What are boosted trees?

■ Analysis options

■ Stopping parameters

Boosted Trees

■ The idea of boosted trees is to build a sequence of simple trees, weighting them inversely by misclassification.

■ The final classification for deployment is based on voting from these simple trees.
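The weighting-and-voting idea can be illustrated with an AdaBoost-style rule, where a tree with a lower error rate gets a larger say in the vote. This formula is an assumption chosen for illustration; STATISTICA's boosted trees may compute weights differently, and the error rates and predictions below are made up.

```python
import math
from collections import defaultdict

def tree_weight(error_rate):
    """AdaBoost-style weight; error_rate must lie strictly between 0 and 1."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

def weighted_vote(predictions, error_rates):
    """Combine one predicted label per tree into a single weighted vote."""
    totals = defaultdict(float)
    for label, error_rate in zip(predictions, error_rates):
        totals[label] += tree_weight(error_rate)
    return max(totals, key=totals.get)

# Hypothetical: three simple trees score one applicant.
predictions = ["bad", "good", "good"]
error_rates = [0.10, 0.40, 0.45]
print(weighted_vote(predictions, error_rates))
```

Here the single accurate tree outweighs the two weaker trees, so the weighted vote differs from a simple majority.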

Boosted Trees Analysis Options

■ Learning Rate

■ Number of additive terms

■ Random test data proportions

■ Subsample proportions

Boosted Trees Stopping Parameters

■ Minimum number of cases

■ Maximum number of levels

■ Minimum number in child nodes

Random Forests For Classification

■ What is Random Forests classification?

■ Analysis options

Random Forests

■ Random Forests builds a series of trees.

■ Each tree predicts a classification.

Random Forest Analysis Options

■ Number of predictors: the suggested setting is log2(M+1), where M is the total number of predictor variables

■ Number of trees

■ Sampling proportions
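The options above can be sketched as follows: compute the log2(M+1) rule for the number of predictors, draw a random subsample of cases for each tree, and draw a random subset of predictors as split candidates. The function names, the rounding, and the predictor names are hypothetical; this is not STATISTICA's code.

```python
import math
import random

def predictors_per_split(m):
    """Number of candidate predictors under the log2(M+1) rule, at least 1."""
    return max(1, round(math.log2(m + 1)))

def draw_tree_sample(n_cases, proportion, rng):
    """Indices of the cases used to grow one tree (sampled without replacement)."""
    return rng.sample(range(n_cases), int(n_cases * proportion))

def draw_split_candidates(predictors, rng):
    """Random subset of predictors considered at one split."""
    return rng.sample(predictors, predictors_per_split(len(predictors)))

rng = random.Random(0)
predictors = [f"x{i}" for i in range(1, 16)]  # 15 hypothetical predictors
print(predictors_per_split(len(predictors)))
print(len(draw_tree_sample(100, 0.5, rng)))
print(draw_split_candidates(predictors, rng))
```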

Comparing Performance across Models

■ Generating deployment code

■ Rapid Deployment results

Generating Deployment Code

Optional topic

Many Data Mining and Statistics tools in STATISTICA have the ability to generate deployment code.

STATISTICA Visual Basic

C/C++ language

PMML Script

Deployment to STATISTICA Enterprise

Rapid Deployment

Optional topic

■ Load multiple data mining models

■ Make predictions on new data

■ Generate lift and gains charts comparing models

■ Write predictions back to data file

Lift and Gains Charts

Optional topic

■ Lift chart

Shows the effectiveness of the model compared to no model.

■ Gains chart

Shows the percentage of observations correctly classified for the given category, in this case, bad.
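A single point on the gains and lift charts can be computed by ranking cases by model score and looking at the top fraction. In this sketch the scores, labels, and the 20% cutoff are made-up values, and `gains_and_lift` is a hypothetical helper rather than a STATISTICA function.

```python
def gains_and_lift(scores, actual, target="bad", fraction=0.2):
    """Gain and lift for the top `fraction` of cases ranked by descending score.

    gain: share of all target cases captured in the top fraction.
    lift: gain divided by the share expected with no model (the fraction itself).
    """
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    hits_top = sum(1 for _, label in ranked[:cutoff] if label == target)
    hits_all = sum(1 for label in actual if label == target)
    gain = hits_top / hits_all
    lift = gain / (cutoff / len(ranked))
    return gain, lift

# Hypothetical scores (predicted probability of "bad") and observed ratings.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = ["bad", "bad", "good", "good", "bad",
          "good", "good", "good", "good", "good"]
print(gains_and_lift(scores, actual))
```

A lift above 1 means the model finds the target category faster than selecting cases at random.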

Voting Across Models

■ What is voting or bagging?

■ Instability of results in small data sets

■ Reviewing results

What is Voting or Bagging?

■ Data Mining offers a variety of model building tools, so a large number of models can be created in a given project.

■ None of these models will fully capture the underlying relationship of the data.

■ Using an ensemble of models together to determine the final prediction is called voting or bagging.
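Voting across models can be sketched as a simple majority count per case. The predictions below are invented, and ties (possible with an even number of models) are resolved in favor of the label seen first, which is an arbitrary choice made for this illustration.

```python
from collections import Counter

def majority_vote(model_predictions):
    """Combine predictions from several models into one label per case.

    model_predictions maps a model name to its list of predicted labels,
    with every list in the same case order.
    """
    per_case = zip(*model_predictions.values())
    return [Counter(votes).most_common(1)[0][0] for votes in per_case]

# Hypothetical predictions from three models on three applicants.
model_predictions = {
    "C&RT": ["good", "bad", "bad"],
    "CHAID": ["good", "good", "bad"],
    "Random Forest": ["bad", "bad", "bad"],
}
print(majority_vote(model_predictions))
```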

What is Voting or Bagging?

[Diagram: Input Data feeds C&RT, CHAID, Random Forest, and Boosted Trees; their predictions go to a Vote, which gives the Final Prediction.]

Instability of Results in Small Data Sets

■ Data Mining can model complex relationships between variables.

■ Without ample data, instability can be an issue.

■ Using multiple models with voting combats the problem of instability in results. The ensemble of models typically outperforms the individual models.

■ Rapid Deployment gives predictions based on voting. That output can generate other

Q/A

■ You are welcome to email us!

■ service@statsoft.com.tw
