CHAPTER II: SEMIPARAMETRIC BANKRUPTCY PREDICTION METHODS
2.3 A Real Data Example
In this section, a real case-control data set is analyzed using our method SLM and the prediction rules DAM, LLM and KMV. McKee (2003) pointed out that company asset size and industry are significant factors affecting bankruptcy status. Thus an ideal approach is to stratify companies according to industry and asset size and to determine a prediction model for each stratum. Unfortunately, we did not have enough data from the COMPUSTAT and CRSP databases to do so. Thus, to illustrate our method, we simply matched each case with two controls that had the same standard industrial classification (SIC) code and a similar company asset size in the same year.
Because of this matching, company asset size no longer has power to discriminate the bankruptcy status of a company, and it is therefore not included in the analysis of our example.
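The matching step described above can be sketched as follows. This is only an illustration under stated assumptions: the `Company` record, its field names, and the rule of picking the controls closest in total assets are hypothetical, since the text only says that controls share the SIC code and year and have a "similar" asset size. The sketch also does not prevent one control from being reused for several cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Company:
    """Hypothetical record; field names are illustrative, not from the databases."""
    name: str
    sic: int         # standard industrial classification code
    year: int        # fiscal year of the financial data
    assets: float    # total assets, in million US dollars
    bankrupt: bool


def match_controls(case, pool, n_controls=2):
    """Return up to n_controls nonbankrupt firms with the same SIC code and year.

    Assumption: "similar asset size" is read as smallest absolute difference
    in total assets. Fewer than n_controls firms may be returned when the
    pool is incomplete.
    """
    candidates = [
        c for c in pool
        if not c.bankrupt and c.sic == case.sic and c.year == case.year
    ]
    candidates.sort(key=lambda c: abs(c.assets - case.assets))
    return candidates[:n_controls]
```

When fewer than two eligible controls exist, the function simply returns what it finds, which mirrors the two cases in the sample that could be matched with only one control.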
We now introduce the case-control data set. The data set contains 79 companies (cases) that were delisted during the period 1994 to 2002 and classified by COMPUSTAT as Chapter 11 bankruptcy or Chapter 7 liquidation. After identifying these companies, both the COMPUSTAT and CRSP databases were searched to locate the latest annual financial data prior to the delisting date. Thus the annual financial data for the identified bankrupt companies were from the period 1993 to 2001. Each of the 79 selected bankrupt companies was matched with two nonbankrupt companies, except for 2 companies that could be matched with only one nonbankrupt company each, due to the incompleteness of the two databases. Hence our data set also contains 156 nonbankrupt companies (controls). The total number of companies in this research was n = 235. Financial institutions were eliminated from the sample due to the unique capital requirements and regulatory structure in that industry group.
Table 1: The SIC codes of companies in our case-control sample.
SIC category        number of bankrupt companies    number of nonbankrupt companies
1000-1999            4                                8
2000-2999           11                               22
3000-3999           21                               40
4000-4999            5                               10
5000-5999           18                               36
6000-6999            3                                6
7000-7999           13                               26
8000-8999            4                                8
Total companies     79                              156
We note that COMPUSTAT lists 233 companies whose common stocks were traded on the New York Stock Exchange, the American Stock Exchange or NASDAQ, and which were declared bankrupt during the period 1994 to 2002. However, since the COMPUSTAT and CRSP databases contain many missing values for the predictors studied in our example, we found only 79 bankrupt companies with complete predictor values. No additional criteria were imposed on the bankrupt companies in our case-control sample. The problem of missing data is not unusual in applications, especially when many predictive variables are used in the model. As long as the missingness occurs “at random”, it will not introduce systematic biases into our analyses (Little and Rubin, 2002). We have no reason to doubt that the missingness in the COMPUSTAT and CRSP databases is “missing at random”.
The information about the industry and the company asset size of the selected companies is given in Tables 1 and 2, respectively. The two-sample median test was performed to test the null hypothesis of equal magnitude of asset size for nonbankrupt and bankrupt companies. The p-value given in Table 2 shows that there is no significant difference between the asset sizes of the two groups at the 0.05 significance level. This result indicates that our matching process has successfully produced similar asset sizes for the bankrupt and nonbankrupt companies in our case-control sample.
Table 2: Summary statistics of company asset sizes (in million US dollars) from our case-control sample.
            79 bankrupt companies    156 nonbankrupt companies
mean          105.103                  150.508
median         32.211                   33.599
std           290.254                  808.741
min             1.447                    1.636
max          2345.800                 9794.400
median-stat (p-value): 0.092 (0.927)
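A two-sample median test of the kind used for the asset-size comparison can be sketched as below. The text does not state which form of the test statistic was computed, so this sketch assumes one common version: count how many observations of the first sample exceed the pooled median and standardize that count with its hypergeometric mean and variance. The function name and the sign convention are assumptions, and the sketch is not claimed to reproduce the tabulated values.

```python
import math
from statistics import median


def median_test(x, y):
    """Two-sample median test via a normal approximation.

    Returns (z, p), where z is the standardized count of x-values above
    the pooled median and p is the two-sided p-value.
    """
    pooled = list(x) + list(y)
    m = median(pooled)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    above_x = sum(v > m for v in x)      # x-values above the pooled median
    above = sum(v > m for v in pooled)   # all values above the pooled median
    expected = n1 * above / n
    variance = n1 * n2 * above * (n - above) / (n ** 2 * (n - 1))
    if variance == 0:
        return 0.0, 1.0                  # degenerate case: no information
    z = (above_x - expected) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p
```

A positive z means the first sample tends to lie above the pooled median, so the order in which the two groups are passed determines the sign of the statistic.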
For predicting bankruptcy, the values of the 9 variables used by Ohlson (1980) and the 2 variables suggested by Shumway (2001) were collected for our selected companies from COMPUSTAT and CRSP databases. The 11 predictive variables are as follows:
1. TLTA = Total liabilities divided by total assets.
2. WCTA = Working capital divided by total assets.
3. CLCA = Current liabilities divided by current assets.
4. NITA = Net income divided by total assets.
5. FUTL = Funds provided by operations divided by total liabilities.
6. CHIN = $(NI_t - NI_{t-1}) / (|NI_t| + |NI_{t-1}|)$, where $NI_t$ is net income for the most recent period.
7. INTWO = One if net income was negative for the last two years, zero otherwise.
8. OENEG = One if total liabilities exceed total assets, zero otherwise.
9. SIZE = Logarithm of total assets divided by the GNP price-level index. The index assumes a base value of 100 for 1991.
10. Relative Size = Logarithm of each firm’s market equity value divided by the total NYSE / AMEX / NASDAQ market equity value.
11. Excess Return = Monthly return on the firm minus the value-weighted CRSP NYSE / AMEX / NASDAQ index return cumulated to obtain the yearly return.
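The first eight variables in the list above are simple functions of balance-sheet and income items, and can be sketched as follows. The argument names are hypothetical labels for the raw accounting items, not COMPUSTAT field names, and returning 0 for CHIN when both net incomes are zero is an assumption made only to avoid division by zero.

```python
def ohlson_variables(*, total_liabilities, total_assets, working_capital,
                     current_liabilities, current_assets, net_income,
                     net_income_prev, funds_from_operations):
    """Compute TLTA, WCTA, CLCA, NITA, FUTL, CHIN, INTWO and OENEG."""
    chin_denominator = abs(net_income) + abs(net_income_prev)
    return {
        "TLTA": total_liabilities / total_assets,
        "WCTA": working_capital / total_assets,
        "CLCA": current_liabilities / current_assets,
        "NITA": net_income / total_assets,
        "FUTL": funds_from_operations / total_liabilities,
        # CHIN lies in [-1, 1]; the zero-denominator value is an assumption.
        "CHIN": ((net_income - net_income_prev) / chin_denominator
                 if chin_denominator else 0.0),
        "INTWO": 1 if net_income < 0 and net_income_prev < 0 else 0,
        "OENEG": 1 if total_liabilities > total_assets else 0,
    }
```

For example, a firm with net income of -5 this year and -15 last year has CHIN = 10 / 20 = 0.5 and INTWO = 1.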
Table 3: Summary statistics of variables in our case-control sample.
variable          mean     median    std      min      max      median-stat (p-value)
79 bankrupt companies
TLTA              0.801    0.747     0.435    0.020    2.450    -5.432 (0.000)
WCTA              0.040    0.075     0.387   -1.192    0.980     4.511 (0.000)
CLCA              1.711    0.857     3.545    0.020   23.214    -4.603 (0.000)
NITA             -0.423   -0.161     0.649   -2.833    0.182     5.891 (0.000)
FUTL             -0.335   -0.051     0.921   -4.953    1.279     5.339 (0.000)
CHIN             -0.251   -0.363     0.655   -1.000    1.000     3.130 (0.002)
INTWO             0.570    1         0.498    0        1        -4.612 (0.000)
OENEG             0.190    0         0.395    0        1        -3.844 (0.000)
Excess Return    -0.254   -0.634     1.258   -1.320    6.617     3.682 (0.000)
Relative Size    -5.803   -5.830     0.675   -7.379   -4.577     4.234 (0.000)
πKMV              0.413    0.331     0.383    0.000    1.000    -6.537 (0.000)
156 nonbankrupt companies
TLTA              0.486    0.478     0.273    0.029    1.926
WCTA              0.276    0.291     0.258   -0.592    0.921
CLCA              0.707    0.509     0.796    0.055    6.904
NITA             -0.079    0.024     0.386   -3.800    0.249
FUTL             -0.030    0.110     0.715   -3.387    2.544
CHIN             -0.015    0.052     0.573   -1.000    1.000
INTWO             0.263    0         0.442    0        1
OENEG             0.038    0         0.193    0        1
Excess Return    -0.131   -0.289     0.631   -1.246    2.503
Relative Size    -5.284   -5.320     0.659   -6.838   -2.821
πKMV              0.114    0.001     0.241    0.000    0.989
Note that Ohlson (1980) suggested using the first 9 variables as predictive variables. In this dissertation, however, we used only the first 8 variables as predictive variables in our case-control data analysis. The 9th variable, SIZE, was not used because total assets had already been used as the matching factor in selecting the case-control sample. The last 2 variables are the market-driven variables used in Shumway (2001).
Pairwise scatter diagrams of our case-control sample on the continuous variables are presented in Figure 1. From the figure, it is clear that the distributions of these variables are fat-tailed and skewed, and it is very difficult to perform bankruptcy prediction visually, since most data points are clustered together.
The summary statistics of the 10 predictive variables considered in our case-control data analysis are presented in Table 3. For each of these 10 variables, the two-sample median test was performed to test the null hypothesis of equal magnitude for nonbankrupt and bankrupt companies. The p-values in Table 3 show that the null hypothesis of equal magnitude for cases and controls is rejected at the 0.05 level for each predictive variable. This result suggests that each of these variables should be an effective predictive variable. In addition, the summary statistics and the frequency distribution of the values of πKMV for the selected companies in our case-control data analysis are shown in Table 3 and Figure 2, respectively. These results also indicate that πKMV has good predictive power.
Given our case-control sample, the bankruptcy prediction rules associated with DAM, LLM, KMV and SLM were estimated. Their performance was measured by the out-of-sample error rate, which was computed on each of 100 testing samples randomly selected from the given case-control sample. Each testing sample was composed of 50% of the bankrupt companies and their matched nonbankrupt companies. The data not included in the testing sample were taken as the training sample and were used to develop the prediction rule.
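The resampling scheme just described keeps each bankrupt company together with its matched controls, so the split is made over matched sets rather than over individual companies. A minimal sketch follows; the function names and the representation of a matched set as a `(case, controls)` pair are hypothetical, and rounding 50% of 79 cases up to 40 is an assumption, since the text does not state the rounding rule.

```python
import math
import random


def split_matched_sets(matched_sets, test_fraction=0.5, seed=None):
    """Randomly assign whole matched sets to the testing or training sample."""
    rng = random.Random(seed)
    n_test = math.ceil(test_fraction * len(matched_sets))  # assumed rounding
    test_idx = set(rng.sample(range(len(matched_sets)), n_test))
    test = [s for i, s in enumerate(matched_sets) if i in test_idx]
    train = [s for i, s in enumerate(matched_sets) if i not in test_idx]
    return train, test


def repeated_splits(matched_sets, n_repeats=100, seed=0):
    """Draw n_repeats independent training/testing splits."""
    return [split_matched_sets(matched_sets, seed=seed + r)
            for r in range(n_repeats)]
```

Splitting by matched set guarantees that a control never appears in the training sample while its case sits in the testing sample, or vice versa.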
Under SLM, the kernel function $K$ was taken as the Epanechnikov kernel $K(u) = \frac{3}{4}(1 - u^2)\, I(|u| \le 1)$. To compute the out-of-sample error rate for the prediction rule based on SLM on each testing sample, the procedure given in Remark 2.2 for computing the in-sample total error rate $\tau_{in}(p, b_\theta, b_H) = \alpha_{in}(p, b_\theta, b_H) + \beta_{in}(p, b_\theta, b_H)$ on the training sample was applied to choose the values of $(p, b_\theta, b_H)$. We computed $\tau_{in}(p, b_\theta, b_H)$ on the equally spaced logarithmic grid of $1001 \times 501 \times 501$ values of $(p, b_\theta, b_H)$ in $[0, 1] \times [1/10, 15] \times [1/10, 15]$. Given each value of $u \in [0, 1]$, the global minimizer $\{\hat{p}(u), \hat{b}_\theta(u), \hat{b}_H(u)\}$ of $\tau_{in}(p, b_\theta, b_H)$ over the grid points, subject to the restrictions $\alpha_{in}(p, b_\theta, b_H) \le u$ and $b_H > b_\theta$, was taken as the selected value of $(p, b_\theta, b_H)$.
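The kernel and the constrained grid search can be sketched as follows. The function names are hypothetical, and `error_rates` stands in for the procedure of Remark 2.2, which is not reproduced here; it is assumed to return the in-sample pair (type I, type II) for a given parameter triple. The logarithmic grid helper is shown for the two bandwidths only, because a logarithmic spacing cannot start at p = 0 and the text does not say how that endpoint was handled.

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel K(u) = (3/4)(1 - u^2) on |u| <= 1, zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def log_grid(lo, hi, n):
    """n points from lo to hi, equally spaced on the logarithmic scale."""
    step = (math.log(hi) - math.log(lo)) / (n - 1)
    return [math.exp(math.log(lo) + i * step) for i in range(n)]


def constrained_minimizer(grid, error_rates, u):
    """Minimize alpha + beta over grid points with alpha <= u and b_H > b_theta.

    grid yields triples (p, b_theta, b_H); error_rates is a stand-in callable
    returning (alpha_in, beta_in). Returns (best_point, best_total), or
    (None, inf) when no grid point satisfies the restrictions.
    """
    best, best_tau = None, math.inf
    for p, b_theta, b_h in grid:
        if b_h <= b_theta:
            continue
        alpha, beta = error_rates(p, b_theta, b_h)
        if alpha > u:
            continue
        tau = alpha + beta
        if tau < best_tau:
            best, best_tau = (p, b_theta, b_h), tau
    return best, best_tau
```

Because the full grid has roughly 251 million points, the search takes any iterable of triples, so the grid can be generated lazily (for instance with `itertools.product`) instead of being stored in memory.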
Figure 1: Pairwise scatter diagrams of our case-control sample. Given the values of Shumway’s 2 market-driven variables, Excess Return and Relative Size, and Ohlson’s 6 continuous variables in our case-control sample, their pairwise scatter diagrams are presented. Each panel plots 156 nonbankrupt companies (pluses) and 79 bankrupt companies (stars) selected from the COMPUSTAT and CRSP databases.
Figure 2: The frequency histogram of the values of πKMV in our case-control sample. The frequency histogram of the values of πKMV for the 156 nonbankrupt companies and that for the 79 bankrupt companies in our case-control sample are plotted in the left and the right panels, respectively.
Using the selected values of $\{\hat{p}(u), \hat{b}_\theta(u), \hat{b}_H(u)\}$ and the training sample, the values $\hat{H}(x_j)$ and $\hat{\theta}$ were computed for each data point $(x_j, z_j)$ in the testing sample. The company with the predictor values $(x_j, z_j)$ in the testing sample was classified as a bankrupt company if
$$\hat{\psi}_j = \frac{\exp\{\hat{H}(x_j) + \hat{\theta} z_j\}}{1 + \exp\{\hat{H}(x_j) + \hat{\theta} z_j\}} > \hat{p}(u),$$
and as a healthy company otherwise. After the classification procedure was completed for each company in the testing sample, the out-of-sample error rates
$$\alpha_{SLM}(u) = \frac{\sum_{j:\,(x_j, z_j)\ \text{in testing sample}} Y_j\, I\{\hat{\psi}_j \le \hat{p}(u)\}}{\sum_{j:\,(x_j, z_j)\ \text{in testing sample}} Y_j},$$
$$\beta_{SLM}(u) = \frac{\sum_{j:\,(x_j, z_j)\ \text{in testing sample}} (1 - Y_j)\, I\{\hat{\psi}_j > \hat{p}(u)\}}{\sum_{j:\,(x_j, z_j)\ \text{in testing sample}} (1 - Y_j)},$$
$$\tau_{SLM}(u) = \alpha_{SLM}(u) + \beta_{SLM}(u),$$
of the bankruptcy prediction rule based on SLM were computed for each given value of $u$. For the given value of $u$, $\alpha_{SLM}(u)$ is the out-of-sample type I error rate of classifying bankrupt companies as healthy ones, and $\beta_{SLM}(u)$ is the out-of-sample type II error rate of classifying healthy companies as bankrupt ones in the testing sample. After the computational procedure was completed for each testing sample, the average of each out-of-sample error rate over the 100 testing samples was computed.
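The classification rule and the three out-of-sample error rates defined above can be sketched directly from the formulas. The function names are hypothetical, and the sketch takes the fitted linear predictor (the estimated H at x_j plus the estimated theta times z_j) as a single number per company, since estimating those quantities is outside its scope.

```python
import math


def logistic(eta):
    """exp(eta) / (1 + exp(eta)), written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def out_of_sample_error_rates(y, psi_hat, p_hat):
    """Return (alpha, beta, tau) for labels y (1 = bankrupt, 0 = healthy).

    A company is classified as bankrupt when its psi_hat exceeds p_hat.
    alpha: share of bankrupt companies classified as healthy (type I).
    beta:  share of healthy companies classified as bankrupt (type II).
    """
    n_bankrupt = sum(y)
    n_healthy = len(y) - n_bankrupt
    missed = sum(1 for yj, pj in zip(y, psi_hat) if yj == 1 and pj <= p_hat)
    false_alarms = sum(1 for yj, pj in zip(y, psi_hat) if yj == 0 and pj > p_hat)
    alpha = missed / n_bankrupt
    beta = false_alarms / n_healthy
    return alpha, beta, alpha + beta
```

Note that the total error rate is the sum of two conditional rates, so it ranges over [0, 2] rather than [0, 1]; averaging over the 100 testing samples amounts to calling this function once per sample and taking the mean of each component.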
Figure 3: The out-of-sample error rates obtained by applying KMV, DAM, LLM, and SLM to our case-control sample. Panels (a)-(c) show three out-of-sample error rates of the prediction methods derived from one testing sample. Panels (d)-(f) show sample averages of the three out-of-sample error rates over the 100 testing samples. Each testing sample was composed of 50% of bankrupt companies and their matched nonbankrupt companies in our case-control sample.
The same computational procedures were applied to the prediction rules based on DAM, LLM and KMV. Let $\{\alpha_{DAM}(u), \beta_{DAM}(u), \tau_{DAM}(u)\}$, $\{\alpha_{LLM}(u), \beta_{LLM}(u), \tau_{LLM}(u)\}$ and $\{\alpha_{KMV}(u), \beta_{KMV}(u), \tau_{KMV}(u)\}$ be similarly defined as the out-of-sample error rates for DAM, LLM and KMV. The prediction results obtained by applying the four discussed bankruptcy prediction rules to our case-control data are shown in Figure 3 and Table 4.
Figure 3 presents the three out-of-sample error rates for the four prediction models under one testing sample, together with their averages over the one hundred testing samples. These error rates were derived under the constraint that the type I error rate was at most u. If no such constraint is required, we simply take u = 1, and the related out-of-sample error rates are given in Table 4. For the case of u = 1, both SLM and KMV give smaller out-of-sample type I error rates than DAM and LLM. Nevertheless, KMV has the largest out-of-sample type II error rate among the four competing prediction rules. DAM and LLM show rather similar behavior in the sense of having almost the same averaged out-of-sample type I and type II error rates. In terms of the total error rate, SLM has the best overall performance. Thus it is fair to say that, by a reasonable margin, the most accurate model listed in Table 4 is SLM.
Table 4: Numerical results of the out-of-sample error rates obtained by applying KMV, DAM, LLM, and SLM to our case-control sample. Given the value of u = 1, the values of the three out-of-sample error rates shown in (a)-(c) of Figure 3 are presented, and those shown in (d)-(f) of Figure 3 are given in parentheses.
                      KMV              DAM              LLM              SLM
type I error rate     0.250 (0.253)    0.375 (0.290)    0.350 (0.296)    0.200 (0.202)
type II error rate    0.405 (0.328)    0.241 (0.278)    0.228 (0.287)    0.291 (0.321)
total error rate      0.655 (0.581)    0.616 (0.568)    0.578 (0.583)    0.491 (0.523)
Figure 3 leads to conclusions similar to those drawn from Table 4. For u ≤ 0.2, KMV has the smallest averaged out-of-sample type I error rate. However, it also has the largest averaged type II error rate in this range.
For u > 0.2, KMV has an averaged type II error rate similar to that of SLM but a larger type I error rate than SLM. For u ∈ [0, 1], DAM and LLM show very similar performance.
However, when the four prediction rules are compared on the averaged out-of-sample total error rate, Figure 3 shows that SLM has the best overall performance.
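As a quick arithmetic check on the u = 1 results, the reported total error rates in Table 4 can be compared with the sum of the type I and type II rates, and the four rules can be ranked by total error rate. The sketch below simply transcribes the values from Table 4; the dictionary layout and function names are illustrative choices.

```python
# (type I, type II, total) at u = 1, transcribed from Table 4.
TABLE_4 = {
    "KMV": {"one_sample": (0.250, 0.405, 0.655), "average": (0.253, 0.328, 0.581)},
    "DAM": {"one_sample": (0.375, 0.241, 0.616), "average": (0.290, 0.278, 0.568)},
    "LLM": {"one_sample": (0.350, 0.228, 0.578), "average": (0.296, 0.287, 0.583)},
    "SLM": {"one_sample": (0.200, 0.291, 0.491), "average": (0.202, 0.321, 0.523)},
}


def totals_are_consistent(table, tol=5e-4):
    """True when every reported total equals type I + type II up to rounding."""
    return all(
        abs(alpha + beta - tau) <= tol
        for rates in table.values()
        for alpha, beta, tau in rates.values()
    )


def rank_by_total(table, key):
    """Method names ordered from smallest to largest total error rate."""
    return sorted(table, key=lambda method: table[method][key][2])
```

With these values, SLM has the smallest total error rate both for the single testing sample (0.491) and for the average over the 100 testing samples (0.523).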