A semiparametric method for predicting bankruptcy

RUEY-CHING HWANG,1,2,* K. F. CHENG3,4 AND JACK C. LEE5

1 Department of Management Science, National Chiao Tung University, Hsinchu, Taiwan

2 Department of Finance and Banking, Dahan Institute of Technology, Hualien, Taiwan

3 Institute of Statistics, National Central University, Jhongli, Taiwan

4 Biostatistical Center, China Medical University, Taichung, Taiwan

5 Graduate Institute of Finance, National Chiao Tung University, Hsinchu, Taiwan

ABSTRACT

Bankruptcy prediction methods based on a semiparametric logit model are proposed for simple random (prospective) and case–control (choice-based; retrospective) data. The unknown parameters and prediction probabilities in the model are estimated by the local likelihood approach, and the resulting estimators are analyzed through their asymptotic biases and variances. The semiparametric bankruptcy prediction methods using these two types of data are shown to be essentially equivalent. Thus our proposed prediction model can be directly applied to data sampled from the two important designs. One real data example and simulations confirm that our prediction method is more powerful than alternatives, in the sense of yielding smaller out-of-sample error rates. Copyright © 2007 John Wiley & Sons, Ltd.

KEY WORDS linear logit model; out-of-sample error rate; semiparametric logit model

INTRODUCTION

Academics, practitioners and regulators have routinely used models to predict the bankruptcy of companies. For example, the discriminant analysis model (DAM) has been a popular technique for studying the financial health of a company; see Altman (1968). Other frequently cited models include those by Ohlson (1980) and Zmijewski (1984). The former bankruptcy prediction method is based on a linear logit model (LLM). The latter, on the other hand, is based on a probit model. Grice and Dugan (2001) recently cautioned against the routine application of these two probabilistic models of bankruptcy. Their study showed that applying prediction models to time periods and industries other than those used to develop the models may result in a significant decline in prediction accuracy.

(www.interscience.wiley.com) DOI: 10.1002/for.1027

* Correspondence to: Ruey-Ching Hwang, Department of Finance and Banking, Dahan Institute of Technology, 1 Su-Ren Street, Hualien 97145, Taiwan.

Bankruptcy prediction methods using other models or concepts include, for example, the recursive partition model (Frydman et al., 1985), expert systems (Messier and Hansen, 1988), chaos theory (Lindsay and Campbell, 1996), neural networks (Koh and Tan, 1999), survival analysis (Lane et al., 1986; Shumway, 2001; Chava and Jarrow, 2004), rough set theory (McKee, 2003), the KMV–Merton model (KMV; Bharath and Shumway, 2004; Vassalou and Xing, 2004) and support vector machines (Härdle et al., 2006).

Ohlson (1980) stated that the main reason for using the LLM is its simplicity in computation and interpretation. There are many software packages with logistic regression capabilities, for example, BMDP, EGRET, GLIM and SAS. Thus the LLM can be easily updated or revised as long as there are new observations of the same predictors or new predictive variables available for analysis. Another interesting reason, not stated in Ohlson (1980), for using the LLM is that the approach gives great flexibility in using different sampling schemes for collecting the observations used to build prediction models. Farewell (1979) and Prentice and Pyke (1979) justified that under the LLM one can analyze case–control (choice-based; retrospective) observations as if they were sampled from a simple random sampling scheme (prospective observations). The observations in Ohlson's empirical study were case–control observations. His prediction model was estimated as if the observations were prospective observations (but without any justification).

The case–control data for bankruptcy prediction are composed of two simple random samples. One is selected from the population of bankrupt companies, and called the case sample. The other is selected from the population of nonbankrupt companies, and called the control sample. An important special case of the case–control study is the stratified (matched) case–control study. In the latter study, the numbers of cases (bankrupt companies) and controls (nonbankrupt companies) are matched according to some stratifying variables. In most matched case–control designs, each case is matched with one to five controls per stratum. For a detailed introduction to the LLM and the (matched) case–control data, see, for example, the monograph by Hosmer and Lemeshow (1989).

Applying case–control data to the LLM, Prentice and Pyke (1979) showed that the resulting estimators of the coefficients, with the exception of the intercept estimator, converge to their true values as the sample sizes of both control and case data become large. Due to the incorrect estimation of the intercept, the 'estimated' bankruptcy probability, obtained by plugging all these estimates of coefficients into the LLM, does not converge to the true bankruptcy probability. However, since the logistic distribution is a strictly monotone increasing function, such an estimated bankruptcy probability can still be employed to develop a prediction device, as the LLM does under the prospective data. For further discussion of these facts, see the next section.

Ohlson's bankruptcy prediction model postulates that the logit function of bankruptcy probability is a linear function of the predictors. Nine predictors were selected for developing his model because they appeared to be the ones most frequently mentioned in the literature. One potential pitfall of this model is that it assumes a linear relation between the predictors and the logit function of bankruptcy probability. This approach in general is not robust with respect to the misspecification of the linear relation. See the discussion and Figure 2 of Härdle et al. (2006). Their results show that the relation between bankruptcy probability and predictors, such as net income change and company size, may not be monotonic.

The main focus of this paper is to consider a method that is robust against misspecification of the relation between the predictors and the logit of bankruptcy probability, by introducing a semiparametric logit model (SLM; Zhao et al., 1996) for predicting bankruptcy. This model is basically very similar to the LLM, except that some unspecified function H(·) replaces the linear function to model the relation between the predictors and the logit of bankruptcy probability. Thus, clearly, the SLM is much more general and flexible in predicting the bankruptcy of a firm. Härdle et al. (2006) also propose a flexible but fully nonparametric approach for predicting bankruptcy. They use support vector machines to generate a nonlinear score function of predictors, and then employ a nonparametric technique to map scores into bankruptcy probabilities. Their work presents a new trend in bankruptcy analysis.

In LLM, there are a finite number of coefficients and thus the usual likelihood method can be applied to estimate the coefficients. In contrast, conceptually, there are infinitely many parameters in the unspecified function H(·) in SLM. We shall use the local likelihood method (Tibshirani and Hastie, 1987; Wand and Jones, 1995) to estimate the unknown function H(·) and hence develop a bankruptcy prediction method. Specifically, given the SLM and any particular predictor value x0, the quantity H(x0) can be estimated using weighted linear logistic regression with weights determined by the local neighborhood of x0. Thus the application of the SLM is rather straightforward and efficient.
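As a rough illustration of this idea, the following Python sketch fits a weighted linear logistic regression around a point x0 and reads off the local intercept as the estimate of H(x0). It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a single continuous predictor and a single binary predictor, gives weight 1 to observations within a window of half-width `bandwidth` around x0 and weight 0 elsewhere, and uses plain Newton iterations. The function names (`solve_linear`, `local_logit_fit`, `simulate`), the simulated data and the "true" function H(x) = x² − 1 are all hypothetical.

```python
import math
import random


def solve_linear(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def local_logit_fit(x0, xs, zs, ds, bandwidth, iters=25):
    """Return (a, b, theta) maximizing the local log-likelihood around x0.

    a estimates H(x0), b estimates the slope of H at x0, and theta is the
    coefficient of the binary predictor z.
    """
    # Local sample: design rows (1, x_i - x0, z_i) for points inside the window.
    rows = [((1.0, x - x0, z), d)
            for x, z, d in zip(xs, zs, ds) if abs(x - x0) <= bandwidth]
    coef = [0.0, 0.0, 0.0]
    for _ in range(iters):  # Newton-Raphson steps
        grad = [0.0] * 3
        hess = [[0.0] * 3 for _ in range(3)]  # information matrix
        for u, d in rows:
            score = sum(c * v for c, v in zip(coef, u))
            p = 1.0 / (1.0 + math.exp(-score))
            for j in range(3):
                grad[j] += (d - p) * u[j]
                for k in range(3):
                    hess[j][k] += p * (1.0 - p) * u[j] * u[k]
        step = solve_linear(hess, grad)
        coef = [c + s for c, s in zip(coef, step)]
    return tuple(coef)


def simulate(n, seed=0):
    """Hypothetical data with logit = H(x) + theta*z, H(x) = x**2 - 1, theta = 1."""
    rng = random.Random(seed)
    xs, zs, ds = [], [], []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        z = rng.randint(0, 1)
        p = 1.0 / (1.0 + math.exp(-(x * x - 1.0 + z)))
        xs.append(x)
        zs.append(z)
        ds.append(1 if rng.random() < p else 0)
    return xs, zs, ds


xs, zs, ds = simulate(6000)
for x0 in (0.0, 1.5):
    a, b, theta = local_logit_fit(x0, xs, zs, ds, bandwidth=0.5)
    print(f"x0={x0}: H_hat={a:.2f}, true H={x0 * x0 - 1.0:.2f}, theta_hat={theta:.2f}")
```

Because the simulated H is not monotone, a single global linear logit fit could not recover it, whereas the local fits track it point by point. The sketch does not guard against a local window in which bankrupt and healthy firms are perfectly separated; a practical implementation would need to handle that case.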

We may consider SLM as an extension of LLM. This is a useful extension, since in many practical applications it is not easy to confirm the relation between the logit and a particular set of predictors. Usually, a linear function only serves as an approximation and it may not be good enough for the true model. It is possible, for a particular set of predictors, to add polynomial terms in the linear function to achieve the same superior fit that the SLM provides. However, this polynomial model may not be suitable for other sets of predictive variables. Further, we require a large dataset and powerful lack-of-fit tests to confirm the polynomial model to be appropriate. In contrast, our SLM provides a useful alternative approach. Using this approach, we do not need strategies for building the relation between the logit and the predictors, and it is quite simple to apply.

The remainder of the paper is organized as follows. In the next section, our methodology for predicting bankruptcy based on SLM is developed using concepts similar to those based on LLM. The third section describes one real case–control dataset and provides some summary statistics. The summary statistics show that the predictors under consideration have reasonable power in discriminating the bankruptcy status of the company. The real dataset is analyzed using SLM and the alternative methods such as DAM, LLM and KMV. The prediction ability of each method is described by the out-of-sample error rate. The results about out-of-sample error rates are summarized in the third section. Furthermore, the out-of-sample error rate results from some Monte Carlo simulations are given in the fourth section. From comparisons of these results, we can confirm that SLM has the best performance. The fifth section contains concluding remarks and future research topics. Our theoretical results are presented in Appendix A. Finally, sketches of the proofs are given in Appendix B.

METHODOLOGY

In this section we describe the formulation of SLM, develop the methodology for estimating its unknown quantities, and build a bankruptcy prediction method based on SLM for both simple random and case–control data.

Linear logit model versus semiparametric logit model

Most bankruptcy prediction methods were developed on training samples. Usually, the training sample consists of the data of $n$ companies collected for some time period by a simple random sampling scheme. For the $i$th company, $i = 1, \ldots, n$, we observe values $(D_i, x_i, z_i)$, where $D_i = 1$ indicates that the $i$th company is in the state of bankruptcy and $D_i = 0$ otherwise, and $x_i = (x_{i1}, \ldots, x_{id})^T$ and $z_i = (z_{i1}, \ldots, z_{iq})^T$ are values of the vectors of explanatory variables used to forecast failure. Here we have used $X$ and $Z$, respectively, to represent the continuous and discrete variables, and employed the superscript $T$ to stand for the transpose of a vector. For example, in Ohlson (1980), nine financial variables were used for developing his bankruptcy prediction model. Among these explanatory variables, seven ($= d$) are continuous variables and two ($= q$) are binary variables.

Given the prospective training sample and the values $(x, z)$ of the predictors $(X, Z)$, the LLM is defined by assuming the bankruptcy probability to be

$$p(D = 1 \mid X = x, Z = z) = \frac{\exp(\eta + \varsigma x + \theta z)}{1 + \exp(\eta + \varsigma x + \theta z)},$$

or written in the form of the logit function of bankruptcy probability:

$$\operatorname{logit}\{p(D = 1 \mid X = x, Z = z)\} = \log\left\{\frac{p(D = 1 \mid X = x, Z = z)}{1 - p(D = 1 \mid X = x, Z = z)}\right\} = \eta + \varsigma x + \theta z.$$

Here $\eta$, $\varsigma$ and $\theta$ are $1 \times 1$, $1 \times d$ and $1 \times q$ vectors of logistic parameters, respectively. For the company with predictor values $(x_0, z_0)$, its predicted bankruptcy probability

$$\hat{p}(D = 1 \mid X = x_0, Z = z_0) = \frac{\exp(\hat{\eta} + \hat{\varsigma} x_0 + \hat{\theta} z_0)}{1 + \exp(\hat{\eta} + \hat{\varsigma} x_0 + \hat{\theta} z_0)}$$

is the logistic distribution evaluated at the predicted score $\hat{\eta} + \hat{\varsigma} x_0 + \hat{\theta} z_0$, where $\hat{\eta}$, $\hat{\varsigma}$ and $\hat{\theta}$ are the maximum likelihood estimates based on the prospective training sample $\{(D_i, x_i, z_i), i = 1, \ldots, n\}$.

The main advantage of the LLM lies in its simplicity of computation and interpretation, but the model may not be efficient for the purpose of prediction. Sometimes, based on previous experience, there are reasons for modeling the logit function of bankruptcy probability as a particular function of $(X, Z)$, which may not be linear. However, there is a general drawback to such parametric modeling. If one chooses a parametric family that is not of appropriate form, at least approximately, then the resulting model-based bankruptcy probability prediction might not correctly estimate the true bankruptcy probability, and there is a danger of reaching an erroneous prediction.

The limitation of LLM can be overcome by removing the restriction that the logit function of $p(D = 1 \mid X = x, Z = z)$ belongs to a parametric family. This approach may lead to the following SLM:

$$p(D = 1 \mid X = x, Z = z) = \frac{\exp\{H(x) + \theta z\}}{1 + \exp\{H(x) + \theta z\}}. \tag{1}$$

Here, we only assume $H(x)$ to be a smooth function of the value $x$ of the continuous predictor $X$; otherwise, it is not specified. Clearly, this is a very flexible prediction model. For the company with predictor values $(x_0, z_0)$, its predicted probability of bankruptcy is thus defined as

$$\hat{p}(D = 1 \mid X = x_0, Z = z_0) = \frac{\exp\{\hat{H}(x_0) + \hat{\theta} z_0\}}{1 + \exp\{\hat{H}(x_0) + \hat{\theta} z_0\}}, \tag{2}$$

the logistic distribution evaluated at the predicted score $\hat{H}(x_0) + \hat{\theta} z_0$. Here $\hat{H}(x_0)$ and $\hat{\theta}$ are estimates derived from the prospective training sample by the local likelihood method.

There exist many methods for estimating $H(x_0)$, for $x_0 = (x_{01}, \ldots, x_{0d})^T$. One of these methods with a simple idea is the local likelihood method (see Tibshirani and Hastie, 1987). This approach is to first choose a positive scalar constant $b$ and define a neighborhood of $x_0$ as $N(x_0; b) = \{t = (t_1, \ldots, t_d)^T : |t_j - x_{0j}| \le b, \text{ for each } j = 1, \ldots, d\}$. Then the idea of the local likelihood method is to apply both concepts of the likelihood method using the partial sample $S(x_0; b) = \{(D_i, x_i, z_i) : x_i \in N(x_0; b)\}$, and the first-order Taylor approximation:

$$H(x_i) \approx H(x_0) + H^{(1)}(x_0)^T (x_i - x_0) \equiv \alpha + \beta (x_i - x_0),$$

for $x_i \in N(x_0; b)$. Here the parameters $\alpha$ and $\beta$ stand for the unknown quantities $H(x_0)$ and $H^{(1)}(x_0)^T$, respectively, and $H^{(1)}(x_0)$ is the column vector of partial derivatives of $H(x_0)$. Specifically, we find the maximizer $(\hat{\alpha}, \hat{\beta}, \hat{\theta})$ of the 'local' log-likelihood function

$$\ell = -\sum_{i:\, x_i \in N(x_0; b)} \log\left[1 + \exp\{\alpha + \beta(x_i - x_0) + \theta z_i\}\right] + \sum_{i:\, x_i \in N(x_0; b)} D_i \{\alpha + \beta(x_i - x_0) + \theta z_i\}, \tag{3}$$

use $\hat{H}(x_0) = \hat{\alpha}$ to estimate $H(x_0)$, and combine estimates $\hat{H}(x_0)$ and $\hat{\theta}$ to develop a prediction of bankruptcy probability (2).

Note that the concept of local inference is well established in regression analysis; see also Wand and Jones (1995). There are two major strategies considered in the local likelihood approach: using linear approximation (the first-order Taylor approximation) for each $H(x_i)$ with $x_i \in N(x_0; b)$, and using the partial (local) sample $S(x_0; b)$ to derive the maximum local likelihood estimates. This method is directly analogous to the LLM, except that here we have used the concept of local fitting.

We now determine a value $p^* \in [0, 1]$ to make a bankruptcy prediction for the company with predictor values $(x_0, z_0)$. That is, if its predicted bankruptcy probability $\hat{p}(D = 1 \mid X = x_0, Z = z_0)$ derived from (2) and (3) satisfies

$$\hat{p}(D = 1 \mid X = x_0, Z = z_0) > p^*,$$

then the company is classified to be in the status of bankruptcy; otherwise it is classified as a healthy company. To decide a proper cut-off point $p^*$, usually one would use the training sample to evaluate the performance of the classification scheme. In doing so, two types of 'in-sample' error rate occur in this evaluation based on the training sample:

$$\text{type I error rate: } \alpha_{in}(p) = \frac{\sum_{i=1}^{n} D_i\, I\{\hat{p}(D = 1 \mid X = x_i, Z = z_i) \le p\}}{\sum_{i=1}^{n} D_i},$$

$$\text{type II error rate: } \beta_{in}(p) = \frac{\sum_{i=1}^{n} (1 - D_i)\, I\{\hat{p}(D = 1 \mid X = x_i, Z = z_i) > p\}}{\sum_{i=1}^{n} (1 - D_i)}.$$

Here $I(\cdot)$ stands for the indicator function.

Using the training sample and the cut-off point $p$, $\alpha_{in}(p)$ is the rate of misclassifying a bankrupt company as a healthy company, and $\beta_{in}(p)$ is the rate of misclassifying a healthy company as a bankrupt company. To keep these error rates as small as possible, we might determine a proper cut-off point $p^*$ such that

$$\tau_{in}(p^*) = \alpha_{in}(p^*) + \beta_{in}(p^*) = \min_{p \in [0, 1],\ \alpha_{in}(p) \le u} \{\alpha_{in}(p) + \beta_{in}(p)\}.$$

That is, the in-sample type I error rate $\alpha_{in}(p)$ is controlled to be at most $u$, and subject to this constraint the sum of the two in-sample error rates $\tau_{in}(p) = \alpha_{in}(p) + \beta_{in}(p)$ is minimal. This is essential if the type I error would cause much more severe losses to the investors. On the other hand, if classifying healthy firms as being bankrupt would cause more severe losses to the investors, we might control the in-sample type II error rate $\beta_{in}(p)$ to be at most $u$. In practice, the value of $u \in [0, 1]$ is determined by the investor.

If $u = 1$, then there is no restriction on the magnitude of in-sample type I and II error rates (Altman, 1968; Ohlson, 1980; Begley et al., 1996). Since the value of $p^*$ depends on that of $u$, it is also denoted by $p^*(u)$.

More general estimates of bankruptcy probability

Suppose we define $K(u)$ to be the uniform probability density function over $[-1, 1]$; then the local log-likelihood function (3) can be expressed in a different form:

$$\ell = -\sum_{i=1}^{n} \log\left[1 + \exp\{\alpha + \beta(x_i - x_0) + \theta z_i\}\right] K_b(x_i - x_0) + \sum_{i=1}^{n} D_i \{\alpha + \beta(x_i - x_0) + \theta z_i\} K_b(x_i - x_0), \tag{4}$$

where $K_b(x_i - x_0) = \prod_{j=1}^{d} K\{(x_{ij} - x_{0j})/b\}$. From this expression we see that the local log-likelihood function (4) can be considered as a weighted log-likelihood function which gives weight 1 to the data inside the neighborhood sample $S(x_0; b)$ and weight 0 outside. Conceptually, a different weighting scheme can also be employed for defining a different weighted log-likelihood function, so that larger weights are given to data points with $X$ values closer to $x_0$ and smaller weights to those with $X$ values far from $x_0$. This can be achieved by introducing a unimodal probability density function that is symmetric about 0 to replace the uniform density for $K(\cdot)$ in (4). However, the results from the literature show that the choice of the density function $K(\cdot)$, also called the kernel function, is not very important in the local fitting. A popular choice of $K(\cdot)$ is the Epanechnikov kernel defined as $K(u) = (3/4)(1 - u^2) I(|u| \le 1)$ (see Wand and Jones, 1995), due to its computational convenience and optimal performance (for example, it minimizes mean square error among all non-negative kernel functions).

Refined estimates of bankruptcy probability

It was pointed out in Wand and Jones (1995) that the estimator of bankruptcy probability produced from (2) and (4) is consistent under very general conditions. However, our theoretical results in Appendix A show that more refined estimators of $H(x_0)$ and $\theta$, in terms of smaller asymptotic mean square error, can be achieved. The computation procedure of the refined estimates includes the following three steps.

Step 1: For each $x_i$, $i = 1, \ldots, n$, use the local log-likelihood function (4) with the Epanechnikov kernel to compute the estimate $\hat{H}(x_i)$.

Step 2: Replace the unknown quantities $H(x_i)$ in model (1) with their estimates $\hat{H}(x_i)$, for $i = 1, \ldots, n$, fit the bankruptcy probability by the resulting model

$$p(D = 1 \mid X = x_i, Z = z_i) = \frac{\exp\{\theta_0 + \hat{H}(x_i) + \theta z_i\}}{1 + \exp\{\theta_0 + \hat{H}(x_i) + \theta z_i\}},$$

and use the training sample to maximize the corresponding pseudo log-likelihood function with respect to $\theta_0$ and $\theta$. Here $\theta_0$ is a normalizing constant which makes the bankruptcy probability function integrate to 1. Let the maximum likelihood estimate of $\theta$ be denoted as $\hat{\theta}_R$, the refined estimate of $\theta$.

Step 3: Replace $\theta$ in the local log-likelihood function (4) by $\hat{\theta}_R$ and use the training sample to maximize the resulting pseudo local log-likelihood function with respect to $\alpha$ and $\beta$. Let $(\hat{\alpha}_R, \hat{\beta}_R)$ be the maximizer; then the refined estimate of $H(x_0)$ is $\hat{H}_R(x_0) = \hat{\alpha}_R$.

The refined predicted bankruptcy probability

$$\hat{p}_R(D = 1 \mid X = x_0, Z = z_0) = \frac{\exp\{\hat{H}_R(x_0) + \hat{\theta}_R z_0\}}{1 + \exp\{\hat{H}_R(x_0) + \hat{\theta}_R z_0\}}$$

is the logistic distribution evaluated at the predicted score $\hat{H}_R(x_0) + \hat{\theta}_R z_0$. Note that this new procedure is no more complicated than the procedure we discussed earlier. Since the computation of $\hat{\theta}_R$ depends only on the training sample, we consider $\hat{\theta}_R$ as a constant in the prediction system. For any new values $(x_0, z_0)$ of the predictors $(X, Z)$, we only need to apply Step 3 to compute $\hat{H}_R(x_0)$ and then the predicted bankruptcy probability.

The selection of constant b

The practical implementation of SLM requires the specification of the constant $b$, also called the bandwidth. It determines how many data points should be included in the SLM. From the last two subsections, we see that the same bandwidth has been used for computing the refined estimates $\hat{H}_R(x_0)$ and $\hat{\theta}_R$. Alternatively, we may consider using bandwidth $b_\theta$ to compute $\hat{\theta}_R$ (Steps 1 and 2), and employing bandwidth $b_H$ to compute $\hat{H}_R(x_0)$ (Step 3). The optimal values $b^*_\theta$ and $b^*_H$ of $b_\theta$ and $b_H$, respectively, can be determined by minimizing the mean square errors of the resulting $\hat{\theta}_R$ and $\hat{H}_R$. Theoretical results in Appendix A show that $b^*_H$ is of larger order in magnitude than $b^*_\theta$. Although such theoretical results give some indication on how to select the bandwidth parameters $b_\theta$ and $b_H$, they are not available in practice, since they depend on the unknown $H(\cdot)$, $\theta$ and density function of the predictors. Thus, in real applications, we would suggest considering the in-sample type I and II error rates defined in the first subsection as functions of $p$, $b_\theta$ and $b_H$, denoted as $\alpha_{in}(p, b_\theta, b_H)$ and $\beta_{in}(p, b_\theta, b_H)$, respectively. The bandwidth parameters and the cut-off point are then simultaneously determined so that the sum of the two in-sample error rates $\tau_{in}(p, b_\theta, b_H) = \alpha_{in}(p, b_\theta, b_H) + \beta_{in}(p, b_\theta, b_H)$ is minimal, subject to the constraints $p \in [0, 1]$, $\alpha_{in}(p, b_\theta, b_H) \le u$ and $b_H \ge b_\theta$. Note that such selected values $\hat{p}(u)$, $\hat{b}_\theta(u)$ and $\hat{b}_H(u)$ for $p^*(u)$, $b^*_\theta$ and $b^*_H$, respectively, also depend on the training sample. Hence they are also considered as constants in the prediction system.

Applications to case–control data

The purpose of this paper is to develop a semiparametric bankruptcy prediction model. Previous results show that a fairly simple SLM can be developed using the training sample, which consists of simple random observations. However, as we stated earlier, many financial data are sampled using case–control designs. For example, in Ohlson (1980), the bankrupt companies were generated from a list of failed companies (cases) satisfying certain inclusion criteria, and a sample of nonbankrupt companies was obtained from COMPUSTAT (controls). Further, in the analysis given by Grice and Dugan (2001), the bankrupt companies were collected from those reported by COMPUSTAT meeting certain conditions (cases), and the nonbankrupt companies were collected from those that did not receive poor S&P ratings (controls). So the basic question is: can we use the case–control data to develop SLM as the LLM did? The answer is yes, but it needs justification.

Applying case–control data to the SLM as if they were simple random observations is permissible, since our theoretical results in Appendix A show that the value of $\theta$ can be consistently estimated, and the unknown function $H(x)$ can be correctly estimated, up to the unknown additive constant

$$c^* = \log\{p(D = 0)/p(D = 1)\} + \log(n_1/n_0).$$

Using the case–control data, inferences about the constant $c^*$ are not possible since such data generally provide no information about the population frequency of bankrupt companies. Thus treating the case–control data as if they were simple random observations and applying the procedures outlined in the third subsection, we can only estimate $H(x_0) + c^*$ and $\theta$. Specifically, let $\hat{H}_{CC}(x_0)$ and $\hat{\theta}_{CC}$ denote the estimators derived from the third subsection; then $\hat{\theta}_{CC}$ consistently estimates $\theta$, but for any $x_0$, $\hat{H}_{CC}(x_0)$ estimates $H(x_0) + c^*$. Combining the inconsistency of $\hat{H}_{CC}(x_0)$ and the fact that the unknown quantity $c^*$ is generally not equal to 0, the resulting predicted score $\hat{H}_{CC}(x_0) + \hat{\theta}_{CC} z_0$ does not estimate the true score $H(x_0) + \theta z_0$, and thus the predicted bankruptcy probability

$$\hat{p}_{CC}(D = 1 \mid X = x_0, Z = z_0) = \frac{\exp\{\hat{H}_{CC}(x_0) + \hat{\theta}_{CC} z_0\}}{1 + \exp\{\hat{H}_{CC}(x_0) + \hat{\theta}_{CC} z_0\}}$$

does not estimate the true bankruptcy probability (1) with predictor values $(x_0, z_0)$. This is the major difference between applying the SLM to case–control data and to prospective data.

Fortunately, we still can use $\hat{\theta}_{CC}$ and $\hat{H}_{CC}(x)$ to develop a bankruptcy prediction device by applying the following simple equivalent inequalities:

$$\frac{\exp\{H(x) + \theta z\}}{1 + \exp\{H(x) + \theta z\}} > p$$

if and only if

$$\frac{\exp\{H(x) + c^* + \theta z\}}{1 + \exp\{H(x) + c^* + \theta z\}} > p^* = \frac{p \exp(c^*)}{(1 - p) + p \exp(c^*)}.$$

This result is to say that using the probability

$$\frac{\exp\{H(x) + \theta z\}}{1 + \exp\{H(x) + \theta z\}}$$

to define a classification device with cut-off point $p$ is equivalent to using the probability

$$\frac{\exp\{H(x) + c^* + \theta z\}}{1 + \exp\{H(x) + c^* + \theta z\}}$$

to define a classification device with cut-off point $p^*$. Hence we may pretend $\hat{p}_{CC}(D = 1 \mid X = x_0, Z = z_0)$ to be the estimate of the true bankruptcy probability and as before use it to determine the associated cut-off point $p^*(u)$ and bandwidth parameters $b^*_\theta$ and $b^*_H$ (see the fourth subsection), and then the semiparametric bankruptcy prediction device.

Before closing this section, we remark that, based on the same argument for bankruptcy prediction, the methods of LLM using prospective data and case–control data are considered to be essentially equivalent.
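The equivalence argument used in this section can be checked numerically. The Python sketch below is only an illustration under stated assumptions: the 2% population bankruptcy rate is a hypothetical number, n1 and n0 are taken to be the numbers of cases and controls, and the helper names (`logistic`, `shifted_cutoff`) are not from the paper. It shifts every score by a constant c*, maps a cut-off p to the corresponding cut-off p*, and confirms that both classification rules make the same decisions on a grid of scores.

```python
import math


def logistic(score):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-score))


def shifted_cutoff(p, c_star):
    """Cut-off p* on the case-control scale equivalent to cut-off p."""
    return p * math.exp(c_star) / ((1.0 - p) + p * math.exp(c_star))


# Hypothetical inputs: 2% population bankruptcy rate, 79 cases, 156 controls.
prior_bankrupt = 0.02
n1, n0 = 79, 156  # assumption: n1 = number of cases, n0 = number of controls
c_star = math.log((1.0 - prior_bankrupt) / prior_bankrupt) + math.log(n1 / n0)

p = 0.10
p_star = shifted_cutoff(p, c_star)

# True scores H(x) + theta*z on a grid; the case-control score is shifted by c*.
scores = [k / 10.0 for k in range(-50, 51)]
same = all((logistic(s) > p) == (logistic(s + c_star) > p_star) for s in scores)
print(f"c* = {c_star:.3f}, p* = {p_star:.3f}, identical classifications: {same}")
```

The mapping from p to p* is simply a shift of c* on the logit scale, which is why the unknown constant does not need to be estimated when the cut-off is chosen from the training sample itself.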

A REAL DATA EXAMPLE

In this section, a real case–control dataset is analyzed using our SLM method and the prediction rules DAM, LLM and KMV. McKee (2003) pointed out that company asset size and industry are significant factors affecting bankruptcy status. Thus an ideal approach is to stratify companies according to industry and asset size and determine a prediction model for each stratum. Unfortunately, we did not have enough data from the COMPUSTAT and CRSP databases for doing so. Thus, to illustrate our method, we simply used two controls to match with one case so that they had the same standard industry classification (SIC) code and similar company asset size from the same year. By doing this, it is clear that the company asset size no longer has power in discriminating the bankruptcy status of the company and thus will not be included in the analysis of our example.

We now introduce the case–control dataset. The dataset contains 79 companies that were delisted and declared bankruptcy (cases) during the period 1994–2002 by COMPUSTAT as meeting Chapter 11 Bankruptcy or Chapter 7 Liquidation. After identifying these companies filing for bankruptcy, both COMPUSTAT and CRSP databases were searched to locate the latest annual financial data prior to the delisting date. Thus the annual financial data for the identified bankrupt companies were from the period 1993–2001. Among the 79 selected bankrupt companies, each was matched with two nonbankrupt companies, except for two companies only matched with one nonbankrupt company each, due to the incompleteness of the two databases. Hence our dataset also contains 156 nonbankrupt companies (controls). The total number of companies in this research was n = 235. The financial institutions were eliminated from the sample due to the unique capital requirements and regulatory structure in that industry group.

We note that COMPUSTAT provides 233 companies whose common stocks were traded on the New York Stock Exchange, American Stock Exchange or NASDAQ, and that were declared bankrupt during the period 1994–2002. But since COMPUSTAT and CRSP databases contain many missing values for the predictors studied in our example, we only found 79 bankrupt companies with complete predictor values. There are no additional criteria imposed on the bankrupt companies in our case–control sample. The problem of missing data is not unusual in applications, especially when there are many predictive variables used in the model. As long as the missingness occurs 'at random' then it will not introduce systematic biases in our analyses (Little and Rubin, 2002). We have no reason to doubt that the missingness occurring in COMPUSTAT and CRSP databases is 'missing at random'.

Information about industry and company asset size of the selected companies is given in Tables I and II, respectively. The two-sample median test was performed to test the null hypothesis of equal magnitude of the asset size for a nonbankrupt company and that for a bankrupt company. The p-value given in Table II shows that there is no significant difference between the asset sizes of the two groups at significance level 0.05. This result indicates that our matching process has successfully created similar asset sizes for bankrupt and nonbankrupt companies in our case–control sample.


For predicting bankruptcy, the values of the nine variables used by Ohlson (1980) and the two variables suggested by Shumway (2001) were collected for our selected companies from COMPUSTAT and CRSP databases. The 11 predictive variables are as follows:

1. TLTA = total liabilities divided by total assets.
2. WCTA = working capital divided by total assets.
3. CLCA = current liabilities divided by current assets.
4. NITA = net income divided by total assets.
5. FUTL = funds provided by operations divided by total liabilities.
6. CHIN = $(NI_t - NI_{t-1}) / (|NI_t| + |NI_{t-1}|)$, where $NI_t$ is net income for the most recent period.
7. INTWO = one if net income was negative for the last two years, zero otherwise.
8. OENEG = one if total liabilities exceed total assets, zero otherwise.
9. SIZE = logarithm of total assets divided by GNP price-level index. The index assumes a base value of 100 for 1991.
10. Relative size = logarithm of each firm's market equity value divided by the total NYSE/AMEX/NASDAQ market equity value.
11. Excess return = monthly return on the firm minus the value-weighted CRSP NYSE/AMEX/NASDAQ index return cumulated to obtain the yearly return.

Note that Ohlson (1980) suggested using the first nine variables as predictive variables. But in this paper we only used the first eight variables as the predictive variables in our case–control data analysis. The ninth variable, SIZE, was not used as a predictive variable because the total asset had already been used as the matching factor in the process of selecting the case–control sample for study. The last two variables are the market-driven variables used in Shumway (2001).
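For concreteness, the eight accounting-based predictors retained from the list above can be computed from raw financial statement items as in the following Python sketch. The field names in the input dictionary, the function name `accounting_predictors` and the example figures are hypothetical, and the convention of setting CHIN to 0 when both net incomes are zero is an assumption made here to avoid division by zero, not something stated in the text.

```python
def accounting_predictors(f):
    """Compute the eight accounting-based predictors from raw items in dict f."""
    ni, ni_prev = f["net_income"], f["net_income_prev"]
    denom = abs(ni) + abs(ni_prev)
    return {
        "TLTA": f["total_liabilities"] / f["total_assets"],
        "WCTA": f["working_capital"] / f["total_assets"],
        "CLCA": f["current_liabilities"] / f["current_assets"],
        "NITA": ni / f["total_assets"],
        "FUTL": f["funds_from_operations"] / f["total_liabilities"],
        # Assumption: CHIN is set to 0 when both net incomes are zero.
        "CHIN": (ni - ni_prev) / denom if denom > 0 else 0.0,
        # Assumption: the "last two years" are the current and previous year.
        "INTWO": 1 if ni < 0 and ni_prev < 0 else 0,
        "OENEG": 1 if f["total_liabilities"] > f["total_assets"] else 0,
    }


# Hypothetical firm (figures in million US dollars).
firm = {
    "total_assets": 200.0, "total_liabilities": 150.0, "working_capital": 20.0,
    "current_assets": 70.0, "current_liabilities": 50.0,
    "net_income": -10.0, "net_income_prev": -30.0, "funds_from_operations": 15.0,
}
for name, value in accounting_predictors(firm).items():
    print(name, round(value, 3))
```

Note that CHIN is bounded between −1 and 1 by construction, so it measures the direction and relative size of the change in net income rather than its absolute level.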

Table I. The SIC codes of our case–control sample

| SIC category | Number of bankrupt companies | Number of nonbankrupt companies |
| --- | --- | --- |
| 1000–1999 | 4 | 8 |
| 2000–2999 | 11 | 22 |
| 3000–3999 | 21 | 40 |
| 4000–4999 | 5 | 10 |
| 5000–5999 | 18 | 36 |
| 6000–6999 | 3 | 6 |
| 7000–7999 | 13 | 26 |
| 8000–8999 | 4 | 8 |
| Total companies | 79 | 156 |

Table II. Summary statistics of company asset sizes (in million US dollars) from our case–control sample

| Statistic | 79 bankrupt companies | 156 nonbankrupt companies | Median-stat (p-value) |
| --- | --- | --- | --- |
| Mean | 105.103 | 150.508 | 0.092 (0.927) |
| Median | 32.211 | 33.599 | |
| SD | 290.254 | 808.741 | |
| Min. | 1.447 | 1.636 | |

For predicting bankruptcy, we further computed the KMV–Merton default probabilities $p_{KMV}$ for the selected companies in our case–control data analysis. The detailed computation procedure of $p_{KMV}$ is given in Bharath and Shumway (2004).

Pairwise scatter diagrams of our case–control sample on the continuous variables are presented in Figure 1. From the figure, it is clear that the distributions of these variables are fat-tailed and skewed, and it is very difficult to perform bankruptcy prediction visually, since most data points are clustered together.

The summary statistics of the 10 predictive variables considered in our case–control data analysis are presented in Table III. For each of these 10 variables, the two-sample median test was performed to test the null hypothesis that the variable has the same median for nonbankrupt and bankrupt companies. The p-values in Table III show that this null hypothesis is rejected at the 0.05 level for each predictive variable. This result indicates that each of these variables should be an effective predictive variable. On the other hand, the summary statistics and the frequency distribution of the values of $p_{KMV}$ for the selected companies in our case–control data analysis are shown respectively in Table III and Figure 2. The results also indicate that $p_{KMV}$ has good predictive power.

Given our case–control sample, the bankruptcy prediction rules associated with DAM, LLM, KMV and SLM were estimated. Their performance was measured by the out-of-sample error rate, which was computed on each of 100 testing samples randomly selected from the given case–control sample. Each testing sample was composed of 50% of the bankrupt companies and their matched nonbankrupt companies. The data not included in the testing sample were taken as the training sample, and were used to develop the prediction rule.
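The splitting scheme just described can be sketched as follows. This is only an illustration under assumptions: the matched-set identifier `set_id`, the toy sample of six matched sets, and the function names are hypothetical and do not come from the paper. The point is that a bankrupt company and its matched controls always land on the same side of the split.

```python
import numpy as np

def split_matched_sets(case_ids, rng, test_fraction=0.5):
    """Randomly assign matched sets (one case plus its controls) to testing or training."""
    case_ids = np.asarray(case_ids)
    n_test = int(round(test_fraction * len(case_ids)))
    shuffled = rng.permutation(case_ids)
    return np.sort(shuffled[:n_test]), np.sort(shuffled[n_test:])

def rows_for_sets(set_id, chosen_ids):
    """Boolean mask of companies whose matched-set identifier is among chosen_ids."""
    return np.isin(set_id, chosen_ids)

# Toy data (hypothetical): 6 matched sets, each with 1 bankrupt company (D = 1)
# followed by its 2 matched nonbankrupt companies (D = 0).
set_id = np.repeat(np.arange(6), 3)
D = np.tile([1, 0, 0], 6)

rng = np.random.default_rng(0)
test_sets, train_sets = split_matched_sets(np.arange(6), rng)
test_mask = rows_for_sets(set_id, test_sets)
train_mask = ~test_mask
```

Repeating the split with fresh random draws gives the collection of testing samples over which the error rates are averaged.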

Under SLM, the kernel function $K$ was taken to be the Epanechnikov kernel. To compute the out-of-sample error rate for the prediction rule based on SLM on each testing sample, the procedure given in the second section for computing the in-sample total error rate $\tau_{in}(p, b_\theta, b_H) = \alpha_{in}(p, b_\theta, b_H) + \beta_{in}(p, b_\theta, b_H)$ on the training sample was applied to choose the values of $(p, b_\theta, b_H)$. We computed $\tau_{in}(p, b_\theta, b_H)$ on the equally spaced logarithmic grid of $1001 \times 501 \times 501$ values of $(p, b_\theta, b_H)$ in $[0, 1] \times [1/10, 15] \times [1/10, 15]$. Given each value of $u \in [0, 1]$, the global minimizer $\{\hat p(u), \hat b_\theta(u), \hat b_H(u)\}$ of $\tau_{in}(p, b_\theta, b_H)$ on the grid points, subject to the restrictions $\alpha_{in}(p, b_\theta, b_H) \le u$ and $b_H > b_\theta$, was taken as the selected values for $(p, b_\theta, b_H)$.
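The constrained selection over the grid can be sketched as below. This is a sketch under assumptions, not the authors' code: the in-sample error surfaces `alpha_in` and `beta_in` are taken as already computed on the grid (here they are filled with hypothetical placeholder values), the cutoff grid is taken as linearly spaced for simplicity, and the grids are far smaller than the 1001 × 501 × 501 grid used in the paper.

```python
import numpy as np

def select_on_grid(alpha_in, beta_in, p_grid, b_theta_grid, b_H_grid, u):
    """Return (p, b_theta, b_H) minimizing alpha_in + beta_in subject to
    alpha_in <= u and b_H > b_theta; return None if no grid point is feasible."""
    tau_in = alpha_in + beta_in
    feasible = alpha_in <= u
    # Broadcast the bandwidth-ordering restriction over the cutoff axis.
    feasible &= b_H_grid[None, None, :] > b_theta_grid[None, :, None]
    if not feasible.any():
        return None
    masked = np.where(feasible, tau_in, np.inf)
    i, j, k = np.unravel_index(np.argmin(masked), masked.shape)
    return p_grid[i], b_theta_grid[j], b_H_grid[k]

# Small illustrative grids; the bandwidth grids are equally spaced on the log scale.
p_grid = np.linspace(0.0, 1.0, 11)
b_theta_grid = np.geomspace(0.1, 15.0, 5)
b_H_grid = np.geomspace(0.1, 15.0, 5)

# Placeholder error surfaces (hypothetical, not fitted from data).
shape = (p_grid.size, b_theta_grid.size, b_H_grid.size)
alpha_in = np.broadcast_to(p_grid[:, None, None], shape).copy()
beta_in = np.broadcast_to(((1.0 - p_grid) ** 2)[:, None, None], shape).copy()

choice = select_on_grid(alpha_in, beta_in, p_grid, b_theta_grid, b_H_grid, u=0.35)
```

With these placeholder surfaces the unconstrained minimum of the total error sits at a cutoff of 0.5, so the restriction on the type I error rate is binding and the selected cutoff is the largest feasible one.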

Using the selected values of $\{\hat p(u), \hat b_\theta(u), \hat b_H(u)\}$ and the training sample, the values $\hat H_{CC}(x_i)$ and $\hat\theta_{CC}$ were computed for each data point $(x_i, z_i)$ in the testing sample. The company with the predictor values $(x_i, z_i)$ in the testing sample was classified as a bankrupt company if

$$\hat\psi_i = \frac{\exp\{\hat H_{CC}(x_i) + \hat\theta_{CC} z_i\}}{1 + \exp\{\hat H_{CC}(x_i) + \hat\theta_{CC} z_i\}} > \hat p(u),$$

otherwise a healthy company. After the classification procedure was completed for each company in the testing sample, the out-of-sample error rates

$$\alpha_{SLM}(u) = \frac{\sum_{i:\,(x_i, z_i)\ \text{in testing sample}} D_i\, I\{\hat\psi_i \le \hat p(u)\}}{\sum_{i:\,(x_i, z_i)\ \text{in testing sample}} D_i}, \qquad \beta_{SLM}(u) = \frac{\sum_{i:\,(x_i, z_i)\ \text{in testing sample}} (1 - D_i)\, I\{\hat\psi_i > \hat p(u)\}}{\sum_{i:\,(x_i, z_i)\ \text{in testing sample}} (1 - D_i)}$$

Figure 1. Pairwise scatter diagrams of our case–control sample on Shumway’s two market-driven variables, Excess Return and Relative Size, and Ohlson’s six continuous variables. Each graph plots 156 nonbankrupt companies (+) and 79 bankrupt companies (×) selected from COMPUSTAT and CRSP databases

Table III. Summary statistics of variables in our case–control sample

Variable         Mean     Median   SD      Min.     Max.     Median-stat (p-value)

79 bankrupt companies
TLTA             0.801    0.747    0.435   0.020    2.450    −5.432 (0.000)
WCTA             0.040    0.075    0.387   −1.192   0.980    4.511 (0.000)
CLCA             1.711    0.857    3.545   0.020    23.214   −4.603 (0.000)
NITA             −0.423   −0.161   0.649   −2.833   0.182    5.891 (0.000)
FUTL             −0.335   −0.051   0.921   −4.953   1.279    5.339 (0.000)
CHIN             −0.251   −0.363   0.655   −1.000   1.000    3.130 (0.002)
INTWO            0.570    1        0.498   0        1        −4.612 (0.000)
OENEG            0.190    0        0.395   0        1        −3.844 (0.000)
Excess return    −0.254   −0.634   1.258   −1.320   6.617    3.682 (0.000)
Relative size    −5.803   −5.830   0.675   −7.379   −4.577   4.234 (0.000)
pKMV             0.413    0.331    0.383   0.000    1.000    −6.537 (0.000)

156 nonbankrupt companies
TLTA             0.486    0.478    0.273   0.029    1.926
WCTA             0.276    0.291    0.258   −0.592   0.921
CLCA             0.707    0.509    0.796   0.055    6.904
NITA             −0.079   0.024    0.386   −3.800   0.249
FUTL             −0.030   0.110    0.715   −3.387   2.544
CHIN             −0.015   0.052    0.573   −1.000   1.000
INTWO            0.263    0        0.442   0        1
OENEG            0.038    0        0.193   0        1
Excess return    −0.131   −0.289   0.631   −1.246   2.503
Relative size    −5.284   −5.320   0.659   −6.838   −2.821
pKMV             0.114    0.001    0.241   0.000    0.989

Figure 2. The frequency histogram of the values of $p_{KMV}$ for the 156 nonbankrupt companies and that for the

of the bankruptcy prediction rule based on SLM were computed, for each given value of $u$. For the given value of $u$, $\alpha_{SLM}(u)$ is the out-of-sample type I error rate of classifying bankrupt companies as healthy ones, and $\beta_{SLM}(u)$ is the out-of-sample type II error rate of classifying healthy companies as bankrupt ones in the testing sample. After the computation procedure was completed for each testing sample, the average of each out-of-sample error rate over the 100 testing samples was computed.
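The two error rates and their sum can be computed in a few lines. The sketch below is illustrative only: `psi_hat` stands for the predicted bankruptcy probabilities on a testing sample, `cutoff` for the selected cutoff, and the six toy values are hypothetical.

```python
import numpy as np

def out_of_sample_error_rates(psi_hat, D, cutoff):
    """Type I, type II and total error rates on a testing sample.

    psi_hat : predicted bankruptcy probabilities
    D       : 1 for bankrupt companies, 0 for nonbankrupt companies
    cutoff  : a company is classified as bankrupt when psi_hat > cutoff
    """
    psi_hat = np.asarray(psi_hat, dtype=float)
    D = np.asarray(D, dtype=int)
    predicted_bankrupt = psi_hat > cutoff
    alpha = np.mean(~predicted_bankrupt[D == 1])  # bankrupt classified as healthy
    beta = np.mean(predicted_bankrupt[D == 0])    # healthy classified as bankrupt
    return alpha, beta, alpha + beta

# Toy testing sample (hypothetical values): three bankrupt, three nonbankrupt.
psi_hat = [0.9, 0.4, 0.7, 0.2, 0.6, 0.1]
D = [1, 1, 1, 0, 0, 0]
alpha, beta, tau = out_of_sample_error_rates(psi_hat, D, cutoff=0.5)
```

In this toy sample one of the three bankrupt companies and one of the three nonbankrupt companies are misclassified, so both rates are 1/3 and the total error rate is 2/3.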

The same computation procedures were applied to the prediction rules based on DAM, LLM and KMV. The bankruptcy prediction method based on KMV was performed by taking the predicted bankruptcy probability $\hat p(D = 1 \mid X = x_i, Z = z_i)$ as the value of the KMV–Merton default probability $p_{KMV}$ associated with the $i$th company. Let $\{\alpha_{DAM}(u), \beta_{DAM}(u), \tau_{DAM}(u)\}$, $\{\alpha_{LLM}(u), \beta_{LLM}(u), \tau_{LLM}(u)\}$ and $\{\alpha_{KMV}(u), \beta_{KMV}(u), \tau_{KMV}(u)\}$ be similarly defined as the out-of-sample error rates for DAM, LLM and KMV. The results of applying the four bankruptcy prediction rules to our case–control data are shown in Figure 3 and Table IV.

Figure 3 presents the three out-of-sample error rates for the four prediction models, both for one testing sample and averaged over the 100 testing samples. These error rates were derived under the constraint that the type I error rate was at most $u$. If no such constraint is required, we simply take $u = 1$, and the related out-of-sample error rates are given in Table IV. For the case of $u = 1$, both SLM and KMV give smaller out-of-sample type I error rates than DAM and LLM. Nevertheless, KMV has the largest out-of-sample type II error rate among the four competing prediction rules. DAM and LLM show rather similar behavior in the sense of having almost the same averaged out-of-sample type I and II error rates. In terms of the total error rate, however, Table IV confirms that SLM has the best overall performance. Thus it is fair to say that by a reasonable margin the most accurate model listed in Table IV is the SLM.

From Figure 3, we find that similar conclusions to those shown in Table IV can also be drawn. For $u \le 0.2$, KMV has the smallest averaged out-of-sample type I error rate. However, it also has the largest averaged type II error rate in this range. For $u > 0.2$, KMV has a similar averaged type II error rate to SLM but a larger type I error rate than SLM. For $u \in [0, 1]$, DAM and LLM show very similar performance. However, comparing the four prediction rules based on averaged out-of-sample total error rate, Figure 3 shows that SLM has the best overall performance.

SIMULATION STUDIES

In this section, a simulation study was performed to compare the performance of the prediction rules based on DAM, LLM and SLM. We first introduce the simulation settings. The dimension of the continuous predictor $X$ was $d = 2$, and that of the discrete predictor $Z$ was $q = 1$. Two skewed and fat-tailed distributions for the simulated $X = (X_1, X_2)$ were considered. In the first case, the skewed Student’s t distribution (Fernandez and Steel, 1998) with degrees of freedom $k$ and scale parameter $s$ was considered. The simulated control (nonbankruptcy) $X_j$ values were taken from the skewed Student’s t distribution with $(k, s) = (3, 2)$ for $j = 1$, and $(7, 4)$ for $j = 2$, and for the case (bankruptcy) values $(k, s) = (5, -3)$ for $j = 1$, and $(5, 2)$ for $j = 2$ were used. In the second case, the Pareto distribution (Siegrist, 2005) with shape parameter $a$ and scale parameter $s$ was considered. Similarly, the values of $(a, s)$ of the Pareto random variables $X_1$ and $X_2$ for controls were $(3, 2)$ and $(7, 4)$, and those for cases were $(5, -3)$ and $(5, 2)$, respectively.

Figure 3. Parts (a)–(c) show respectively three out-of-sample error rates of the prediction methods associated with KMV, DAM, LLM, and SLM estimated with one testing sample. Parts (d)–(f) show respectively the averages of the three out-of-sample error rates over the 100 testing samples. Each testing sample was composed of 50% of bankrupt companies and their matched nonbankrupt companies in our case–control sample

Table IV. Given the value of $u = 1$, the values of the three out-of-sample error rates shown in (a)–(c) of Figure 3, and those shown in (d)–(f) of Figure 3 (given in parentheses)

                     KMV             DAM             LLM             SLM
Type I error rate    0.250 (0.253)   0.375 (0.290)   0.350 (0.296)   0.200 (0.202)
Type II error rate   0.405 (0.328)   0.241 (0.278)   0.228 (0.287)   0.291 (0.321)
Total error rate     0.655 (0.581)   0.616 (0.568)   0.578 (0.583)   0.491 (0.523)

Given marginal distributions of $X$, the simulated control $X$ values with size 200 were generated using mean vector $(\mu_0, 0)$ and covariance matrix

$$\begin{pmatrix} 1/25 & -1/250 \\ -1/250 & 1/25 \end{pmatrix},$$

and their associated $Z$ values were independently generated from a binary random variable with the probability $p(Z = 1) = 1/3$. The simulated case $X$ values with size 100 were similarly generated with mean vector $(\mu_1, 0)$ and covariance matrix

$$\begin{pmatrix} 1/4 & 1/8 \\ 1/8 & 1/4 \end{pmatrix},$$

and their associated $Z$ values were independently generated from a binary random variable with the probability $p(Z = 1) = 2/3$. Two sets of the values $(\mu_0, \mu_1) = (-0.1, 0.1)$ and $(-0.3, 0.3)$ were considered. For each distribution of $X$ and each set of values $(\mu_0, \mu_1)$, 100 independent sets of the case–control data were generated. Given each case–control data set, one testing sample was randomly selected, and was composed of 50% of cases and their matched controls.

Three bankruptcy prediction methods based on DAM, LLM and SLM were considered in this simulation study. The computation procedures and the measures of performance presented in the third section were applied to the three prediction methods. For SLM, the equally spaced logarithmic grid of $201 \times 201$ values of $(b_\theta, b_H)$ in $[1/10, 2] \times [1/10, 2]$ was employed for selecting values of $(b_\theta, b_H)$, and the Epanechnikov kernel was used. The simulation results are presented in Figures 4 and 5. Figure 4 presents averages of out-of-sample error rates over the 100 simulated datasets for the three bankruptcy prediction methods under the case of the skewed Student’s t distribution for $X$. From the figure, our SLM performs better than DAM and LLM, since for most values of $u$ our prediction method has a smaller average out-of-sample error rate of any type. Further, the smaller the absolute difference $|\mu_0 - \mu_1|$, the more significant the advantage of SLM over DAM and LLM. The forecasting performance of the three prediction methods in the case of the Pareto distribution for $X$ is shown in Figure 5. The results from Figure 5 also confirm that SLM has the best overall performance.

CONCLUDING REMARKS AND FUTURE RESEARCH TOPICS

In this paper, bankruptcy prediction methods based on SLM are proposed for the prospective and the case–control data. Our SLM is developed by replacing the linear logit function of the LLM with an unknown but smooth logit function. Hence it is more flexible and robust than the LLM. The estimators for unknown quantities in the SLM are computed by the local likelihood method, and their large sample properties are studied through their asymptotic biases and variances. We point out that, under the case–control data, the estimated bankruptcy probability does not estimate the true bankruptcy probability, unless the quantity $c^* = \log\{p(D = 0)/p(D = 1)\} + \log(n_1/n_0)$ is 0. In contrast, for the prospective data, our estimated probability does estimate the true bankruptcy probability. This is the major difference between the applications of the logit model to the case–control sample and the prospective sample. However, because the logistic distribution function is strictly increasing, such an estimated probability can still be used to develop a bankruptcy prediction device.
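The role of $c^*$ can be illustrated numerically. The sketch below is an illustration under assumptions: the population bankruptcy rate of 1% is a hypothetical value chosen for the example (the paper does not state one), while $n_1 = 79$ and $n_0 = 156$ are the sample sizes of the case–control sample. Since the case–control logit equals the true logit plus $c^*$, subtracting $c^*$ on the logit scale maps a fitted probability back to the implied population probability, and the mapping preserves the ordering of companies.

```python
import math

def c_star(prior_bankrupt, n1, n0):
    """c* = log{p(D=0)/p(D=1)} + log(n1/n0)."""
    return math.log((1.0 - prior_bankrupt) / prior_bankrupt) + math.log(n1 / n0)

def logit(p):
    return math.log(p / (1.0 - p))

def expit(v):
    return 1.0 / (1.0 + math.exp(-v))

def to_population_probability(p_cc, c):
    """Map a case-control fitted probability to the implied population probability."""
    return expit(logit(p_cc) - c)

# 0.01 is an assumed population bankruptcy rate, used only for illustration.
c = c_star(prior_bankrupt=0.01, n1=79, n0=156)
low = to_population_probability(0.3, c)
high = to_population_probability(0.6, c)
```

Because the shift is monotone, a cutoff chosen on the case–control scale corresponds to exactly one cutoff on the population scale, which is why the prediction rule itself is unaffected.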

To decide the optimal prediction rule, we propose to minimize the sum of the in-sample type I and II error rates subject to the in-sample type I (II) error rate being at most $u$. This constraint is sometimes essential, since a type I (II) error may cause much more severe losses to investors. The value of $u \in [0, 1]$ is determined by the investor. If $u = 1$, then there is no restriction on the magnitude of the in-sample type I and II error rates. Our results from one real data example based on eight predictor variables of Ohlson (1980) and two market-driven variables of Shumway (2001), together with simulations, confirm that the SLM performs better than the DAM, LLM and KMV, in the sense of having a smaller out-of-sample total error rate.

In applications of the SLM to the bankruptcy prediction problem with case–control data, the relation between the $d$-dimensional predictive variable $x$ and the logit of bankruptcy probability can be obtained from the plot of the estimated function $\hat H_{CC}(x)$. For example, the relation between the $j$th component of $x$ and the logit of bankruptcy probability can be seen from the plot of $\hat H_{CC}(x)$ against the $j$th component of $x$, with the other components of $x$ held fixed, for each $j = 1, \ldots, d$. The same remark applies to the plot of $\hat H_R(x)$ with prospective data.
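To make the idea of plotting an estimated logit function concrete, the following sketch evaluates a local-linear, kernel-weighted logistic fit on a grid of points. It is a simplified illustration and not the authors' implementation: it assumes a single continuous predictor, a scalar binary $Z$ whose coefficient `theta_hat` is treated as already estimated, prospective simulated data (so no $c^*$ shift arises), and a plain Newton–Raphson solver. The function names and simulation settings are hypothetical.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel supported on [-1, 1]."""
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def local_linear_logit(x0, x, z, D, theta_hat, b, n_iter=25):
    """Intercept of a kernel-weighted local-linear logistic fit at x0,
    with theta_hat * z entering as a fixed offset."""
    w = epanechnikov((x - x0) / b)
    keep = w > 0
    X = np.column_stack([np.ones(keep.sum()), x[keep] - x0])
    offset = theta_hat * z[keep]
    y, w = D[keep], w[keep]
    coef = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ coef + offset)))
        grad = X.T @ (w * (y - p))                      # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        step = np.linalg.solve(hess, grad)
        coef += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return coef[0]

# Simulated prospective data with true logit H(x) + theta * z, H(x) = 2x - 1.
rng = np.random.default_rng(1)
n = 4000
x = rng.random(n)
z = rng.integers(0, 2, n)
eta = (2.0 * x - 1.0) + 0.5 * z
D = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(int)

grid = np.linspace(0.2, 0.8, 4)
H_hat = np.array([local_linear_logit(g, x, z, D, theta_hat=0.5, b=0.3) for g in grid])
```

Plotting `H_hat` against `grid` gives the kind of curve described above; with several continuous predictors one would vary one component over the grid while holding the others fixed.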

In order to estimate the SLM, we need to decide proper values of the bandwidth parameters $b_\theta$ and $b_H$. In this paper, we suggest estimating the bandwidth parameters so that the sum of the

Figure 4. Given $(\mu_0, \mu_1) = (-0.1, 0.1)$ in the case of the skewed Student’s t distribution for $X$, parts (a)–(c) show respectively the sample averages of the three out-of-sample error rates of the prediction methods based on DAM, LLM and SLM over the 100 simulated case–control datasets. For each simulated case–control dataset, one testing sample was randomly selected, and was composed of 50% of cases and their matched controls. The corresponding results for the case of $(\mu_0, \mu_1) = (-0.3, 0.3)$ are shown in parts (d)–(f)

corresponding in-sample type I and II error rates is minimized subject to some restrictions. This approach may suffer from a heavy computational burden. One possible remedy for this drawback is to use the plug-in method to estimate these bandwidth parameters. For example, given case–control data, we may determine the bandwidth parameters to minimize the estimated mean square error of each of the estimators $\hat H_{CC}(x)$ and $\hat\theta_{CC}$. For more discussion of the plug-in method, see, for example, Härdle et al. (1992) and Jones et al. (1996).

In the case of massive data, a remedy for reducing the computational burden for SLM is to bin the data. See Fan and Marron (1994) and the monograph by Härdle (1991) for a detailed introduction to the binning method. On the other hand, when a large number of predictors is available, we may perform variable selection for SLM. See, for example, Härdle et al. (2006) for a variable selection method based on forward selection. These two topics are important in practice, and they are our future research topics.
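As a rough illustration of what binning means here, the sketch below replaces raw observations of one continuous predictor by bin centers together with counts of cases and controls per bin, so that later kernel-weighted computations run over the bins rather than over every observation. It uses simple (nearest-bin) counting rather than the linear binning discussed in the cited references, and all names and settings are hypothetical.

```python
import numpy as np

def bin_case_control(x, D, n_bins, lo=0.0, hi=1.0):
    """Simple binning: counts of cases (D = 1) and controls (D = 0) per bin."""
    edges = np.linspace(lo, hi, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    cases, _ = np.histogram(x[D == 1], bins=edges)
    controls, _ = np.histogram(x[D == 0], bins=edges)
    return centers, cases, controls

# Simulated data (hypothetical): 10,000 observations reduced to 50 bins.
rng = np.random.default_rng(2)
x = rng.random(10_000)
D = (rng.random(10_000) < 0.3).astype(int)
centers, cases, controls = bin_case_control(x, D, n_bins=50)
```

A weighted likelihood evaluated at the 50 bin centers, with the counts as weights, then approximates the likelihood over all 10,000 points at a fraction of the cost; the approximation error depends on the bin width relative to the bandwidth.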

Three extensions of SLM are outlined below. Firstly, Shumway (2001) criticized the bankruptcy model using single-period data as static in nature, since it ignores the fact that the characteristics of firms change through time. To avoid this drawback of the static model, Shumway (2001) and Chava and Jarrow (2004) apply the idea of survival analysis (Cox and Oakes, 1984) to develop the discrete-time hazard model (DHM) for bankruptcy prediction. Their DHM has the advantage of using all available historical information to determine each firm’s bankruptcy risk at each point in time; hence it is a dynamic forecasting model. However, their DHM is also based on the assumption of a simple linear logit relation for the hazard function. As in the LLM, this assumption may not be valid. We point out that the idea of SLM can also be directly applied to the DHM using panel data.

Figure 5. As Figure 4, with the skewed Student’s t distribution for X replaced by the Pareto distribution for X

Secondly, in the SLM, we assume that $H(x)$ is an unknown but smooth function, and a local likelihood method has been developed to estimate $H(x)$. However, the resulting estimator $\hat H_{CC}(x)$ suffers from the curse of dimensionality; that is, as the dimension of the continuous predictor $X$ increases, the performance of the resulting $\hat H_{CC}(x)$ deteriorates. For example, from Remark 1 of Appendix A, the minimum mean square error of $\hat H_{CC}(x)$ with respect to $b_H$ is of order $n^{-4/(d+4)}$ in magnitude. Such mean square error increases as the value of $d$ increases. To avoid such a drawback, one possible remedy is to consider an additive model for $H(x)$:

$$H(x) = H_1(x_1) + \cdots + H_d(x_d),$$

as described by Hastie and Tibshirani (1990). Here $x = (x_1, \ldots, x_d)^T$ and $H_j(x_j)$ is an unknown but smooth function of $x_j$, the $j$th component of $x$, for each $j = 1, \ldots, d$. This is basically equivalent to mapping the original predictors to the transformed variables having the desired linear relation in the LLM.
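The order $n^{-4/(d+4)}$ quoted above can be recovered from a standard bias–variance balance. The sketch below assumes, as is usual for local linear smoothers and as Remark 1 of Appendix A indicates, that the squared bias of $\hat H_{CC}(x)$ is of order $b_H^4$ and its variance of order $(n b_H^d)^{-1}$; the constants $C_1$ and $C_2$ are placeholders introduced for this illustration, not quantities defined in the paper.

```latex
\begin{align*}
\mathrm{MSE}(b_H) &\approx C_1 b_H^{4} + \frac{C_2}{n\, b_H^{d}}, \\
\frac{\partial\,\mathrm{MSE}}{\partial b_H} = 0
  &\;\Longrightarrow\; 4 C_1 b_H^{3} = \frac{d\, C_2}{n\, b_H^{d+1}}
  \;\Longrightarrow\; b_H^{*} = \Bigl(\frac{d\, C_2}{4 C_1}\Bigr)^{1/(d+4)} n^{-1/(d+4)}, \\
\mathrm{MSE}(b_H^{*}) &\propto n^{-4/(d+4)}:
  \qquad d = 1 \Rightarrow n^{-4/5}, \quad d = 2 \Rightarrow n^{-2/3}, \quad d = 8 \Rightarrow n^{-1/3}.
\end{align*}
```

The exponent $4/(d+4)$ shrinks toward zero as $d$ grows, so each additional continuous predictor slows the rate at which the mean square error vanishes.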

Finally, the logit function of our SLM is basically an additive model with $H(X)$ and $\theta Z$. This assumption will be violated if $X$ and $Z$ interact. A possible solution to this problem is to introduce a nonparametric interaction such as $G(X)Z$, where $G(\cdot)$ is a $q$-dimensional row vector of unknown but smooth functions, in the model. It will be interesting to study the estimates of functions $H(x)$ and $G(x)$ simultaneously.

APPENDIX A: THEORETICAL RESULTS

In Appendix A, we shall study the asymptotic properties of the estimators $\hat H(x)$, $\hat\theta_{CC}$ and $\hat H_{CC}(x)$ given in the second section with case–control data. For this purpose, the composition of the case–control sample and the formulations of these estimators are recalled. According to the case–control sampling, we draw a random sample of $n_0$ nonbankrupt companies (controls), say $(x_1, z_1), \ldots, (x_{n_0}, z_{n_0})$, from the conditional distribution of $(X, Z)$ given $D = 0$, and an independent random sample of $n_1$ bankrupt companies (cases), say $(x_{n_0+1}, z_{n_0+1}), \ldots, (x_n, z_n)$, where $n = n_0 + n_1$, from the conditional distribution of $(X, Z)$ given $D = 1$. Hence we have the case–control sample $(D_i, x_i, z_i)$, $i = 1, \ldots, n$, where $D_i = 0$ for $i \le n_0$, and $D_i = 1$ for $i > n_0$. Thus if $f_0(x, z) = f(x, z \mid D = 0)$ and $f_1(x, z) = f(x, z \mid D = 1)$ are the conditional frequency functions of $(X, Z)$ given $D = 0$ and 1, respectively, then from Bayes’ theorem and (1), these two frequency functions can be related by

$$f_1(x, z) = f_0(x, z)\exp\{H^*(x) + \theta z\}, \tag{5}$$

where $H^*(x) = H(x) + \log\{p(D = 0)/p(D = 1)\}$.

Given the case–control sample and the bandwidth parameters $b_\theta$ and $b_H$, by the development of the logistic regression in the case–control setting given in Section 6.3 of Hosmer and Lemeshow (1989), the log-likelihood functions corresponding to $\hat H(x)$, $\hat\theta_{CC}$ and $\hat H_{CC}(x)$ given in the second section with general kernel function $K$ can be expressed respectively as

$$\ell_1(\alpha, \beta, \theta) = -\sum_{i=1}^{n} \log[1 + \exp\{\alpha + \beta(x_i - x) + \theta z_i\}]\, K_{b_\theta}(x_i - x) + \sum_{i=n_0+1}^{n} \{\alpha + \beta(x_i - x) + \theta z_i\}\, K_{b_\theta}(x_i - x), \tag{6}$$

$$\ell_2(\theta_0, \theta) = -\sum_{i=1}^{n} \log[1 + \exp\{\theta_0 + \hat H(x_i) + \theta z_i\}] + \sum_{i=n_0+1}^{n} \{\theta_0 + \hat H(x_i) + \theta z_i\}, \tag{7}$$

$$\ell_3(\alpha, \beta) = -\sum_{i=1}^{n} \log[1 + \exp\{\alpha + \beta(x_i - x) + \hat\theta_{CC} z_i\}]\, K_{b_H}(x_i - x) + \sum_{i=n_0+1}^{n} \{\alpha + \beta(x_i - x) + \hat\theta_{CC} z_i\}\, K_{b_H}(x_i - x). \tag{8}$$

Note that the parameter $\beta$ in (6) and (8) represents the unknown quantity $H^{(1)}(x)^T$, as it did in (3) and (4), with $x$ replaced by $x_0$. But $\alpha$ in (6) and (8) stands for $H(x) + c^*$, not as it did in (3) and (4) for $H(x)$, with $x$ replaced by $x_0$. Here $c^* = \log\{p(D = 0)/p(D = 1)\} + \log(n_1/n_0)$, and it has been defined in the second section. Hence the maximum likelihood estimates of $\alpha$ produced from (6) and (8) would be estimates for $H(x) + c^*$, not for $H(x)$. This fact causes the major difference between the applications of the logit model to the case–control sample and the prospective sample.

To study the asymptotic properties of $\hat H(x)$, $\hat\theta_{CC}$ and $\hat H_{CC}(x)$, we need the following conditions:

C1. Kernel function $K(u)$ is a symmetric and Lipschitz continuous probability density function supported on $[-1, 1]$, and is bounded above zero on $[-1/2, 1/2]$.

C2. $n_0/n \to \zeta \in (0, 1)$, as $n \to \infty$.

C3. Bandwidth parameters $b_\theta, b_H \in [\delta n^{-1+\delta}, \delta^{-1} n^{-\delta}]$, for some $\delta$ satisfying $0 < \delta < 1/2$. They also satisfy $n b_\theta^{d+2} \gg 1 \gg b_\theta$ and $n b_H^{d} \gg 1 \gg b_H \gg b_\theta$. The notation $a_n \gg b_n$ means that $b_n/a_n \to 0$, as $n \to \infty$.

C4. The $d$-variate function $H(x)$ is defined on $[0, 1]^d$, and each of its second-order partial derivatives is Lipschitz continuous on $[0, 1]^d$.

C5. Under control and case populations, their respective marginal densities $f_0(x)$ and $f_1(x)$ of $X$ are Lipschitz continuous and bounded above zero on $[0, 1]^d$. Also, their corresponding conditional probabilities $f_0(z \mid x)$ and $f_1(z \mid x)$ of $Z$ given $X = x$ cannot be zero or one for each given $x$, and are Lipschitz continuous with respect to $x$.

Conditions C1–C4 are standard for the usual nonparametric regression analysis. The support $[0, 1]^d$ in C4 and C5 of the $d$-dimensional variable $X$ is given for simplicity of presentation. It can be replaced with any bounded region $\Omega \subset R^d$, and the asymptotic properties for the resulting $\hat\theta_{CC}$ and $\hat H_{CC}(x)$ remain unchanged. The first part of condition C5 guarantees that the design points $X$, under control and case populations, have no holes on $[0, 1]^d$. The second part of C5 makes sure that the Hessian matrix of each of $\ell_1(\alpha, \beta, \theta)$ and $\ell_3(\alpha, \beta)$ is invertible.

In order to give concise expressions for the asymptotic properties of $\hat H(x)$, $\hat\theta_{CC}$ and $\hat H_{CC}(x)$, we need more notation. Let

$$x = (x_1, \ldots, x_d)^T, \quad t = (t_1, \ldots, t_d)^T, \quad H_{ij}(x) = \frac{\partial^2 H(x)}{\partial x_i\, \partial x_j}, \quad K^{\#}(t) = \prod_{j=1}^{d} K(t_j),$$

$$m_i = \max\{-1, (x_i - 1)/b\}, \quad k_i = \min\{1, x_i/b\}, \quad c_{ij} = \int_{m_1}^{k_1} \cdots \int_{m_d}^{k_d} t_i t_j K^*(t)\, dt,$$

$$\lambda_0 = \int_{m_1}^{k_1} \cdots \int_{m_d}^{k_d} K^{\#}(t)\, dt, \quad \tau_0 = \int_{m_1}^{k_1} \cdots \int_{m_d}^{k_d} K^{\#}(t)^2\, dt, \quad \lambda_{i,k} = \int_{m_i}^{k_i} u^k K(u)\, du,$$

$Q$ be the collection of all values of the discrete $q$-dimensional variable $Z$, and $I_1$ be the $(1 + q) \times (1 + q)$ identity matrix with the first column vector of the identity matrix deleted, for $i, j = 1, \ldots, d$ and $k \ge 0$. Here $K^*(t)$ is the $d$-variate Lejeune–Sarda kernel function of order two (Lejeune and Sarda, 1992). In particular, given the point $x \in [0, 1]^d$, the kernel function $K$, and the bandwidth $b$, $K^*(t)$ can be expressed as

$$K^*(t) = \Bigl\{\prod_{i=1}^{d} \lambda_{i,0}^{-1} K(t_i)\Bigr\} \Bigl[1 - \sum_{i=1}^{d} \frac{\lambda_{i,1}(\lambda_{i,0} t_i - \lambda_{i,1})}{\lambda_{i,0}\lambda_{i,2} - \lambda_{i,1}^2}\Bigr],$$

and its corresponding values $c_{ij}$ become

$$c_{ij} = \begin{cases} (\lambda_{i,2}^2 - \lambda_{i,1}\lambda_{i,3}) / (\lambda_{i,0}\lambda_{i,2} - \lambda_{i,1}^2), & \text{for } i = j, \\ -\lambda_{i,0}^{-1}\lambda_{i,1}\lambda_{j,0}^{-1}\lambda_{j,1}, & \text{for } i \ne j. \end{cases}$$

Define

$$r(x, z) = \frac{f_0(x, z)\exp\{H(x) + c^* + \theta z\}}{\zeta + (1 - \zeta)\exp\{H(x) + c^* + \theta z\}},$$

$$D_0(x) = \sum_{z \in Q} r(x, z), \quad D_1(x) = \sum_{z \in Q} z\, r(x, z), \quad D_2(x) = \sum_{z \in Q} z z^T r(x, z),$$

$$D_j = \int_0^1 \cdots \int_0^1 D_j(x)\, dx \ \text{ for } j = 0, 1, 2, \quad D = \begin{pmatrix} D_0 & D_1^T \\ D_1 & D_2 \end{pmatrix}.$$

Define quantities related to the asymptotic biases and variances of $\hat H(x)$, $\hat\theta_{CC}$ and $\hat H_{CC}(x)$ as follows:

$$c_H(x; b) = \frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d} c_{ij} H_{ij}(x), \quad v_{H,1}(x; b) = D_0(x)^{-1} \int_{m_1}^{k_1} \cdots \int_{m_d}^{k_d} K^*(t)^2\, dt,$$

$$v_{H,2}(x; b) = \lambda_0^{-2}\tau_0\, D_0(x)^{-1} D_1(x)^T \{D_0(x) D_2(x) - D_1(x) D_1(x)^T\}^{-1} D_1(x),$$

$$c_\theta(b) = -\Bigl[\int_0^1 \cdots \int_0^1 \{D_0(x), D_1(x)^T\}\, c_H(x; b)\, dx\Bigr] D^{-1} I_1,$$

$$v_\theta = D_2^{-1} + D_2^{-1} D_1 (D_0 - D_1^T D_2^{-1} D_1)^{-1} D_1^T D_2^{-1}.$$

If $x$ is in the interior region $[b, 1 - b]^d$ of $[0, 1]^d$, then it can be seen that the values of $c_H(x; b)$ and $v_{H,1}(x; b)$ become

$$c_H(x; b) = \frac{1}{2}\Bigl\{\int_{-1}^{1} u^2 K(u)\, du\Bigr\} \sum_{i=1}^{d} H_{ii}(x), \quad v_{H,1}(x; b) = D_0(x)^{-1} \Bigl\{\int_{-1}^{1} K(u)^2\, du\Bigr\}^{d}.$$

The following Theorem 1 states the asymptotic bias and variance for $\hat H(x)$, and those for $\hat\theta_{CC}$ and $\hat H_{CC}(x)$. Its proof will be given in Appendix B.

Theorem 1: Under the SLM and the case–control sample, suppose that conditions C1–C5 are satisfied. For each $x \in [0, 1]^d$ and as $n \to \infty$:

$$\mathrm{Bias}\{\hat H(x)\} = E\{\hat H(x)\} - H(x) - c^* = b_\theta^2\, c_H(x; b_\theta) + O(b_\theta^3 + n^{-1} b_\theta^{-d}), \tag{9}$$

$$\mathrm{Var}\{\hat H(x)\} = n^{-1} b_\theta^{-d} \zeta^{-1}(1 - \zeta)^{-1} \{v_{H,1}(x; b_\theta) + v_{H,2}(x; b_\theta)\} + O(n^{-1} b_\theta^{-d+1}), \tag{10}$$

$$\mathrm{Bias}(\hat\theta_{CC}) = E(\hat\theta_{CC}) - \theta = b_\theta^2\, c_\theta(b_\theta) + O(b_\theta^3 + n^{-1} b_\theta^{-d}), \tag{11}$$

$$\mathrm{Var}(\hat\theta_{CC}) = n^{-1} \zeta^{-1}(1 - \zeta)^{-1} v_\theta + O(n^{-1} b_\theta), \tag{12}$$

$$\mathrm{Bias}\{\hat H_{CC}(x)\} = E\{\hat H_{CC}(x)\} - H(x) - c^* = b_H^2\, c_H(x; b_H) + O(b_H^3 + b_\theta^2 + n^{-1} b_H^{-d}), \tag{13}$$

$$\mathrm{Var}\{\hat H_{CC}(x)\} = n^{-1} b_H^{-d} \zeta^{-1}(1 - \zeta)^{-1} v_{H,1}(x; b_H) + O(n^{-1} b_H^{-d+1}). \tag{14}$$

Remark 1: The optimal kernel function $K$ and the magnitudes of the optimal bandwidth parameters $b_\theta^*$ and $b_H^*$ for constructing $\hat\theta_{CC}$ and $\hat H_{CC}(x)$. By Theorem 8 of Fan et al. (1993) and our Theorem 1, the optimal $K$ satisfying the conditions in C1 for constructing $\hat H_{CC}(x)$ is the Epanechnikov kernel, for each $x \in [0, 1]^d$, in the sense of having smaller asymptotic mean square error. On the other hand, by (13) and (14), the optimal choice of $b_H$, in terms of having the smallest mean square error of $\hat H_{CC}(x)$, is $b_H^* = c_H^* n^{-1/(d+4)}$, where $c_H^*$ is a constant depending on the unknown factors $H(\cdot)$, $\theta$ and $f_0(x, z)$. Similarly, by (9)–(12) and C3, the optimal value $b_\theta^*$ of $b_\theta$, in terms of having the smallest mean square error of $\hat\theta_{CC}$, satisfies the condition $n^{-1/(d+4)} \gg b_\theta^* \gg n^{-1/(d+2)}$. Hence we conclude that the value of $b_H^*$ is of larger order than that of $b_\theta^*$, and that the mean square error of $\hat H_{CC}(x)$ using the optimal bandwidth parameter $b_H^*$ is of smaller order in magnitude than that of $\hat H(x)$ using $b_\theta^*$.

APPENDIX B: SKETCHES OF THE PROOFS

In Appendix B, sketches of the proof of Theorem 1 will be given. The following notation will be used. Let $\ell_j^{(1)}$ and $\ell_j^{(2)}$ be the gradient vector and the Hessian matrix of $\ell_j$, for each $j = 1, 2, 3$, given in (6)–(8), respectively. Also, $H^{(2)}(x)$ is the Hessian matrix of $H(x)$. Define $P_0$ as the event that the number of control data points falling into the neighborhood $N(x; b/2)$ of $x$ is less than $\rho_0 n_0 \int_{N(x; b/2)} f_0(t)\, dt$, where $\rho_0$ is a positive constant satisfying $\rho_0 \le 1/4$, and $Q_0$ as the event that the number of control data points falling into the neighborhood $N(x; b)$ of $x$ is greater than $\varphi_0 n_0 \int_{N(x; b)} f_0(t)\, dt$, where $\varphi_0$ is a positive constant satisfying $\varphi_0 \ge \exp(1)$. The definition of the neighborhood $N(x; b)$ of the given point $x$ has been given in (3). The events $P_1$ and $Q_1$ are similarly defined for case data points with $n_0$, $f_0$, $\rho_0$ and $\varphi_0$ replaced respectively by $n_1$, $f_1$, $\rho_1$ and $\varphi_1$.

The proofs of the asymptotic bias and variance for each of $\hat H(x)$, $\hat\theta_{CC}$ and $\hat H_{CC}(x)$ are given below in sequence.

Proof of the asymptotic bias and variance for $\hat H(x)$

Set $\omega = (\alpha, \beta, \theta)^T$ and $\hat\omega = (\hat\alpha, \hat\beta, \hat\theta)^T$, the maximizer of $\ell_1(\alpha, \beta, \theta)$ in (6). By the first-order Taylor expansion, we have

$$0 = \ell_1^{(1)}(\hat\omega) = \ell_1^{(1)}(\omega) + \ell_1^{(2)}(\omega^*)(\hat\omega - \omega), \tag{15}$$

for each $x \in [0, 1]^d$, where $\omega^*$ lies in the line segment connecting $\omega$ and $\hat\omega$.

Using conditions C1–C5, (5), and the large deviation theorem in Section 10.3.1 of Serfling (1980), a straightforward calculation leads to the following asymptotic results: as $n \to \infty$,

$$P(P_0 \cup Q_0 \cup P_1 \cup Q_1) = O\{\exp(-n b_\theta)\}, \tag{16}$$

$$E\{\ell_1^{(1)}(\omega)\} = \tfrac{1}{2}\, n b_\theta^2\, \zeta(1 - \zeta) A_1 + O(n b_\theta^3 + b_\theta^{-d}), \tag{17}$$

$$E\{\ell_1^{(2)}(\omega)\} = -n\, \zeta(1 - \zeta) B_1 + O(n b_\theta), \tag{18}$$

$$\mathrm{var}\{\ell_1^{(1)}(\omega)\} = n b_\theta^{-d}\, \zeta(1 - \zeta) C_1 + O(n b_\theta^{-d+1}), \tag{19}$$

for each $\omega$. Here

$$A_1 = \{u_0(x) D_0(x),\ u_1(x)^T D_0(x),\ u_0(x) D_1(x)^T\}^T,$$

$$B_1 = \int_{m_1}^{k_1} \cdots \int_{m_d}^{k_d} \Lambda_1(x, t) K^{\#}(t)\, dt, \quad C_1 = \int_{m_1}^{k_1} \cdots \int_{m_d}^{k_d} \Lambda_1(x, t) K^{\#}(t)^2\, dt,$$

where

$$u_0(x) = \int_{m_1}^{k_1} \cdots \int_{m_d}^{k_d} \{t^T H^{(2)}(x)\, t\} K^{\#}(t)\, dt, \quad u_1(x) = \int_{m_1}^{k_1} \cdots \int_{m_d}^{k_d} \{t^T H^{(2)}(x)\, t\}\, t K^{\#}(t)\, dt,$$

$$\Lambda_1(x, t) = \begin{pmatrix} D_0(x) & t^T D_0(x) & D_1(x)^T \\ t D_0(x) & t t^T D_0(x) & t D_1(x)^T \\ D_1(x) & D_1(x) t^T & D_2(x) \end{pmatrix}.$$

Using condition C3 and the results of (16)–(19) and comparing the magnitudes of $\ell_1^{(1)}(\omega) = O_p(n b_\theta^2 + n^{1/2} b_\theta^{-d/2})$ and $\ell_1^{(2)}(\omega^*) = O_p(n)$ in (15), we have

$$\hat\omega - \omega = o_p(1). \tag{20}$$

Using (15)–(20) and approximations to the standard errors of functions of random variables in Section 10.5 of Stuart and Ord (1987), the results of the asymptotic bias and variance of $\hat H(x)$ in (9) and (10) follow, respectively.

Proof of the asymptotic bias and variance for $\hat\theta_{CC}$

Set $\gamma = (\theta_0, \theta)$ and $\hat\gamma = (\hat\theta_0, \hat\theta_{CC})$, the maximizer of $\ell_2(\theta_0, \theta)$ in (7). Using the fact that $\theta_0$ is a normalizing constant for $f_1(x, z) = f_0(x, z)\exp\{\hat H(x) + \theta_0 + \theta z\}$, the results of the asymptotic bias and variance of $\hat H(x)$ in (9) and (10), C1–C5, (5), and approximations to the standard errors of functions