3.2 Over-Sampling technique
3.2.2 Replication
To tackle the imbalance problem, He and Garcia (2009) put forward the idea that important information should be given priority. The simplest method is replication. In this research, the emphasis is placed on the default samples; in other words, the default samples are replicated several times.
In the experiment, default samples in the original training set were replicated until their number equaled that of the non-default samples, which takes about 35 replications in this research. However, to examine how well the models perform at different levels of over-sampling, the default samples were replicated from 1 to 70 times.
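The replication step can be illustrated with a minimal Python sketch. The function names `replicate_minority` and `balancing_copies`, the label coding (1 for default, 0 for non-default) and the reading of "replicated n times" as "each default sample appears n times in total" are assumptions made for illustration, not details taken from the study.

```python
import math


def balancing_copies(labels, minority_label=1):
    """Smallest whole number of copies that makes the minority class
    at least as large as the majority class."""
    n_min = sum(1 for label in labels if label == minority_label)
    n_maj = len(labels) - n_min
    if n_min == 0:
        raise ValueError("no minority samples to replicate")
    return math.ceil(n_maj / n_min)


def replicate_minority(samples, labels, copies, minority_label=1):
    """Return (samples, labels) in which every minority row appears
    `copies` times and every majority row appears once."""
    out_samples, out_labels = [], []
    for row, label in zip(samples, labels):
        repeat = copies if label == minority_label else 1
        for _ in range(repeat):
            out_samples.append(list(row))  # independent copy of the row
            out_labels.append(label)
    return out_samples, out_labels


# Toy data (hypothetical): seven non-default firm-years and one default.
toy_samples = [[0.1 * i, 1.0] for i in range(8)]
toy_labels = [0, 0, 0, 0, 0, 0, 0, 1]
copies = balancing_copies(toy_labels)
balanced_samples, balanced_labels = replicate_minority(toy_samples, toy_labels, copies)
```

With real data, looping `copies` over 1 to 70 would produce the series of training sets described above.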
3.6.3 Synthetic Minority Over-sampling Technique (SMOTE)
Another way to increase the number of default samples is the Synthetic Minority Over-sampling Technique (SMOTE) suggested by Chawla et al. (2002). The authors proposed an over-sampling approach in which the minority class is over-sampled by creating “synthetic” examples rather than by over-sampling with replication. This approach is inspired by a technique that proved successful in handwritten character recognition.
Firstly, the minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any or all of its k minority class nearest neighbors. Secondly, depending upon the amount of over-sampling required, neighbors are randomly chosen from the k nearest neighbors. For example, if nine nearest neighbors are used and the amount of over-sampling needed is 200%, only two of the nine nearest neighbors are chosen and one sample is generated in the direction of each. In this study, k is 25 for all groups.
The synthetic samples are generated in the following way: take the difference between the original minority sample and its nearest neighbor, multiply this difference by a random number between 0 and 1, and add it to the original sample. This selects a random point along the line segment between the two samples. In this study, the minority class in the training set was over-sampled at 100%, 200%, and so on up to 5400% of its original size. The algorithm is described in Figure 3.3:
Algorithm SMOTE(T, N, k)
Input: Number of minority class samples T; amount of SMOTE N%; number of nearest neighbors k
Output: (N/100) * T synthetic minority class samples
( If N is less than 100%, randomize the minority class samples, as only a random percent of them will be SMOTEd. )
if N < 100
    then Randomize the T minority class samples
    T = (N/100) * T
    N = 100
endif
N = (int)(N/100)    ( The amount of SMOTE is assumed to be in integral multiples of 100. )
k = Number of nearest neighbors
numattrs = Number of attributes
Sample[ ][ ]: array for original minority class samples
newindex: keeps a count of the number of synthetic samples generated, initialized to 0
Synthetic[ ][ ]: array for synthetic samples
( Compute k nearest neighbors for each minority class sample only. )
for i ← 1 to T
    Compute k nearest neighbors for i, and save the indices in nnarray
    Populate(N, i, nnarray)
endfor

Populate(N, i, nnarray)    ( Function to generate the synthetic samples. )
while N ≠ 0
    Choose a random number between 1 and k, call it nn. This step chooses one of the k nearest neighbors of i.
    for attr ← 1 to numattrs
        Compute: dif = Sample[nnarray[nn]][attr] − Sample[i][attr]
        Compute: gap = random number between 0 and 1
        Synthetic[newindex][attr] = Sample[i][attr] + gap * dif
    endfor
    newindex++
    N = N − 1
endwhile
return    ( End of Populate. )
Figure 3.3 Synthetic Minority Over-sampling Technique
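The pseudocode in Figure 3.3 can be turned into a short runnable sketch. The version below is illustrative only and is not the implementation used in this study: the function name `smote`, the `seed` parameter, the use of squared Euclidean distance for the neighbor search and the capping of k at the number of available neighbors are all assumptions. It also draws one `gap` per synthetic sample, so that each new point lies on the line segment described in the text, whereas the pseudocode as printed draws `gap` inside the attribute loop.

```python
import random


def smote(samples, n_percent, k, seed=None):
    """Generate synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    pool = [[float(v) for v in row] for row in samples]
    if len(pool) < 2:
        raise ValueError("SMOTE needs at least two minority samples")

    indices = list(range(len(pool)))
    if n_percent < 100:
        # Only a random fraction of the minority samples is SMOTEd.
        rng.shuffle(indices)
        indices = indices[: int(len(pool) * n_percent / 100)]
        n_percent = 100
    per_sample = int(n_percent / 100)  # synthetic samples per original
    k = min(k, len(pool) - 1)          # assumption: cap k at pool size - 1

    synthetic = []
    for i in indices:
        x = pool[i]
        # Rank all other minority samples by squared Euclidean distance.
        others = [j for j in range(len(pool)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(x, pool[j])))
        nnarray = others[:k]
        for _ in range(per_sample):
            neighbor = pool[rng.choice(nnarray)]
            gap = rng.random()  # random number between 0 and 1
            synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Toy minority class (hypothetical values): corners of the unit square.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(toy_minority, n_percent=200, k=2, seed=0)
```

At 200% the sketch returns two synthetic rows per original minority sample; looping `n_percent` over 100, 200, ..., 5400 with k = 25 would mirror the settings reported above.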
Chawla et al. (2002) argued that SMOTE outperformed replication. In this research, both methods are applied in constructing the prediction model, and the performance of each method is compared.
3.3 ROC Curve
3.3.1 Concept and methodology of ROC curve
Assessment of predictive accuracy is an important aspect of evaluating and comparing the models, algorithms or technologies that produce predictions. Receiver operating characteristic (ROC) curves are a common, widely applicable method for assessing the accuracy of tests because they provide a comprehensive and visually attractive way to summarize the accuracy of predictions. Therefore, this research applies ROC curves to assess and compare the accuracy of the default probability predictions produced by grey system analysis.
ROC curves generalize contingency table analysis by providing information on the performance of a model at any cut-off that might be chosen (Green and Swets, 1966; Hanley, 1989; Pepe, 2002; Swets, 1988; Swets, 1996). In the simplest case, the model produces only two ratings (Bad/Good), which are shown along with the actual outcomes (default/no default) in tabular form. The cells in the table indicate the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), respectively. FN represents a Type I error and FP represents a Type II error. These quantities are presented in table 3.
TP: a predicted default that actually occurs;
TN: a predicted non-default where the company indeed does not default;
FP: a predicted default that does not occur; and
FN: a predicted non-default where the company actually defaults.
Figure 3.2: Schematic of a ROC
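The four cells defined above can be counted directly from paired predictions and outcomes. The following Python sketch is illustrative; the function names and the coding of default as a truthy value are assumptions. It also derives the hit rate TP/(TP+FN) and the false alarm rate FP/(FP+TN), the two quantities that a ROC curve plots against each other for each cut-off.

```python
def confusion_counts(actual_default, predicted_default):
    """Count TP, TN, FP and FN for one cut-off.
    Both inputs are sequences of truthy (default) / falsy (non-default) values."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(actual_default, predicted_default):
        if predicted:
            counts["TP" if actual else "FP"] += 1
        else:
            counts["FN" if actual else "TN"] += 1
    return counts


def hit_and_false_alarm_rates(counts):
    """Return (hit rate, false alarm rate) from the four cell counts."""
    hit_rate = counts["TP"] / (counts["TP"] + counts["FN"])
    false_alarm_rate = counts["FP"] / (counts["FP"] + counts["TN"])
    return hit_rate, false_alarm_rate


# Toy example (hypothetical): five firms, two of which actually default.
toy_actual = [1, 1, 0, 0, 0]
toy_predicted = [1, 0, 1, 0, 0]
toy_counts = confusion_counts(toy_actual, toy_predicted)
toy_hit_rate, toy_false_alarm_rate = hit_and_false_alarm_rates(toy_counts)
```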
3.3.2 Utilizing ROC curve to validate the model
One useful characteristic of the ROC curve is that the area under the curve (AUC) reflects how good the test is at distinguishing between firms that default and those that do not. The AUC serves as a single measure, independent of prevalence, that summarizes the discriminative ability of a prediction across the full range of cut-off points. The greater the AUC, the better the prediction. A perfectly discriminating test has an AUC of 1.0, while a completely useless test (one whose curve falls on the diagonal line) has an AUC of 0.5. A test with an area greater than 0.9 has high accuracy, while 0.7–0.9 indicates moderate accuracy and 0.5–0.7 indicates low accuracy.
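As a rough illustration of how the AUC can be obtained, the sketch below builds the ROC points by sweeping the cut-off over the predicted scores and then integrates with the trapezoidal rule. It is a generic sketch rather than the procedure used in this study: the function names are hypothetical, a higher score is assumed to mean a higher default probability, and the handling of the band boundaries (exactly 0.7 or 0.9) is an assumption because the text does not specify it.

```python
def roc_points(scores, actual_default):
    """Return ROC points (false alarm rate, hit rate), from (0, 0) to (1, 1)."""
    n_pos = sum(1 for a in actual_default if a)
    n_neg = len(actual_default) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both default and non-default firms")
    ranked = sorted(zip(scores, actual_default), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for idx, (score, actual) in enumerate(ranked):
        if actual:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last firm in a group of tied scores.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def accuracy_band(area):
    """Map an AUC value to the accuracy bands quoted in the text."""
    if area > 0.9:
        return "high"
    if area >= 0.7:
        return "moderate"
    if area >= 0.5:
        return "low"
    return "worse than the diagonal"


# Toy scores (hypothetical): one defaulting firm is ranked below a survivor.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_outcomes = [1, 0, 1, 0]
toy_auc = auc(roc_points(toy_scores, toy_outcomes))
```

Three of the four (default, non-default) pairs in the toy data are ranked correctly, which is why the area comes out at 0.75.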
The AUC is calculated for each setting of the time period (four consecutive years of data) and of X0 (the sequence of characteristic data describing the system’s behavior). This value is the basis for evaluating and comparing the accuracy of the Grey System Theory application with different numbers of variables.
3.3 Summary
In this chapter, the author introduces the main methodology adopted in the research, grey system analysis, and the way this method is applied to forecast a company's default probability. By calculating the synthetic degree of incidence of the considered firms and combining these values, the default probability of each firm is identified. In addition, the over-sampling technique is used before applying Grey Theory to address the imbalanced data problem. After the different models are calculated, ROC curves are used to compare the results with the previous study.
CHAPTER 4: DATA COLLECTION
4.1 Data collection
4.1.1 Source and validity of data
Data in this research were gathered from the COMPUSTAT Industrial File (Wharton Research Data Services) as well as the Center for Research in Security Prices (CRSP) for construction companies in the U.S. The research concentrated on construction contractors with December fiscal year-ends by choosing firms with SIC codes between 1500 and 1799. Similar to the research of Severson et al. (1993) and Russell and Zhai (1996), the sample contractors fall into three construction categories:
Major Group 15: Building construction, general contractors and operative builders. The construction of buildings subsection includes establishments involved in constructing residential, industrial, commercial, and institutional buildings.
Major Group 16: Heavy construction other than building construction contractors. The heavy and civil engineering subsection includes establishments involved in infrastructure projects.
Major Group 17: Construction special trade contractors. The specialty trade contractors engage in activities such as plumbing, electrical work, masonry, carpentry, and roofing that are generally needed in the construction of all building types.
4.1.2 Principles of collecting data
The selection of firms is confined to the construction industry. A total of 92 companies were selected for the research, of which 24 defaulted. The observed period was 1972-2008. Following Chin (2009), Tserng et al. (2008) and Tserng et al. (2009), two main criteria were used to select the samples:
1. Companies that do not have financial statements for at least 5 years are removed from the sample.
2. Default firms are defined by CRSP delisting codes of 400 and 550 to 585, which correspond to delisting reasons related to company failure, such as bankruptcy, liquidation, or poor performance.
The chosen firms must have at least five years’ data in the Compustat Industrial File to ensure that all the unhealthy firms will be excluded from the study population, as well as to consider the impact of market factors on these companies over the long term.
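The two selection criteria can be expressed as a small filter. The sketch below is hypothetical: the record fields `name`, `years` and `delisting_code`, the function names and the toy firms are invented for illustration and do not reflect the actual COMPUSTAT or CRSP file layout; only the five-year minimum and the delisting codes 400 and 550 to 585 come from the text.

```python
MIN_YEARS = 5  # minimum number of years with financial statements


def is_default_code(delisting_code):
    """True for CRSP delisting codes 400 and 550 to 585 (company failure)."""
    if delisting_code is None:
        return False
    return delisting_code == 400 or 550 <= delisting_code <= 585


def select_sample(firms):
    """Drop firms with fewer than MIN_YEARS of data and label the rest
    as default or non-default."""
    selected = []
    for firm in firms:
        if len(firm["years"]) < MIN_YEARS:
            continue  # criterion 1: too few financial statements
        labelled = dict(firm)
        labelled["default"] = is_default_code(firm["delisting_code"])  # criterion 2
        selected.append(labelled)
    return selected


# Toy records (hypothetical firms and codes, for illustration only).
toy_firms = [
    {"name": "FIRM A", "years": list(range(1984, 1989)), "delisting_code": 574},
    {"name": "FIRM B", "years": list(range(1990, 1993)), "delisting_code": 400},
    {"name": "FIRM C", "years": list(range(1972, 2009)), "delisting_code": 100},
]
toy_sample = select_sample(toy_firms)
```

FIRM B is dropped because it has only three years of data, even though its delisting code would mark it as a default.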
4.1.3 Summary of the input data
This thesis covers a total of 24 failed companies among the 92 construction companies that were identified during the year of determination. These firms were chosen because they satisfy the two criteria above. A smaller sample, for example 50 firms, might have been used, but 92 firms are enough to improve the precision of the study. Determining the exact number of firms required is beyond the scope of this thesis.
Table 4-1 lists the failed firms.
Table 4-1: Information of the defaulted companies
ORD  CODE   COMPANY'S NAME                 DEFAULTED YEAR  OBSERVED FIRM-YEARS
1    60409  AMERICAN MEDICAL BLDGS INC     1989            1978-1989
2    85607  ATKINSON (G F) CO/CA           1997            1985-1997
3    63095  BANK BUILDING & EQUIP CORP AM  1989            1972-1989
4    11901  ENTRX CORP                     2004            1988-2004
5    86933  COMSTOCK GROUP INC             1988            1984-1988
6    22382  CERBCO INC -CL A               2000            1981-2000
7    55079  MORRISON KNUDSEN CORP OLD      1995            1972-1995
8    58641  CANISCO RESOURCES INC          1998            1982-1998
9    10036  NEUROTECH DEVELOPMENT CORP     1990            1986-1990
10   29621  DEVCON INTERNATIONAL CORP      2007            1987-2007
11   11109  CEC INDUSTRIES CORP            1994            1987-1994
12   80220  ABLE TELCOM HOLDING CORP       1999            1994-1999
13   76432  RYAN MURPHY INC                1994            1990-1994
14   76796  BUILDING MATERIALS HLDG CP     2007            1991-2007
15   77334  SHOLODGE INC                   2004            1992-2004
16   77831  XXSYS TECHNOLOGIES INC         1998            1992-1998
17   79017  TRANSCOR WASTE SERVICES INC    1997            1993-1997
18   10227  KIMMINS CORP                   1998            1986-1998
19   79815  COFLEXIP SA                    2000            1993-2000
20   79958  DAW TECHNOLOGIES INC           2000            1993-2000
21   82829  NESCO INC                      2000            1996-2000
22   82731  CHINA CONVERGENT CORP LTD      2000            1996-2000
23   85606  ENCOMPASS SERVICES CORP        2001            1997-2001
24   88642  DISTRIBUTED ENERGY SYS CORP    2007            2003-2007
4.2 Data classification
4.2.1 Collection of financial ratio data
Theodossiou (1991) claimed that the selection of the independent variables for a bankruptcy prediction model is the most difficult aspect of bankruptcy prediction because financial theory does not indicate which variables should be included in the model. The forward stepwise statistical procedure has been recognized as the most popular method used in previous studies for the development of bankruptcy prediction models. Owing to some specific properties of construction finance, the financial ratios in this research were collected following prior research (Mason and Harris (1979); Abidali (1990); Russell and Jaselskis (1992); Cheng, J. et al. (2009); Delcea, C. & Scarlat, E.) concerned with predicting the default probability of construction firms. In addition, the selected financial ratios must cover all aspects of a contractor's financial situation, including the liquidity, profitability, leverage and activity of a firm, and must also refer to market factors.
The last principle for selecting financial ratios is that every ratio must have a predicted relationship with the default risk.
4.2.2 Classification of selected financial ratios
Nineteen financial ratios, developed from the financial data of 92 construction firms across a 37-year period (1972-2008), were taken into account. These ratios are classified into the four categories (liquidity, leverage, activity, profitability) that are typically used in analyzing financial position:
Table 4-2: Selected ratios’ classification

1. Liquidity Ratios
No.  Symbol  Ratio
1    VAR1    Current Ratio
2    VAR2    Quick Ratio
3    VAR3    Net Working Capital to Total Assets
4    VAR4    Current Assets to Net Assets

2. Leverage Ratios
No.  Symbol  Ratio
5    VAR5    Total Liabilities to Net Worth
6    VAR6    Retained Earnings to Sales
7    VAR7    Debt Ratio
8    VAR8    Times Interest Earned

3. Activity Ratios
No.  Symbol  Ratio
9    VAR9    Revenues to Net Working Capital
10   VAR10   Accounts Receivable Turnover
11   VAR11   Accounts Payable Turnover
12   VAR12   Sales to Net Worth
13   VAR13   Quality of Inventory
14   VAR14   Turnover of Total Assets
15   VAR15   Revenues to Fixed Assets

4. Profitability Ratios
No.  Symbol  Ratio
16   VAR16   ROA
17   VAR17   ROE
18   VAR18   ROS
19   VAR19   Profits to Net Working Capital
4.3 Financial ratios’ definition
The definition and the expected sign of the 19 selected variables are given in Table 4-3 below:
Table 4-3: Definition and usage of ratios
Var. | Ratio | Definition | Usage | Sign
1 | Current Ratio | Current assets / Current liabilities | A liquidity ratio that measures a company's ability to pay short-term obligations. |
2 | Quick Ratio | | An indicator of a company's short-term liquidity; measures a company's ability to meet its short-term obligations with its most liquid assets. |
3 | Net Working Capital to Total Assets | | Measures both a company's efficiency and its short-term financial health. | +
4 | Current Assets to Net Assets | | Indicates how effectively a company is using its assets to generate cash before contractual obligations must be paid. | +
5 | Total Liabilities to Net Worth | Total liabilities / Net worth | Indicates the extent to which a company is utilizing its re-investment | -
6 | Retained Earnings to Sales | | Indicates how effectively reinvested into the company. | +
7 | Debt Ratio | Total liabilities / Total assets | Indicates what proportion of debt a company has relative to its assets. | -
8 | Times Interest Earned | | Measures a company’s ability to honor its debt payments. | +
9 | Revenues to Net Working Capital | | Measures a company's ability to honor its debt payment |
10 | Accounts Receivable Turnover | | Measures the number of times that accounts receivable amount is collected throughout the year. |
11 | Accounts Payable Turnover | | A short-term liquidity measure used to quantify the rate at which a company pays off its suppliers. | -
12 | Sales to Net Worth | Net sales / average net worth | Measures the number of times working capital turns over annually in relation to net sales. |
13 | Quality of Inventory | | Shows the intensity with which the firm uses assets in generating product. The inventory quality ratio is the ratio of the active inventory dollars to total inventory dollars. |
14 | Turnover of Total Assets | | Simply compares the turnover with the assets that the business has used to generate that turnover. |
15 | Revenues to Fixed Assets | | Measures a company's ability to generate net sales from fixed-asset investments. |
16 | ROA | | An indicator of how profitable a company is relative to its total assets. ROA gives an idea as to how efficient management is at using its assets to generate earnings. | +
17 | ROE | Profit before interest and taxes / Equity | Measures a corporation's profitability by revealing how much profit a company generates with the money shareholders have invested. Indicates performance and potential for growth. | +
18 | ROS | Profit before interest and taxes / Total sales | Evaluates a company's operational efficiency. | +
19 | Profits to Net Working Capital | | Evaluates the efficiency of a company's investment. | +
Table 4-3 lists the 19 chosen financial ratios, which are classified into 4 groups: liquidity, leverage, activity and profitability. The table also gives the expected relationship between each accounting ratio and the default probability: the symbol (+) means that the larger the ratio’s value, the healthier the company's financial position (a decrease in the default probability), while the symbol (-) signifies an increase in the default probability given an increase in the explanatory variable.
The first group includes four financial ratios: Current ratio (VAR1), Quick ratio (VAR2), Net working capital to total assets (VAR3) and Current assets to net assets (VAR4). These ratios were used as variables for liquidity, which measures a firm’s ability to meet its short-term obligations. Moyer and Chatfield (1983) propose a negative effect of liquidity on bankruptcy because high liquidity implies a low level of short-term obligations and low default risk. Therefore, in this research the liquidity group is treated as a positive symptom: the more liquid the firm, the better its financial situation.
Leverage ratios, categorized as the second group, measure the extent to which a company has been financed by debt and shareholder funds. This kind of ratio reflects the company’s ability to meet both short-term and long-term debt obligations.
Leverage ratios are computed either by comparing earnings from the income statement to interest payments or by relating the debt and equity items from the balance sheet. In this category, the higher the Total liabilities to net worth ratio (VAR5), the less protection there is for creditors. If total liabilities exceed net worth, then creditors have more at stake than stockholders. The debt ratio (VAR7) can help investors determine a company's level of risk. A debt ratio greater than 1 indicates that a company has more debt than assets. VAR5 and VAR7 represent negative symptoms: the higher their values, the worse the financial situation of the firm. The two remaining ratios in this group, Retained earnings to sales (VAR6) and Times interest earned (VAR8), are considered positive symptoms. Retained earnings are typically reinvested into the company, so the higher the value, the better. VAR8 is a useful tool for measuring a company's ability to meet its debt obligations. When the interest coverage ratio is smaller than 1, the company is not generating enough earnings before interest and taxes (EBIT) from its operations to meet its interest obligations, and it is a warning sign when interest coverage falls below 2.5x.
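The leverage thresholds discussed above can be checked with a few lines of Python. This is a hedged sketch: the function and flag names are invented, the input figures are hypothetical, and Times interest earned is computed as EBIT divided by interest expense, which is the usual textbook definition and is assumed here because the corresponding definition is not legible in Table 4-3.

```python
def leverage_ratios(total_liabilities, net_worth, total_assets, ebit, interest_expense):
    """Compute VAR5, VAR7 and VAR8 from balance sheet and income statement items."""
    return {
        "liabilities_to_net_worth": total_liabilities / net_worth,  # VAR5
        "debt_ratio": total_liabilities / total_assets,             # VAR7
        "times_interest_earned": ebit / interest_expense,           # VAR8 (assumed definition)
    }


def leverage_flags(ratios):
    """Return the warning signs named in the text that apply to these ratios."""
    flags = []
    if ratios["liabilities_to_net_worth"] > 1:
        flags.append("liabilities_exceed_net_worth")
    if ratios["debt_ratio"] > 1:
        flags.append("more_debt_than_assets")
    if ratios["times_interest_earned"] < 1:
        flags.append("ebit_does_not_cover_interest")
    elif ratios["times_interest_earned"] < 2.5:
        flags.append("interest_coverage_below_2.5x")
    return flags


# Hypothetical contractor (figures in any consistent currency unit).
toy_ratios = leverage_ratios(
    total_liabilities=600.0, net_worth=400.0, total_assets=1000.0,
    ebit=50.0, interest_expense=40.0,
)
toy_flags = leverage_flags(toy_ratios)
```

The sketch does not guard against zero net worth or zero interest expense, which would need separate handling with real data.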
The third group, activity ratios, measures the intensity with which the firm uses assets in generating sales and shows how well a company has been using its resources. These ratios indicate whether the firm’s investment in current and long-term assets is too large, too small, or just right. If too large, funds may be tied up in assets that could be used more productively. There are two basic approaches to the computation of activity ratios: the first looks at the average performance of the firm over the year and the second uses year-end balances in the calculations. In this group, Accounts payable turnover (VAR11) shows investors how many times per period the company pays its average payable amount. If the turnover ratio is decreasing from one period to another, the company is taking longer to pay off its suppliers than before. The opposite is true when the turnover ratio is rising, which means that the company is paying off suppliers at a faster rate. A higher value of the ratio implies a greater exposure to financial risk; therefore, VAR11 is considered a negative symptom. The other ratios in this group stand for positive symptoms. For example, a high accounts receivable turnover ratio (VAR10) indicates a tight credit policy, whereas a low or declining accounts receivable turnover ratio indicates a collection problem, part of which may be due to bad debts. A higher fixed-asset turnover ratio (VAR15) shows that the company has been more effective in using its investment in fixed assets to generate revenues.
The last category, profitability ratios, measures the overall performance, or returns, which management has been able to achieve. Profit is a crucial goal of a firm, so poor performance signals an imminent collapse. As can be seen in Table 4-2, the four selected profitability ratios are ROA, ROE, ROS and Profits to net working capital. Improvement in the ROE ratio (VAR17) implies improved marketing, improved productivity, or improvement in both, and thus requires further investigation. Improvement in the ROA ratio (VAR16) implies a strengthening of marketing effectiveness. ROS (VAR18) is helpful to management, providing insight into how much profit is being produced per dollar of sales. An increasing ROS indicates that the company is becoming more efficient, while a decreasing ROS could signal looming financial troubles. The four ratios of the profitability category are recognized as positive symptoms.
4.4 Data analysis process
In order to add time series information to the business failure prediction model, Visual Basic for Applications (VBA) embedded in Microsoft Excel 2003 was used to build the grey analysis prediction model. Prediction values based on four years of historical data were computed and plotted as ROC curves. The area under each ROC curve was then compared with the area under the ROC curve of the previous research, which used the same historical data, and comparisons and conclusions are drawn from the results.
4.5 Summary
This chapter described the input data collection process. Firstly, the financial statements and the actual default or non-default status of 92 construction companies were collected according to the Standard Industrial Classification. Secondly, the 19 selected financial ratios were chosen based on specific principles and divided into four typical groups according to their characteristics. The procedure of data analysis is also depicted to show all key steps clearly.
Figure 4.1: The algorithm chart of the data analysis process (4-year continuing data, θ = 0.5; compute default probability; utilize ROC curve; compare predictive accuracy by AUC; make conclusions)
CHAPTER 5: DATA ANALYSIS AND RESULTS
5.1 Data analysis
5.1.1 Example analysis
As mentioned in Chapter 3 (Methodology), the analysis was conducted at the level of a single firm, and the same procedure was then applied to each of the considered firms. To make the proposed model easier to understand, a numerical example is developed below under the assumption that a set of F = 15 firms is analyzed, for five