3.3 Problem Formulation
tures contained in the documents. The ranking model considers the relative order based on all the training instances rather than on the explicit value assigned at a specific instance.
In this part of the study, we explore the relative risks among the selected companies.
Our goal is to rank companies based on the stock return volatilities associated with their respective financial reports. Following the work in [24], let S_t be the price of a stock at time t. Holding the stock for one period from time t − 1 to time t results in a simple net return of R_t = S_t/S_{t−1} − 1 [25]. The volatility of returns for a stock from time t − n to t can be defined as follows:
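The return and volatility computation can be sketched as follows. This is a minimal illustration, not the thesis's own code: it assumes that volatility is the sample standard deviation of the simple net returns over the window, and the function names `simple_net_returns` and `volatility` are hypothetical.

```python
import statistics


def simple_net_returns(prices):
    """Return R_t = S_t / S_{t-1} - 1 for each pair of consecutive prices."""
    return [cur / prev - 1.0 for prev, cur in zip(prices, prices[1:])]


def volatility(returns):
    """Sample standard deviation of the returns in the window.

    Assumption: the sample (n - 1 denominator) form is used; the source
    does not show the exact normalisation. Needs at least two returns.
    """
    return statistics.stdev(returns)


prices = [100.0, 110.0, 99.0, 108.9]   # illustrative closing prices
returns = simple_net_returns(prices)   # roughly [0.1, -0.1, 0.1]
print(volatility(returns))
```

A window of n + 1 prices yields n returns, so at least three prices are needed before a volatility can be computed.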
The volatilities of n stocks are then classified into 2ℓ + 1 risk levels, where n, ℓ ∈ {1, 2, 3, ...}. Let m be the sample mean and s be the sample standard deviation of the logarithm of the volatilities of the n stocks (denoted as ln(v)). The distribution of ln(v) across companies tends to be bell-shaped [11]. Therefore, given a volatility v, we derive the risk level r as
r = [...]
In this way, the volatilities of company stock returns within a year are classified into different risk levels, which can be considered as the relative difference of risk among the companies.
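One way to turn volatilities into 2ℓ + 1 levels is sketched below. The binning rule is an assumption made for illustration only, since the defining equation is not reproduced here: ln(v) is standardised with m and s, the middle level covers values within one standard deviation of the mean, and each further whole standard deviation moves one level outward, capped at ℓ. The function name `risk_levels` and the 1-to-(2ℓ + 1) numbering are hypothetical.

```python
import math
import statistics


def risk_levels(volatilities, ell=2):
    """Map each volatility to a level in 1 .. 2*ell + 1 (hypothetical rule)."""
    logs = [math.log(v) for v in volatilities]
    m = statistics.mean(logs)    # sample mean of ln(v)
    s = statistics.stdev(logs)   # sample standard deviation of ln(v)
    middle = ell + 1
    if s == 0:
        # All volatilities identical: everything sits at the middle level.
        return [middle] * len(logs)
    levels = []
    for x in logs:
        z = (x - m) / s
        k = min(int(abs(z)), ell)   # whole standard deviations from m, capped
        levels.append(middle + k if z > 0 else middle - k)
    return levels


vols = [math.exp(x) for x in (-2, -1, 0, 0, 0, 0, 1, 2)]
print(risk_levels(vols))
```

With ℓ = 2 this yields five levels, which matches the five classes used later in the cost-sensitive task, although the thesis's own cut points may differ.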
After classifying the volatilities of stock returns (of companies) into different risk levels, the ranking task can be defined as follows: Given a collection of financial reports D, we aim to rank the companies via a ranking model f : R^p → R such that the rank order of the set of companies is specified by the real value that the model f takes. Specifically, f(d_i) > f(d_j) is taken to mean that the model asserts that c_i ≻ c_j, where c_i ≻ c_j means that c_i is ranked higher than c_j; that is, company c_i is riskier than c_j. In this paper, we adopt Ranking SVM [10] for the ranking task.
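The pairwise reading of the ranking task can be illustrated with a small sketch. It does not implement Ranking SVM itself; it only builds the preference pairs implied by the risk levels and checks how many of them a given scoring function f orders correctly. The helper names `preference_pairs` and `pairwise_accuracy` are hypothetical.

```python
from itertools import combinations


def preference_pairs(levels):
    """Return index pairs (i, j) such that company i is riskier than company j."""
    pairs = []
    for i, j in combinations(range(len(levels)), 2):
        if levels[i] > levels[j]:
            pairs.append((i, j))
        elif levels[j] > levels[i]:
            pairs.append((j, i))
        # Companies at the same risk level impose no ordering constraint.
    return pairs


def pairwise_accuracy(scores, levels):
    """Fraction of preference pairs for which f(d_i) > f(d_j) holds."""
    pairs = preference_pairs(levels)
    if not pairs:
        return 0.0
    correct = sum(1 for i, j in pairs if scores[i] > scores[j])
    return correct / len(pairs)


levels = [3, 1, 2]          # illustrative risk levels
scores = [0.9, 0.5, 0.1]    # illustrative model outputs f(d_i)
print(pairwise_accuracy(scores, levels))
```

In this example two of the three pairs are ordered correctly, because the model places the level-1 company above the level-2 company.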
3.3.3 Cost-Sensitive Task
As described in [11], previous studies included both soft information (words) and hard information (i.e., each company's volatility over the twelve months before the report) in document vectors. More specifically, this method includes and evaluates soft information together with hard information. To accomplish this, both types of information are transformed into representative decimal values in document vectors, but they remain essentially different. We question the viability of combining hard and soft information in this manner, and believe that there should be a better way to combine them.
When investors plan to buy stocks, they want to evaluate their investment risk. It is difficult to evaluate risk as a precise real number; most risks are evaluated at relative levels. Therefore, we proposed an approach that addresses this problem using the concept of cost-sensitive ranking. That is, we used hard information as the cost weights in cost-sensitive learning techniques.
Problem Transformation
Due to the limitations of existing cost-sensitive tools, we applied a classification technique instead of traditional ranking models for the cost-sensitive task. Cost-sensitive ranking is transformed into a task that predicts the risk level of each company and then ranks the companies based on their expected risk levels. High-risk companies were selected to verify whether the associated risk levels could be classified correctly. The 5 classes of risk levels were based on the stock return volatilities of the previous year, where 5 represented the highest risk level and 1 the lowest. Under most conditions, estimation errors at higher risk levels were worse than those at lower risk levels. For instance, if the prediction error is 5%, the expected values for risk levels 1 and 5 would be 0.05 (0.05 × 1) and 0.25 (0.05 × 5), respectively. More incorrectly ranked pairs are observed when riskier companies are misclassified. To focus on risky companies, a cost-sensitive SVM classifier was applied. The cost-sensitive SVM classification was formulated as follows:
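$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C_{+}\sum_{i:\,y_i=+1}\xi_i + C_{-}\sum_{i:\,y_i=-1}\xi_i \quad \text{subject to}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,$$
where w and b define the separating hyperplane, ξ_i are slack variables, and C+ and C− are regularization parameters for the positive and negative classes, respectively. The following section describes how the weights were tuned.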
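The effect of separate regularization parameters for the two classes can be seen by evaluating a class-weighted soft-margin objective directly. The sketch below assumes the standard linear form (half the squared norm of the weight vector plus class-weighted hinge slack) and is not the solver used in the thesis; the function name `cost_sensitive_objective` is hypothetical.

```python
def cost_sensitive_objective(w, b, X, y, c_pos, c_neg):
    """Evaluate 0.5 * ||w||^2 plus class-weighted hinge slack.

    Assumption: a linear decision function w . x + b and labels in {+1, -1}.
    """
    regulariser = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for x, label in zip(X, y):
        decision = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack = max(0.0, 1.0 - label * decision)   # hinge slack xi_i
        cost = c_pos if label == 1 else c_neg      # class-specific cost
        penalty += cost * slack
    return regulariser + penalty


X = [[0.5, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, -1]
w, b = [1.0, 0.0], 0.0
print(cost_sensitive_objective(w, b, X, y, c_pos=1.0, c_neg=1.0))  # equal costs
print(cost_sensitive_objective(w, b, X, y, c_pos=4.0, c_neg=1.0))  # costly positives
```

Raising the cost of the positive class makes the same margin violation on a positive example more expensive, so an optimizer would trade errors on the cheaper class to avoid it.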
Weight Tuning
In cost-sensitive models, every instance (company) is assigned a weight for the training process. The weight is determined with Equations (3.8) and (3.9). Given a collection [...] a trading day of a specific company i, and cp_d ∈ R+ is the closing price on the day the financial report was published. DR_i is the collection of daily returns of company i, DR_i = {dr_{d−(n−1)}, dr_{d−(n−2)}, ..., dr_d}, where dr_n ∈ R is the daily return on day n. It is calculated with Equation (3.1), based on the CP_i of each company i. Let CW = {cw_1, cw_2, ..., cw_i} be the collection of company weights, where cw_i ∈ R+ is the weight of company i. The polarity weight pw controls the different weights for positive and negative returns. Expected values are calculated with Equation (3.8) to rank financial risk, for two reasons. Firstly, volatility is associated with financial risk [5]; high-risk companies are readily assigned high weight values when the polarity weight is not considered. Therefore, the higher the volatility, the higher the assigned weight. This configuration focuses on high-risk companies, which improves the predictive accuracy for high risk levels and effectively reduces the total number of prediction errors.
Secondly, inspired by [1], we examine whether low stock returns are associated with increased volatility with Equation (3.9).
cw_i = [...]
High and low daily returns are relative concepts in our model; hence, negative daily returns are treated as low daily returns. To identify whether financial risk correlates with negative returns, the polarity weight pw, which is controlled by α and β (α ∈ R+ and β ∈ R+), is applied to both negative and positive returns so that their respective effects can be observed. If the absolute value of the positive returns in DR_i is greater than the absolute value of the negative returns, cw_i increases when α < β.
In contrast, cw_i decreases when α > β. Companies are assigned different weights based on CW. The cost-sensitive model then focuses on the companies with higher assigned weights to reduce the error penalty.
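The stated behaviour of the polarity weight can be reproduced with a deliberately simple stand-in. The form below is hypothetical and is not Equation (3.8) or (3.9): it assumes that α scales negative daily returns, β scales positive ones, and the company weight is the sum of the scaled absolute returns. The comparison is made against the neutral setting α = β = 1 with α + β held fixed, which is also an assumption.

```python
def polarity_weighted_weight(daily_returns, alpha, beta):
    """Hypothetical company weight: sum of |dr| scaled by the polarity weight.

    Assumption: alpha applies to negative returns, beta to non-negative ones.
    """
    total = 0.0
    for dr in daily_returns:
        pw = alpha if dr < 0 else beta
        total += pw * abs(dr)
    return total


# Positive returns dominate in absolute value: 0.05 versus 0.03.
dr_i = [0.03, -0.01, 0.02, -0.02]
neutral = polarity_weighted_weight(dr_i, alpha=1.0, beta=1.0)
favour_positive = polarity_weighted_weight(dr_i, alpha=0.5, beta=1.5)
favour_negative = polarity_weighted_weight(dr_i, alpha=1.5, beta=0.5)
print(neutral, favour_positive, favour_negative)
```

Under these assumptions the weight rises above the neutral value when α < β and falls below it when α > β, which is the pattern described for companies whose positive returns dominate.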
Pair Ranking
The workflow consists of four steps: 1) assign risk levels to the selected companies based on their stock return volatilities, 2) set representative weights through weight tuning, 3) use the cost-sensitive classification outputs to calculate expected values, and 4) rank the selected companies based on the expected values. To obtain expected values for ranking pairs, the expected risk level of each company is calculated as follows:
$$\mathrm{EXP}_{\mathrm{Company}_i} = \sum_{n=1}^{5} \mathrm{Pro}_n \cdot n \tag{3.10}$$
where Pro_n represents the probability that company i is classified at risk level n. The probability Pro_n is the output of the cost-sensitive classification algorithm.
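The expected risk level and the resulting ranking can be sketched in a few lines. The class probabilities below are made-up illustrative values, and the function names `expected_risk_level` and `rank_by_expected_risk` are hypothetical.

```python
def expected_risk_level(probabilities):
    """Sum of n * Pro_n over risk levels n = 1 .. len(probabilities)."""
    return sum(n * p for n, p in enumerate(probabilities, start=1))


def rank_by_expected_risk(class_probabilities):
    """Return company names ordered from highest to lowest expected risk level."""
    scores = {name: expected_risk_level(p) for name, p in class_probabilities.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative classifier outputs: probabilities for risk levels 1 to 5.
class_probabilities = {
    "A": [0.1, 0.1, 0.2, 0.3, 0.3],
    "B": [0.6, 0.2, 0.1, 0.1, 0.0],
}
print(expected_risk_level(class_probabilities["A"]))
print(rank_by_expected_risk(class_probabilities))
```

Because the expected value is a real number rather than a class label, two companies predicted to be in the same class can still be ordered relative to each other.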
Generally, using classification results to calculate an expected value is inappropriate, because classification only identifies which category a new observation belongs to, and the categories usually have no numerical relationship to one another. In our study, however, the categories are the risk levels of the companies and are therefore ordered. Consequently, the application of Equation (3.10) is considered reasonable in our case.
Chapter 4
Experimental Results
4.1 Experimental Settings
In our experiments, the target value is the volatility of each company's stock returns in the following year, and the training data comprise the soft and hard information presented in the financial reports. Specifically, the training data are obtained from financial reports published over the preceding five years, and the reports of the following year are defined as the test data. For example, if the training data consist of reports from 1996 to 2000, the test data are the reports published in 2001. While previous studies focused on the application of regression [11], our work extends this setting to include a ranking approach. In addition, we propose additional considerations for sentiment and cost-sensitive approaches. These approaches are illustrated in Figures 4.1, 4.2, and 4.3.
Comparisons between the proposed and standard approaches are described in this section.
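The five-year training window followed by a one-year test set can be sketched as below. Sliding the window forward one year at a time is an assumption for illustration; the text only gives the 1996 to 2000 versus 2001 example. The function name `rolling_splits` is hypothetical.

```python
def rolling_splits(first_year, last_year, train_span=5):
    """Yield (training years, test year) pairs with a sliding training window."""
    splits = []
    for test_year in range(first_year + train_span, last_year + 1):
        train_years = list(range(test_year - train_span, test_year))
        splits.append((train_years, test_year))
    return splits


for train_years, test_year in rolling_splits(1996, 2002):
    print(train_years, "->", test_year)
```

With reports from 1996 to 2002, this produces two splits, the first of which trains on 1996 to 2000 and tests on 2001.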
Figure 4.1: Training instances in Regression Model
Figure 4.2: Training instances in Ranking Model
Figure 4.3: Training instances in Cost-sensitive Ranking Model