
Figure 4.4: Number of Occurrences of the Top 10 Weighted Terms Learned (y-axis: occurrences of the top 10 weighted terms among the 6 models; legend: Ranking, Regression; x-axis: the stemmed terms incorrectli, fide, noncompli, incompat, inasmuch, encumbr, unreimburs, brilliant, disappoint, cutback, malfunct, indefeas, concern, sever, disput, benefici, unabl, wherebi, discontinu, sureti, delist, default, forbear, deficit, and amend)

Task (Features)        2001      2002      2003      2004      2005      2006      Micro-avg
Ranking
  Kendall's Tau
    SEN                0.62537   0.62815   0.60436   0.58887   0.60226   0.58605   0.60584
    COST               0.62557   0.63328   0.604504  0.58995   0.60950   0.58768   0.60841
  Spearman's Rho
    SEN                0.62538   0.62815   0.60437   0.58887   0.60227   0.58605   0.60585
    COST               0.65601   0.66352   0.63537   0.61922   0.63745   0.61369   0.63754

Table 4.4: Experimental Results of the Sentiment-Based (SEN) and Cost-Sensitive-Based (COST) Methods

In contrast, since only sentiment words were used to train the model, it is reasonable that these terms are more highly related to financial risk.

4.3 Cost-Sensitive Based Model

The models trained on the sentiment lexicon are denoted SEN, and COST denotes the cost-sensitive approach built on top of SEN. Table (4.4) lists the results obtained with SEN and COST. Because of restrictions of the collected data, there are some observable differences between the SEN and COST training data; as a result, the SEN baseline presented in Table (4.3) differs from that in Table (4.4). However, all baselines were produced in the same way. The best results in Table (4.4) are shown in boldface. All of the COST results are better than the corresponding SEN results, and this improvement over the SEN baseline is statistically significant.
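As a concrete illustration of how the rank-correlation numbers in Table (4.4) can be computed, the following minimal sketch uses SciPy's kendalltau and spearmanr; the arrays true_risk and predicted_score are hypothetical placeholders for one year's ground-truth risk values and model outputs, not the thesis data.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# Hypothetical stand-ins for one year's ground-truth risk values and model scores.
true_risk = np.array([0.42, 0.31, 0.55, 0.18, 0.27])
predicted_score = np.array([0.40, 0.35, 0.60, 0.20, 0.22])

# Rank-correlation metrics of the kind reported in Table (4.4).
tau, _ = kendalltau(true_risk, predicted_score)
rho, _ = spearmanr(true_risk, predicted_score)
print(f"Kendall's tau = {tau:.5f}, Spearman's rho = {rho:.5f}")
```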

Figure 4.5: Highly Weighted Terms Learned from the 6 Ranking Models Using Original Texts (ORG) and Only Sentiment Words (SEN)

Influence of Weight on Ranking Pairs

We used Equation (3.10) to calculate expected values and ranked the pairs based on these values. Thus, the impact of misclassifying high-risk companies is greater than that of misclassifying low-risk companies; an error at a high-risk level distorts the ranking more than an error at a low-risk level. For example, suppose the expected-value differences are 0.5 and 0.1: the number of pairs covered by a difference of 0.5 is larger than the number covered by a difference of 0.1, so errors that occur at high-risk levels produce more discordant pairs. In cost-sensitive learning techniques, the weight of an instance represents its importance in the training data. As described in Equation (3.6), companies are assigned different risk levels according to the volatilities of their stock returns, and riskier companies are assigned higher weights based on the relationships described in Equation (3.8). A sketch of this idea is given below.
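The following sketch illustrates the general idea under stated assumptions: since Equations (3.6), (3.8), and (3.10) are not reproduced in this section, the quantile-based risk-level assignment, the weight function, and the expected-value computation below are hypothetical stand-ins rather than the thesis formulas.

```python
import numpy as np

def assign_risk_levels(volatilities, n_levels=5):
    """Bucket companies into risk levels by volatility quantiles (hypothetical stand-in for Eq. (3.6))."""
    edges = np.quantile(volatilities, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(volatilities, edges) + 1  # levels 1..n_levels

def instance_weight(level, daily_return, alpha=1.0):
    """Riskier companies receive larger training weights (hypothetical stand-in for Eq. (3.8))."""
    return level * (1.0 + alpha * abs(daily_return))

def expected_risk(class_probs):
    """Expected value over the risk levels (hypothetical stand-in for Eq. (3.10)); pairs are ranked by this value."""
    levels = np.arange(1, len(class_probs) + 1)
    return float(np.dot(levels, class_probs))

vols = np.array([0.12, 0.25, 0.40, 0.08, 0.33, 0.19])
levels = assign_risk_levels(vols)
weights = [instance_weight(l, r) for l, r in zip(levels, [-0.02, 0.01, -0.05, 0.00, 0.03, -0.01])]

# A misclassification at a high risk level shifts the expected value more,
# producing more discordant pairs than an error at a low risk level.
print(levels, np.round(weights, 3))
print(expected_risk([0.1, 0.2, 0.4, 0.2, 0.1]), expected_risk([0.0, 0.1, 0.3, 0.3, 0.3]))
```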

Analysis of Instance Weight

Based on Equations (3.8) and (3.9), the influence of positive and negative returns was evaluated. To determine the significance of the results, permutation tests were conducted with different parameter settings. Negative and positive returns were assigned different multipliers to verify whether there were differences in their respective strengths.


Figure 4.6: Comparison of the P-values of the Cost-Sensitive Model under Different Weights and the Sentiment Model (p-value axis ranging from 0 to 0.14)

In these experiments, the sum of α and β is 200, giving a total of 201 tests: (α = 0, β = 200), (α = 1, β = 199), ..., (α = 200, β = 0). Figure (4.6) illustrates part of the 201 tests (from (β = 102, α = 98) to (β = 152, α = 48)). By setting Equations (3.8) and (3.9) as the instance weights, consistently good results were obtained, which outperform those trained with SEN. In this section, all of the COST results are better than the SEN results, and 60% of them passed the permutation tests at the 0.05 significance level.
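For reference, a paired sign-flip permutation test in the spirit of the significance testing described above could look like the sketch below; the exact protocol and the α/β weighting used in the thesis may differ, and the example scores are taken from the Kendall's tau rows of Table (4.4) purely for illustration.

```python
import numpy as np

def paired_permutation_test(cost_scores, sen_scores, n_permutations=10000, seed=0):
    """Sign-flip permutation test on paired score differences (COST minus SEN)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(cost_scores) - np.asarray(sen_scores)
    observed = diffs.mean()
    count = 0
    for _ in range(n_permutations):
        signs = rng.choice([-1, 1], size=diffs.shape)  # randomly swap which system each paired score came from
        if (signs * diffs).mean() >= observed:
            count += 1
    return count / n_permutations

# Yearly Kendall's tau values from Table (4.4), used here only as example inputs.
cost = [0.62557, 0.63328, 0.604504, 0.58995, 0.60950, 0.58768]
sen  = [0.62537, 0.62815, 0.60436, 0.58887, 0.60226, 0.58605]
print(paired_permutation_test(cost, sen))  # a small p-value (e.g., < 0.05) favors COST over SEN
```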

In general, these results demonstrate that using daily returns as cost weights may be an effective way to combine soft and hard information, instead of placing both into a document vector. We also verified the claim that "the literature documents that low stock returns are associated with increased volatility" [1]. Furthermore, the Cost-Sensitive model outperformed the Sentiment model more often when β > α than when β < α. These observations are consistent with the findings of the financial analysis.


Chapter 5 Conclusions

In conclusion, the first part of this study demonstrates the importance of the sentiment lexicon in financial reports associated with financial risks. The sentiment-based method uses far fewer words to train the SVM models (i.e., regression and ranking) and is still capable of producing comparable results. That is, a finance-specific lexicon better represents the terminology that commonly appears in financial reports, and it provides more meaningful information to investors. Among the results trained with ORG and SEN, ORG identified representative words such as "ceg", "nasdaq", and "gnb", whereas SEN provided representative words such as "amend", "deficit", and "profit".

The terms identified with SEN were considered more intuitive than those in ORG for explaining why these words are associated with stock risk.

The second part proposes a new approach to combining soft and hard information.

The cost-sensitive ranking task is transformed into a cost-sensitive multi-class classification problem, and the instances are then ranked by their expected values. We also examined the influence of the learned weights and found that "low stock returns are associated with increased volatility," which is consistent with the descriptions in [1]. The cost-sensitive ranking method yields a significant improvement. Overall, these findings provide more insight into the soft and hard information in financial reports. In addition, the results suggest that hard information can be utilized as the cost weights of learning techniques. In future work, we plan to extend the set of sentiment words associated with stock risks. We will also explore how to apply deep-learning techniques to this problem in order to learn more fine-grained financial sentiment lexicons.


[1] J. Bae, C.-J. Kim, and C. R. Nelson. Why are stock returns and volatility negatively correlated? Journal of Empirical Finance, 14(1):41–58, 2007.

[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[3] K.-T. Chen, T.-J. Chen, and J.-C. Yen. Predicting future earnings change using numeric and textual information in financial reports. In Intelligence and Security Informatics, pages 54–63. Springer, 2009.

[4] N. Chen, A. S. Vieira, J. Duarte, B. Ribeiro, and J. C. Neves. Cost-sensitive learning vector quantization for financial distress prediction. In Progress in Artificial Intelligence, pages 374–385. Springer, 2009.

[5] R. Engle. Risk and volatility: Econometric models and financial practice. American Economic Review, pages 405–420, 2004.

[6] R. Feldman. Techniques and applications for sentiment analysis. Communications of the ACM, 56(4):82–89, 2013.

[7] G. P. C. Fung, J. X. Yu, and W. Lam. Stock prediction: Integrating text mining approach using real-time news. In Computational Intelligence for Financial Engineering, pages 395–402. IEEE, 2003.

[8] D. Garcia. Sentiment during recessions. The Journal of Finance, 68(3):1267–1300, 2013.

[9] T. Joachims. Making large scale SVM learning practical. Technical report, Universität Dortmund, 1999.

[10] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

[11] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith. Predicting risk from financial reports with regression. In The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 272–280. ACL, 2009.

[12] A. J. Lee, M.-C. Lin, R.-T. Kao, and K.-T. Chen. An effective clustering approach to stock market prediction. In Pacific Asia Conference on Information Systems, pages 345–354, 2010.

[13] H.-T. Lin. A simple cost-sensitive multiclass classification algorithm using one-versus-one comparisons. Technical report, National Taiwan University, 2010.

[14] T. Loughran and B. McDonald. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1):35–65, 2011.

[15] S. M. Mohammad and P. D. Turney. Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 26–34. ACL, 2010.

[16] R. Narayanan, B. Liu, and A. Choudhary. Sentiment analysis of conditional sentences. In Conference on Empirical Methods in Natural Language Processing, volume 1, pages 180–189. ACL, 2009.

[17] A. Nikfarjam, E. Emadzadeh, and S. Muthaiyah. Text mining approaches for stock market prediction. In International Conference on Computer and Automation Engineering, volume 4, pages 256–260. IEEE, 2010.

[18] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In Language Resources and Evaluation Conference, volume 10, pages 1320–1326, 2010.

[19] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.

[20] M. A. Petersen. Information: Hard and soft. Working paper, Northwestern University, 2004.

[21] R. P. Schumaker and H. Chen. Textual analysis of stock market prediction using breaking financial news: The AZFinText system. ACM Transactions on Information Systems, 27(2):1–29, 2009.


[22] A. Smola and V. Vapnik. Support vector regression machines. Advances in neural information processing systems, 9:155–161, 1997.

[23] S. Takahashi, M. Takahashi, H. Takahashi, and K. Tsuda. Analysis of stock price return using textual data and numerical data through text mining. In Knowledge-Based Intelligent Information and Engineering Systems, pages 310–316. Springer, 2006.

[24] M.-F. Tsai and C.-J. Wang. Risk ranking from financial reports. In Advances in Information Retrieval, pages 804–807. Springer, 2013.

[25] R. S. Tsay. Analysis of financial time series, volume 543. John Wiley & Sons, 2005.

[26] B. Wuthrich, V. Cho, S. Leung, D. Permunetilleke, K. Sankaran, and J. Zhang. Daily stock market forecast from textual web data. In International Conference on Systems, Man, and Cybernetics, volume 3, pages 2720–2725. IEEE, 1998.
