
Exploring Financial Risk via Text Mining Approaches - 政大學術集成 (NCCU Academic Hub)


Academic year: 2021


Department of Computer Science, National Chengchi University (國立政治大學資訊科學系)
Master's Thesis

Exploring Financial Risk via Text Mining Approaches
(以文字探勘為基礎之財務風險分析方法研究)

Student: 劉澤
Advisor: 蔡銘峰

July 2015 (ROC year 104)


Exploring Financial Risk via Text Mining Approaches

Student: Tse Liu
Advisor: Ming-Feng Tsai

A thesis submitted to the Department of Computer Science, National Chengchi University, in partial fulfillment of the requirements for the degree of Master in Computer Science.

July 2015 (ROC year 104)


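Acknowledgements

As this thesis nears completion, I am filled with gratitude. Looking back on my studies, I received help and encouragement from many teachers and friends. The process involved many difficulties, but I cherish the growth it brought. For the completion of this thesis, I would first like to thank my advisor, Professor Ming-Feng Tsai (蔡銘峰), who pointed me in a clear direction and gave me much valuable advice during the writing and research discussions, and who also gave me the experience of presenting a paper at a conference. Beyond academic expertise, Professor Tsai taught me a great deal about how to treat people and handle affairs and helped me develop the right attitude. I am sincerely grateful for his guidance and support.

For the oral defense, I especially thank Professor 王釧茹 and Professor 沈錳坤, who took time out of their busy schedules to review my thesis and offer suggestions that made it more complete. I sincerely thank my senior labmate 志明 for his assistance and guidance throughout my research and studies, which helped me overcome many challenges. I also thank my lab partners 禔多, 哲立, and 心萍 for their care and support in research life. Finally, I thank my beloved family for supporting me all the way through graduate school and for their care and patience, which allowed me to complete my graduate studies. May you always be healthy and happy.

劉澤 (Tse Liu)
Department of Computer Science, National Chengchi University
July 2015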

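Exploring Financial Risk via Text Mining Approaches

Chinese Abstract (translated)

In recent years, many studies have applied machine learning to the prediction of stock price movements and risk in finance. By analyzing stock prices, the textual information in financial reports, financial news, or even more timely tweets, various applications can assess investment risk and predict price movements to some degree. In this thesis, we focus on the textual information in financial reports and apply it to the problem of financial risk assessment. We use the text of financial reports to predict the risk level of listed companies, and we choose stock price volatility as the measure of financial risk. For text processing, we first use a finance-specific sentiment lexicon to improve the original text model. Sentiment analysis research indicates that sentiment words reflect the opinions in a document, or views on an event, more efficiently, so they can reduce the noise in textual information and improve the accuracy of predictions based on financial report text. Second, we bring numerical information such as stock prices and investment returns into the machine learning model in the form of weights: when learning the model, we weight the textual information of each company's financial report according to the company's numerical information, and we combine the textual information through support vector machines with different weight settings. Our experimental results show that the financial sentiment lexicon effectively represents the textual information in financial reports and that financial sentiment words are highly correlated with company risk. In the experiments that combine financial sentiment words with stock prices and returns as weights, the numerical information significantly improves the accuracy of risk prediction.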

Exploring Financial Risk via Text Mining Approaches

Abstract

In recent years, a number of studies have used machine learning techniques to predict stock tendency and investment risk in finance. There have also been applications that analyze the textual information in financial reports, financial news, or even tweets on social networks to provide useful information for stock investors. In this thesis, we focus on the problem of using the textual information in financial reports, together with numerical information about companies, to predict financial risk. We use the textual information in a company's financial report to predict its financial risk in the following year, and we use stock volatility to measure financial risk. In the first part of the thesis, we use a finance-specific sentiment lexicon to improve prediction models that are trained only on the textual information of financial reports, and we provide a sentiment analysis of the results. In the second part of the thesis, we combine the textual information with numerical information, such as stock returns, to further improve the performance of the prediction models. Specifically, in the proposed approach each company instance, represented by its financial textual information, is weighted by its stock returns using cost-sensitive learning techniques. Our experimental results show that models based on the finance-specific sentiment lexicon achieve performance comparable to that of models trained on the original texts, which confirms the importance of financial sentiment words for risk prediction. More importantly, the learned models suggest strong correlations between financial sentiment words and the risk of companies. In addition, our cost-sensitive models significantly improve on the cost-insensitive results. These findings identify the impact of sentiment words in financial reports and show that numerical information can be utilized as the cost weights of learning techniques.


Contents

Acknowledgements (致謝)
Chinese Abstract (中文摘要)
Abstract
1 Introduction
2 Related Work
    2.1 Financial Risk Prediction
    2.2 Sentiment Analysis
    2.3 Cost-Sensitive Classification
3 Methodology
    3.1 Definition of Financial Terms
        3.1.1 Daily Stock Returns
        3.1.2 Stock Return Volatility
    3.2 Finance-Specific Sentiment Lexicon
    3.3 Problem Formulation
        3.3.1 Regression Task
        3.3.2 Ranking Task
        3.3.3 Cost-Sensitive Task
4 Experimental Results
    4.1 Experimental Settings
        4.1.1 Dataset
        4.1.2 Extracted Features
        4.1.3 Evaluation Metrics
        4.1.4 Parameter Settings
    4.2 Finance-Specific Sentiment Lexicon Based Model
    4.3 Cost-Sensitive Based Model
5 Conclusions
Bibliography


List of Figures

4.1 Training instances in Regression Model
4.2 Training instances in Ranking Model
4.3 Training instances in Cost-sensitive Ranking Model
4.4 Number of Occurrences of the Top 10 Weighted Terms Learned
4.5 Highly-Weighted Terms Learned from the 6 Ranking Models of Using Original Texts (ORG) and Only Sentiment Words (SEN)
4.6 Comparison of the P-values of the Cost-Sensitive model in different weight and Sentiment model


List of Tables

3.1 Financial Sentiment Dictionary
4.1 Statistics of the Corpora
4.2 Statistics of the Financial Sentiment Lexicon
4.3 Experimental Results of Using Original Texts and Only Sentiment Words
4.4 Experimental Results of Sentiment Based and Cost-Sensitive Based Methods


Chapter 1 Introduction

In recent years, text sentiment analysis has become a popular research area because of the explosion of sentiment information on social web sites [15, 16]. More and more studies attempt to use machine learning techniques to predict stock tendency and investment risk in finance. Financial risks are uncertainties associated with any form of financing, including credit risk, business risk, and investment risk. Financial risk prediction is an important domain of financial analysis because it can help investors detect financial risks in advance and take appropriate actions to minimize their losses. Market sentiment is a term in finance that describes the general attitude of investors toward anticipated price development in a market. For example, if investors expect upward price movement in the stock market, the sentiment is said to be bullish. Market sentiment can therefore be acquired from news, Twitter, or financial reports, based on the textual information they contain. For most sentiment analysis algorithms, as mentioned in [6], the sentiment lexicon is the most important resource. In [14], the Harvard Psychosociological Dictionary, a common dictionary for general sentiment analysis, is extended into a finance-specific sentiment lexicon. As shown in [14], almost three-fourths of the words in the 10-K financial reports from 1994 to 2008 are classified differently by the general-purpose sentiment dictionary (Harvard Psychosociological Dictionary) and the finance-specific lexicon. Motivated by the importance of financial risk prediction and text sentiment analysis, we utilize a finance-specific sentiment lexicon to conduct textual analytics of financial reports for risk prediction in this study.

We use two kinds of information, soft and hard information [20], to predict financial risk. Soft information usually refers to text, including opinions, ideas, and market commentary, whereas hard information is recorded as numbers, such as the financial measures in financial reports. We attempt to predict the financial risk of companies by using the textual and numerical information in their financial reports. In this work, the risk measurement we use is the volatility of stock returns. Volatility measures the variation in the price of a financial instrument over time. Historic volatility is derived from the time series of past market prices. Volatility is also easy to interpret: high volatility means that the price is not stable. For an investor who has bought a risky product, higher volatility means a greater chance of a shortfall. Price volatility also presents opportunities to buy assets cheaply and sell them when they are overpriced.

This thesis attempts to mine financial reports to predict financial risk. To accomplish this goal, our approach is divided into two major parts.

1) To analyze financial risk with a finance-specific sentiment lexicon, we use only the words in the lexicon to represent the soft information in financial reports. Because sentiment analysis is the process of identifying people's opinions, we believe that a finance-specific sentiment lexicon can help capture the attitudes and emotional states of the writers of financial reports, and we carry out our sentiment analysis based on this assumption. In the thesis, the sentiment analysis experiments are conducted as both regression and ranking experiments with Support Vector Machines (SVM) [2]. We attempt to use soft information to explore the risk of companies. We then compare the models trained only on the finance-specific sentiment words occurring in financial reports with the models trained on the original texts of the reports; the comparison shows that the sentiment lexicon represents the textual information more efficiently. Moreover, most of the highly ranked terms in the sentiment lexicon model are highly related to financial risk, which makes the sentiment lexicon models more reasonable and readable.

2) To incorporate both hard and soft information into financial risk prediction, we propose a new approach to combining these two different types of information. In related fields, soft and hard information are mostly used together as features; that is, most previous works place both soft and hard information in a single document vector. However, we believe that soft and hard information are different in essence and that there should be other ways to incorporate them. Volatility shows some of the characteristics of a stock; for example, high volatility is usually associated with a low stock price. We therefore use the volatility of a stock to represent its individual character. To achieve this, we assign the instance of every company a different weight based on its hard information (volatility) in the past year. We employ the volatility as the weight of each instance in a cost-sensitive SVM, with the document vector built from the soft (text) information. According to the results of the first part, the highly weighted terms are more consistent with the ranking models than with the regression models, so for the second part we conduct experiments only on the ranking models. Due to the limitations of existing tools, we transform the cost-sensitive ranking problem into a cost-sensitive multi-class classification problem.

We use the 10-K Corpus [11] to conduct our experiments. This corpus contains many financial reports that provide a comprehensive overview of each company's business and financial conditions during the period 1996-2006. We use a finance-specific lexicon that consists of 6 word lists to represent the soft information in financial reports. The 6 word lists include 1,546 unique words. The hard information of the companies is crawled from Wharton Research Data Services and is linked to the companies in the financial reports by their CUSIP codes. We select only the companies with complete hard information in a year, which results in 6,649 companies in total. Finally, we compare the performance of the models trained on the original texts with that of the models trained only on sentiment words. In most cases, the models that use only sentiment words perform better than those that use the original texts. In addition, our cost-sensitive models significantly improve over the cost-insensitive models. These findings identify the impact of sentiment words in financial reports and show that numerical information can be utilized as the cost weights of learning techniques.


Chapter 2 Related Work

Risk prediction is an important issue in commercial business. It has been studied for years, especially in the financial sector. In computer science, machine learning techniques are used to extract useful information from data, upon which predictions can be made. More recently, machine learning has provided some promising ways to perform risk prediction, which helps us analyze textual information in more detail. To better capture the textual information, we conduct sentiment analysis to explore the relations between the texts in financial reports and risk. We also explore the possibility of combining hard and soft information, which forms the basis for our work on cost-sensitivity. In this chapter, different perspectives from selected relevant studies are reviewed.

2.1 Financial Risk Prediction

In recent years, owing to open data and computational improvements, more and more studies have used machine learning techniques to predict financial risk. Fung et al. [7] predicted the movement of stock prices based on news articles. Schumaker et al. [21] investigated a large volume of financial news articles and stock quotes covering the S&P 500 stocks, and examined a predictive machine learning approach for the analysis of financial news articles using several different textual representations. Other related studies include the prediction of short-term stock price movements after the release of financial reports [12] and the forecasting of the direction of stock index movements [17]. Computer scientists have also attempted to manage risk with novel methods that focus on the textual information in financial reports and newspapers rather than with traditional methods (i.e., assessments with cash instruments) [26]. Other prediction approaches are based on the numerical information in financial reports [3]. The work in [23] employs both numerical and textual information extracted from analysts' reports to predict stock price movements. Kogan et al. further formulated the risk prediction problem as a text regression problem [11]. These studies are examples of risk prediction in computer science. Most models developed in these previous studies are based on various types of distinct features. We propose an approach that explores the effect of using a finance-specific sentiment lexicon to represent textual information.

2.2 Sentiment Analysis

Sentiment analysis is the process of identifying people's attitudes and emotional states from language. It is sometimes also referred to as opinion mining. Sentiment words can reflect speakers' personal opinions more effectively than other words.

With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs (e.g., authors of tweets who write about their lives, share opinions on a variety of topics, and discuss current issues), the importance of sentiment analysis has been increasingly recognized. Pak and Paroubek [18] built a classifier that is able to determine positive, negative, and neutral sentiments in a given document. They performed a linguistic analysis of the collected Twitter corpora and explained the findings. Pang et al. [19] focus on methods that address the new challenges raised by sentiment-aware applications. Garcia [8] studied the effect of sentiment on asset prices during the first half of the 20th century, reporting that news content helps predict stock returns at the daily frequency even after controlling for other well-known time-series patterns.

There has already been some research, and there are applications, dedicated to sentiment words. Based on the associated results, the importance of sentiment words in text mining is evident. In finance, there have been studies [14] that apply textual analysis to examine the sentiment of numerous news items, articles, financial reports, and tweets about public companies. Motivated by the results and guidance provided by these previous studies, we conduct a sentiment analysis on financial risk.

2.3 Cost-Sensitive Classification

Cost-sensitive classification is an approach that acknowledges the fact that some types of misclassification may be worse than others. There are many cases in real life where different penalties apply to different misclassifications. For instance, consider recommending music to a subscriber who prefers jazz to popular music and likes country music least of all. Under these conditions, the cost of incorrectly predicting a jazz composition as country music should be significantly higher than the cost of misidentifying it as popular music. In our experiments, we assume that the cost is assigned based on the needs of the application.

Lin [13] extended the one-versus-one approach to the field of cost-sensitive classification and observed better performance with the improved approach than with the original one-versus-one cost-sensitive classification approach. Chen et al. [4] proposed a cost-sensitive learning vector quantization (LVQ) approach that incorporates cost information into the model. These techniques can be applied to many cases in the field of finance. For instance, with increasing competition and complexity in the credit industry, financial distress prediction is of crucial importance in credit risk analysis. Accurate predictions that minimize or eliminate misclassification costs are particularly critical in applications such as credit risk analysis and fraud detection. Recently, an increasing number of studies have applied cost information in different applications.

The cost-sensitive concept is also applied in our work. If high-risk companies are misclassified, the consequences are worse than those for low-risk companies. To verify this hypothesis, we focus on the risk associated with companies by using the hard information contained in their financial reports as the cost weights defined in cost-sensitive learning techniques.


Chapter 3 Methodology

The proposed approach first represents soft information with a finance-specific sentiment lexicon and analyzes the influence of sentiment in the resulting models. The result of this analysis is then integrated with hard information to predict risk with a cost-sensitive ranking approach. The following sections provide further details and the reasons for the proposed approach.

3.1 Definition of Financial Terms

3.1.1 Daily Stock Returns

Daily return is a ratio that assesses the profit or loss derived from an investment in a trading day. A trading day is defined as the duration for which a particular stock exchange is open. Return on investment is a measure of investment performance used by both professional and novice investors. By dividing the loss or gain on an investment over a period of one day by the original cost of the investment, potential investors can compare investment opportunities by examining the daily return percentage rate. Daily returns provide a general overview of one's investments, and the value indicates the associated gain or loss ratio. Positive daily return values represent gains, while negative daily return values represent investment losses. The results are easy to interpret, which is why the value is widely used by investors to evaluate investments. In this study, we calculate the daily return $R_i$ of a company on day $i$ using Equation (3.1):

$$R_i = \frac{P_i - P_{i-1}}{P_{i-1}}, \tag{3.1}$$

where $P_i$ is the closing price on day $i$ and $P_{i-1}$ is the closing price on day $i-1$.
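As a minimal sketch of Equation (3.1), the daily returns of a price series could be computed as follows. The function name `daily_returns` and the sample prices are illustrative assumptions and are not taken from the thesis.

```python
def daily_returns(closing_prices):
    """Return R_i = (P_i - P_{i-1}) / P_{i-1} for each pair of consecutive days.

    `closing_prices` is assumed to be a chronologically ordered list of
    positive closing prices; the name and input layout are hypothetical.
    """
    return [
        (today - yesterday) / yesterday
        for yesterday, today in zip(closing_prices, closing_prices[1:])
    ]


# Hypothetical closing prices for three consecutive trading days.
prices = [100.0, 110.0, 99.0]
returns = daily_returns(prices)  # a gain of about 10%, then a loss of about 10%
```

A series of n + 1 closing prices yields n daily returns, so a single price produces an empty list.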

3.1.2 Stock Return Volatility

In finance, volatility is a common risk metric measured by the standard deviation of a stock's returns over a period of time. We select volatility as the risk metric for the following reasons. It is closely related to the stock price because of how it is formulated. The volatility of an investment is also closely connected to risk because it reflects the tendency of prices. A stock with high volatility presents opportunities to buy assets at a lower cost and sell them when they are overpriced, because its price fluctuates intensely. A stock has low volatility when its price remains stable. In other words, volatility helps us understand the characteristics of an investment over time. Another advantage is that stock prices are easy to track. Moreover, previous studies have used stock return volatility as a risk metric [11], so we can extend their findings and compare the effectiveness of our approach with them.

In this study, all returns throughout the trading days are used to determine the stock return volatility of each company of interest. Let $S_t$ be the price of a stock at time $t$. Holding the stock from time $t-1$ to time $t$ results in a net return of $R_t = S_t/S_{t-1} - 1$ [25]. The volatility of returns for a stock from time $t-n$ to time $t$ is defined by Equation (3.2):

$$v_{[t-n,t]} = \sqrt{\frac{\sum_{i=t-n}^{t} (R_i - \bar{R})^2}{n}}, \tag{3.2}$$

where $\bar{R} = \sum_{i=t-n}^{t} R_i / (n+1)$.

The average number of trading days in the training data is 220 per year. Hence, time $t$ represents the date on which the financial report was published, and the value of $n$ is 220. The average stock return volatility is also used as the weight of companies in the cost-sensitive models; the weight tuning section describes how the weights are applied.

3.2 Finance-Specific Sentiment Lexicon
For most sentiment analysis algorithms, a sentiment lexicon suited to the field is an essential resource. As mentioned in [14], a general-purpose sentiment lexicon may misclassify common words in financial texts. Consequently, we apply the finance-specific lexicon provided in [14], which consists of 6 word lists, to analyze the relations between these sentiment words and financial risk.

Following the idea of sentiment analysis, we attempt to capture the writer's opinions in the financial reports. The working assumption is that sentiment words are used when the writer has a strong feeling about something or wants to place an emphasis on certain phenomena or events. Under this assumption, sentiment words are more powerful and precise than non-sentiment words. Therefore, using a finance-specific sentiment lexicon to represent soft information should be a better way of extracting information than using all of the soft information in financial reports. To examine this assumption, we compare the results of models trained using only the finance-specific sentiment lexicon with the results of models trained using all the words in the financial reports.

Table 3.1 lists examples of the different kinds of sentiment words.

Table 3.1: Financial Sentiment Dictionary

Class        Meaning                            Examples
Fin-Neg      Negative business terminologies    deficit, delist
Fin-Pos      Positive business terminologies    profit, integr
Fin-Unc      Words denoting uncertainty         doubt
Fin-Lit      Propensity for legal contest       amend, forbear
MW-Strong    Strong levels of confidence        must, best
MW-Weak      Weak levels of confidence          may, perhaps

Below we provide an original description from a 10-K report that contains a sentiment word. Consider the term "amend" from the Fin-Lit list. One excerpt is quoted from the original report as follows:

(from AGO, 2006 Form 10-K) On March 22, 2005, we amended the term loan agreements to, among other reasons, lower the borrowing rate by 25 basis points from LIBOR plus 2.00% to LIBOR plus 1.75%.

In finance, the term amend usually means "to change by some formal processes." The term indicates that a company is modifying its policies, and frequent modification is considered relatively high-risk. The excerpt shows that amend is a keyword that captures the onset of important events. Identifying finance-specific terms of this kind might help us capture important phenomena and events more frequently and effectively.
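The idea of representing a report only by its sentiment words can be sketched as a simple filter. This is an illustrative sketch under stated assumptions: the toy `LEXICON` holds only the example stems of Table 3.1 (the full word lists of [14] are not reproduced), the helper name `sentiment_term_counts` is hypothetical, tokens are assumed to be already lowercased and stemmed, and raw counts stand in for whatever term weighting the experiments use.

```python
from collections import Counter

# Toy lexicon built only from the example stems in Table 3.1; the full
# lexicon of [14] is much larger and is not reproduced here.
LEXICON = {
    "Fin-Neg": {"deficit", "delist"},
    "Fin-Pos": {"profit", "integr"},
    "Fin-Unc": {"doubt"},
    "Fin-Lit": {"amend", "forbear"},
    "MW-Strong": {"must", "best"},
    "MW-Weak": {"may", "perhaps"},
}


def sentiment_term_counts(tokens, lexicon=LEXICON):
    """Count only the tokens that appear in one of the lexicon word lists."""
    vocabulary = set().union(*lexicon.values())
    return Counter(token for token in tokens if token in vocabulary)


# Hypothetical report fragment, already stemmed and lowercased.
tokens = "we amend the loan agreement and may amend it again".split()
counts = sentiment_term_counts(tokens)  # keeps "amend" and "may" only
```

All non-lexicon tokens are dropped, so the resulting feature space is bounded by the size of the lexicon rather than by the vocabulary of the reports.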

3.3 Problem Formulation

3.3.1 Regression Task

Regression is a supervised learning technique in machine learning that estimates the relations among variables. An example application is modeling the relationship between the square footage and the price of a house. Known records (square footage and price) are used to train the regression model to capture the correlation between the two variables. When the trained regression model is given the square footage of a house as input, it can predict the price.

In this study, we explore the relationship between financial risk and the soft information presented in financial reports. By applying the regression technique, we can predict a company's future risk based on its financial report from the previous year. We formulate the problem as follows.

Given a collection of financial reports $D = \{d_1, d_2, \ldots, d_n\}$, in which each $d_i \in \mathbb{R}^p$ is associated with a company $c_i$, we seek to predict the company's future risk, which is characterized by its volatility $v_i$. Such a prediction can be defined by a parameterized function $f$ as follows:

$$\hat{v}_i = f(d_i; w). \tag{3.3}$$

The goal is to learn a $p$-dimensional vector $w$ from the training data $T = \{(d_i, v_i) \mid d_i \in \mathbb{R}^p, v_i \in \mathbb{R}\}$. Support Vector Regression (SVR) [22] is a popular technique for training such a regression model. SVR is trained by solving the following optimization problem:

$$\min_{w} V(w) = \frac{1}{2}\langle w, w \rangle + \frac{C}{n} \sum_{i=1}^{n} \max\left(|v_i - f(d_i; w)| - \epsilon,\ 0\right), \tag{3.4}$$

where $C$ is a regularization constant, $n$ is the number of documents, and $\epsilon$ controls the training error.

3.3.2 Ranking Task

Learning to rank is an important technique that is widely applied in information retrieval. A common application is a search engine that returns documents related to user-defined keywords.
The documents that are returned are ordered based on the correlation between the user's query and the documents. Documents are assigned a relative value that arranges them in a particular order within the ranking model. The ranking model aims to learn the relations between the assigned values and the features contained in the documents. It considers the relative order of all the training instances rather than the explicit value assigned to a specific instance.

In this part of the study, we explore the relative risks among the selected companies. Our goal is to rank companies by their stock return volatilities using their respective financial reports. Following the work in [24], let $S_t$ be the price of a stock at time $t$. Holding the stock from time $t-1$ to time $t$ results in a simple net return of $R_t = S_t/S_{t-1} - 1$ [25]. The volatility of returns for a stock from time $t-n$ to $t$ is defined as follows:

$$v_{[t-n,t]} = \sqrt{\frac{\sum_{i=t-n}^{t} (R_i - \bar{R})^2}{n}}, \tag{3.5}$$

where $\bar{R} = \sum_{i=t-n}^{t} R_i / (n+1)$.

The volatilities of $n$ stocks are then classified into $2\ell + 1$ risk levels, where $n, \ell \in \{1, 2, 3, \ldots\}$. Let $m$ and $s$ be the sample mean and the standard deviation of the logarithm of the volatilities of the $n$ stocks (denoted as $\ln(v)$). The distribution of $\ln(v)$ across companies tends to be bell-shaped [11]. Therefore, given a volatility $v$, we derive the risk level $r$ as

$$r = \begin{cases} \ell - k & \text{if } \ln(v) \in (a,\ m - usk], \\ \ell & \text{if } \ln(v) \in (m - us,\ m + us), \\ \ell + k & \text{if } \ln(v) \in [m + usk,\ b), \end{cases} \tag{3.6}$$

where $a = m - s(k+1)$ when $k \in \{1, \ldots, \ell-1\}$, $a = -\infty$ when $k = \ell$, $b = m + s(k+1)$ when $k \in \{1, \ldots, \ell-1\}$, $b = \infty$ when $k = \ell$, and $u$ is a positive real number. We classify the volatilities of company stock returns within a year into different risk levels, which can be considered as the relative difference of risk among the companies.
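Equations (3.5) and (3.6) can be sketched in a few lines of standard-library Python. This is a sketch under stated assumptions rather than the thesis implementation: the function names are hypothetical, the standard deviation of ln(v) is taken as the sample standard deviation, and the scale u is applied to every boundary, including the outer bounds a and b, so that the intervals tile the real line. With the default u = 1 this coincides with the bounds as written above.

```python
import math
import statistics


def volatility(returns):
    """Eq. (3.5): standard deviation of the n + 1 returns with divisor n."""
    return statistics.stdev(returns)


def risk_level(log_v, m, s, ell, u=1.0):
    """Eq. (3.6): map ln(v) to one of the 2*ell + 1 levels 0, ..., 2*ell.

    Assumption: the scale u multiplies every boundary, so the band edges
    are m +/- u*s*k for k = 1, ..., ell.
    """
    z = (log_v - m) / (u * s)
    if abs(z) < 1:
        return ell  # middle band (m - us, m + us)
    k = min(math.floor(abs(z)), ell)  # the outermost bands are unbounded
    return ell + k if z > 0 else ell - k


def assign_risk_levels(volatilities, ell=2, u=1.0):
    """Discretize the volatilities of several stocks into risk levels."""
    logs = [math.log(v) for v in volatilities]
    m = statistics.mean(logs)
    s = statistics.stdev(logs)  # sample standard deviation (assumption)
    return [risk_level(x, m, s, ell, u) for x in logs]
```

With ell = 2 the function yields five levels, 0 to 4, which correspond to the five risk classes used later if they are relabelled 1 to 5.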
After classifying the volatilities of the companies' stock returns into different risk levels, the ranking task can be defined as follows. Given a collection of financial reports $D$, we aim to rank the companies via a ranking model $f: \mathbb{R}^p \to \mathbb{R}$ such that the rank order of the set of companies is specified by the real value that the model $f$ takes. Specifically, $f(d_i) > f(d_j)$ is taken to mean that the model asserts that $c_i \succ c_j$, where $c_i \succ c_j$ means that $c_i$ is ranked higher than $c_j$; that is, the company $c_i$ is riskier than $c_j$. In this thesis, we adopt Ranking SVM [10] for the ranking task.

3.3.3 Cost-Sensitive Task

As described in [11], previous studies included both soft and hard information (i.e., the volatility of each company over the twelve months before the report) in document vectors. More specifically, this method evaluates soft (word) information together with hard (volatility) information. To accomplish this, both types of information are transformed into decimal values in the document vectors, but they remain essentially different. We are curious about the viability of combining hard and soft information and believe that there should be a better way to combine them. When an investor plans to buy stocks, he or she would like to evaluate the investment risk. It is difficult to evaluate risk as a precise real number; most risks are evaluated at relative levels. Therefore, we propose an approach that addresses this problem using the concept of cost-sensitive ranking. That is, we utilize hard information as the cost weights of cost-sensitive learning techniques.

Problem Transformation

Due to the limitations of existing cost-sensitive tools, we apply a classification technique instead of traditional ranking models for the cost-sensitive task. Cost-sensitive ranking is transformed into a task that predicts the risk level of each company and then ranks the companies based on their expected risk levels. High-risk companies are selected to verify whether the associated risk levels can be classified correctly. The 5 classes of risk levels are based on the stock return volatilities of the previous year, where 5 represents the highest risk level and 1 represents the lowest. Under most conditions, estimation errors that occur at higher risk levels are worse than those that occur at lower risk levels. For instance, if the prediction error is 5%, the expected values for risk levels 1 and 5 are 0.05 and 0.25, respectively. More incorrectly ranked pairs are observed when riskier companies are misclassified. To focus on risky companies, we apply the cost-sensitive SVM classifier.
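One way to make "incorrectly ranked pairs" concrete is to count the company pairs whose predicted risk levels disagree with the order of their true levels. The sketch below is an illustration only: the name `misordered_pairs` is hypothetical, and treating a predicted tie between two companies with different true levels as an error is an assumption rather than a rule stated in the text.

```python
from itertools import combinations


def misordered_pairs(true_levels, predicted_levels):
    """Count pairs with different true risk levels whose predicted levels
    are tied or in the opposite order (ties count as errors by assumption)."""
    count = 0
    for i, j in combinations(range(len(true_levels)), 2):
        true_diff = true_levels[i] - true_levels[j]
        pred_diff = predicted_levels[i] - predicted_levels[j]
        if true_diff != 0 and true_diff * pred_diff <= 0:
            count += 1
    return count


# Hypothetical example with three companies whose true levels are 5, 1 and 3.
perfect = misordered_pairs([5, 1, 3], [5, 1, 3])
high_risk_missed = misordered_pairs([5, 1, 3], [1, 1, 3])
```

In this toy example, misclassifying the level-5 company as level 1 disturbs both of the pairs it takes part in.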
The cost-sensitive SVM classification was formulated as follows:

$$
\min_{w,\,b,\,\xi} \quad \frac{1}{2} w^{T} w + C^{+} \sum_{y_i = 1} \xi_i + C^{-} \sum_{y_i = -1} \xi_i
\tag{3.7}
$$

subject to the following constraints on the dual variables $\alpha_i$:

$$
0 \le \alpha_i \le C^{+} \ \text{if } y_i = 1, \qquad
0 \le \alpha_i \le C^{-} \ \text{if } y_i = -1, \qquad
y^{T} \alpha = 0,
$$

where $C^{+}$ and $C^{-}$ are regularization parameters for positive and negative classes, respectively. The following section describes how the weights were tuned.

Weight Tuning

In cost-sensitive models, all instances (companies) are assigned a weight for the training process. The weight was determined with Equations (3.8) and (3.9). Let $CP_i = \{cp_{d-n}, cp_{d-(n-1)}, \ldots, cp_d\}$ be the collection of daily closing prices of a specific company $i$, where $cp_d \in \mathbb{R}^{+}$ is the closing price on the day the financial report was published. $DR_i = \{dr_{d-(n-1)}, dr_{d-(n-2)}, \ldots, dr_d\}$ is the collection of daily returns of company $i$, where $dr_n \in \mathbb{R}$ is the daily return on day $n$; it is calculated with Equation (3.1), based on the $CP_i$ of each company $i$. Let $CW = \{cw_1, cw_2, \ldots, cw_i\}$ be the collection of company weights, where $cw_i \in \mathbb{R}^{+}$ is the weight of company $i$. The polarity weight $pw$ controls the different weights for positive and negative returns.

Expected values are calculated with Equation (3.8) to rank financial risk for two reasons. Firstly, volatility is associated with financial risk [5]; high-risk companies are readily assigned high weight values when the polarity weight is not considered. Therefore, the higher the volatility, the higher the assigned weight. This particular configuration focuses on high-risk companies, which improves the predictive accuracy of high-risk levels and effectively reduces the total number of prediction errors. Secondly, inspired by [1], we examine whether low stock returns are associated with increased volatility with Equation (3.9).

$$
cw_i = \frac{\sum_{n=0}^{d-1} |dr_n| \cdot pw_n}{|DR_i|}
\tag{3.8}
$$

$$
pw_n =
\begin{cases}
\alpha, & \text{if } dr_n < 0 \\
\beta, & \text{if } dr_n > 0
\end{cases}
\tag{3.9}
$$

High and low daily returns are a relative concept in our model; hence, negative daily returns are considered as low daily returns. To identify whether financial risk correlates with negative returns, the polarity weight $pw$, which is controlled by $\alpha$ and $\beta$ ($\alpha \in \mathbb{R}^{+}$ and $\beta \in \mathbb{R}^{+}$), is multiplied with both negative and positive returns to observe the respective effects of the two variables. If the absolute value of the positive returns is greater than the absolute value of the negative returns in $DR_i$, $cw_i$ increases when $\alpha < \beta$.
In contrast, $cw_i$ decreases when $\alpha > \beta$. Companies are assigned different weights based on $CW$. The cost-sensitive model then focuses on the companies with higher assigned weights to reduce the error penalty.

Pair Ranking

The workflow is to 1) assign risk levels to the selected companies based on their stock return volatilities, 2) set representative weights based on weight tuning, 3) use cost-sensitive classification outputs to calculate expected values, and 4) rank the selected companies based on the expected values. To obtain expected values for ranking pairs, the expected risk level of each company is calculated as follows:

$$
EXP_{\mathrm{Company}_i} = \sum_{n=1}^{5} Pro_n \cdot n,
\tag{3.10}
$$

where $Pro_n$ represents the probability that company $i$ is classified at risk level $n$. The probability $Pro_n$ is the output of the cost-sensitive classification algorithm. Generally, using classification results to calculate an expected value is inappropriate, because classification addresses the problem of identifying which category a new observation belongs to, and there are no numerical relationships among classification categories. Fortunately, the categories defined in our study also represent the risk levels of companies. Consequently, the application of Equation (3.10) is considered to be reasonable in our case.
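The weight tuning and pair ranking steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: daily returns are taken as given (Equation (3.1) is not reproduced here), the sum in Equation (3.8) is taken over all supplied returns, and the function names and the toy probabilities are hypothetical.

```python
from typing import Dict, List, Sequence


def company_weight(daily_returns: Sequence[float], alpha: float, beta: float) -> float:
    """Equation (3.8): average of |dr_n| * pw_n over the daily returns.

    Following Equation (3.9), pw_n is alpha for negative returns and beta for
    positive returns. A zero return contributes nothing either way.
    """
    if not daily_returns:
        raise ValueError("daily_returns must not be empty")
    total = 0.0
    for dr in daily_returns:
        if dr < 0:
            total += abs(dr) * alpha
        elif dr > 0:
            total += abs(dr) * beta
    return total / len(daily_returns)


def expected_risk_level(probabilities: Sequence[float]) -> float:
    """Equation (3.10): sum of Pro_n * n, with risk levels n starting at 1."""
    return sum(p * level for level, p in enumerate(probabilities, start=1))


def rank_by_expected_risk(class_probs: Dict[str, Sequence[float]]) -> List[str]:
    """Return company ids ordered from highest to lowest expected risk level."""
    return sorted(
        class_probs,
        key=lambda company: expected_risk_level(class_probs[company]),
        reverse=True,
    )


# Hypothetical class probabilities for three companies over risk levels 1..5.
toy_probs = {
    "A": [0.7, 0.2, 0.1, 0.0, 0.0],  # expected level 1.4
    "B": [0.0, 0.1, 0.2, 0.3, 0.4],  # expected level 4.0
    "C": [0.2, 0.2, 0.2, 0.2, 0.2],  # expected level 3.0
}
toy_ranking = rank_by_expected_risk(toy_probs)  # ["B", "C", "A"]
```

Because the expected value multiplies each probability by its level, probability mass misplaced at level 5 moves the score five times as far as the same mass misplaced at level 1, which is why errors on risky companies disturb more ranked pairs.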

Chapter 4

Experimental Results

4.1 Experimental Settings

In our experiments, the target value is the volatility of the stock returns of the companies in the next year, and our training data is comprised of the soft and hard information presented in the financial reports. Specifically, the training data is obtained from financial reports published in the past five years, and the reports of the following year are defined as the test data. That is, if the training data consist of reports from 1996 to 2000, the test data is defined as the reports published in 2001. While previous studies focused on the application of regression [11], our work extends the use of this technique to include a ranking approach. In addition, we proposed additional considerations for sentiment and cost-sensitive approaches. These approaches are illustrated in Figures (4.1), (4.2), and (4.3). Comparisons were conducted between the proposed and standard approaches and will be further described in this section.

Figure 4.1: Training instances in Regression Model

Figure 4.2: Training instances in Ranking Model

Figure 4.3: Training instances in Cost-sensitive Ranking Model

4.1.1 Dataset

The experiments were performed on a real-world dataset collected from 10-K forms. Federal securities laws require companies to share this information on a regular basis. In particular, the 10-K form is an annual report required by the Securities and Exchange Commission (SEC). These reports are available to the public and are published on the SEC's website. The real-world dataset is provided in [11]. It contains 54,379 reports that were published between 1996 and 2006 by 10,492 different companies. Each report has a date of publication, which is important for tracking the associated hard information. The statistics on the corpora and financial lexicon are shown in Tables (4.1) and (4.2). In the sentiment analysis experiments, our dataset is identical to the one used in the previous study [11]. Owing to the restrictions of the cost-sensitive approach, some companies with incomplete information were excluded. To accomplish this, we took a subset of the full dataset that included only companies with complete records of daily stock returns.

4.1.2 Extracted Features

Soft Information

In our study, we used the bag-of-words model to represent the 10-K reports and selected two types of word features. Given a document $d$, the word features (i.e., TFIDF and LOG1P) were calculated as follows:

• $\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, d) = \mathrm{TC}(t, d)/|d| \times \log(|D| / |\{d \in D : t \in d\}|)$,
• $\mathrm{LOG1P}(t, d) = \log(1 + \mathrm{TC}(t, d))$.

Here $\mathrm{TC}(t, d)$ represents the term count of $t$ in $d$, $|d|$ is the length of document $d$, and $D$ represents the set of all documents in each year. Note that IDF is computed from the documents in a single year because the document frequency of a specific word may vary

across different years. Following [11], we also used the logarithm of the volatility of the twelve months before the report (i.e., $\log(v^{-(12)})$) as an additional feature. The models trained with this additional feature are identified as TFIDF+ and LOG1P+ hereafter. In the sentiment analysis and cost-sensitive approaches, only the words in the financial sentiment lexicon were taken into account; for example, words such as delist, default, amend, and profit were used in the bag-of-words models.

| Year | # of Documents | # of Unique Terms |
|------|----------------|-------------------|
| 1996 | 1,406 | 19,613 |
| 1997 | 2,260 | 26,039 |
| 1998 | 2,461 | 29,020 |
| 1999 | 2,524 | 30,359 |
| 2000 | 2,424 | 30,312 |
| 2001 | 2,596 | 32,292 |
| 2002 | 2,845 | 38,692 |
| 2003 | 3,611 | 48,513 |
| 2004 | 3,558 | 50,674 |
| 2005 | 3,474 | 53,388 |
| 2006 | 3,306 | 51,147 |

Table 4.1: Statistics of the Corpora

| Dictionary | # of Words | # of Stemmed Words |
|------------|------------|--------------------|
| Fin-Neg | 2,349 | 918 |
| Fin-Pos | 354 | 151 |
| Fin-Unc | 291 | 127 |
| Fin-Lit | 871 | 443 |
| MW-Strong | 19 | 10 |
| MW-Weak | 27 | 15 |
| Total | 3,911 | 1,664 |

Table 4.2: Statistics of the Financial Sentiment Lexicon

Hard Information

CUSIP is a nine-character alphanumeric code that identifies a North American financial security to facilitate the clearance and settlement of trades. CUSIP can be used like an ID to help track companies. The daily closing prices of the companies in the 10-K reports were retrieved from Wharton Research Data Services using CUSIP codes. By applying

Equations (3.8) and (3.9), all of the daily closing prices were used to determine cost weights in the cost-sensitive approach.

Financial Sentiment Lexicon

For most sentiment analysis algorithms, a well-defined sentiment lexicon is the most crucial resource [6]. A general-purpose sentiment lexicon may misclassify common words in financial texts. For this reason, a finance-specific sentiment lexicon was used to represent the soft information in financial reports. This sentiment lexicon is written by financial experts to ensure the accuracy of the meanings of the words as they are used in the field of finance.

4.1.3 Evaluation Metrics

We employed three metrics to evaluate prediction performance. For the regression task, we used the Mean Squared Error (MSE):

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \log(v_i^{+(12)}) - \log(\hat{v}_i^{+(12)}) \right)^2,
\tag{4.1}
$$

where $n$ is the number of tested companies. MSE measures the average of the squares of the errors, i.e., the difference between the true stock return volatility and the estimated stock return volatility in the regression task.

For the ranking task, two rank correlation metrics were used to evaluate the performance of our experiments: Spearman's Rho and Kendall's Tau. Given two ranked lists $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$,

$$
\mathrm{Rho} = 1 - \frac{6 \sum_{i} (x_i - y_i)^2}{n(n^2 - 1)}
\tag{4.2}
$$

$$
\mathrm{Tau} = \frac{\#\text{concordant pairs} - \#\text{discordant pairs}}{0.5 \cdot n \cdot (n - 1)}
\tag{4.3}
$$

For Kendall's Tau, any pair of observations $(x_i, y_i)$ and $(x_j, y_j)$ is concordant if the ranks for both elements agree; that is, if both $x_i > x_j$ and $y_i > y_j$, or if both $x_j > x_i$ and $y_j > y_i$. The pair is discordant if the ranks disagree; that is, if $x_i > x_j$ and $y_i < y_j$, or if $x_i < x_j$ and $y_i > y_j$. If $x_i = x_j$ or $y_i = y_j$, the pair is neither concordant nor discordant.

4.1.4 Parameter Settings

In the sentiment analysis model, a linear kernel for the regression task is adopted with $\epsilon = 0.1$, and the trade-off $C$ is set to the default value of SVMlight [9].
These settings are comparable to those applied in [11]. To rank the companies, a linear kernel is adopted

with $C = 1$ and all the other parameters were set to the default values of SVMRank [10]. In the cost-sensitive model, the ranking task utilizes the default values of SVMWeight [2].

4.2 Finance-Specific Sentiment Lexicon Based Model

The word feature LOG1P+ was selected for the regression task and TFIDF+ was selected for the ranking task, following the recommendations given in [11] and [24]. Models trained on the original texts are listed in the ORG rows, and models trained on the sentiment lexicon are listed in the SEN rows. The performance of the ORG model (original texts) and the SEN model (only sentiment words) is listed in Table (4.3). The boldface numbers in Table (4.3) represent the best results between the two approaches. We observed that most of the models trained using the sentiment approach performed better than those trained using the original textual information.

| Task (Features) | Metric | Model | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | Micro-avg |
|---|---|---|---|---|---|---|---|---|---|
| Regression (LOG1P+) | Mean Squared Error | ORG | **0.18082** | 0.17175 | 0.17157 | 0.12879 | 0.13038 | 0.14287 | 0.15271 |
| | | SEN | 0.18506 | **0.16367** | **0.15795** | **0.12822** | **0.13029** | **0.13998** | **0.14894** |
| Ranking (TFIDF+) | Kendall's Tau | ORG | 0.62173 | **0.63626** | 0.58528 | **0.59350** | 0.59651 | 0.57641 | 0.59965 |
| | | SEN | **0.63349** | 0.62280 | **0.60527** | 0.59017 | **0.60273** | **0.58287** | **0.60458** |
| | Spearman's Rho | ORG | 0.65271 | **0.66692** | 0.61662 | **0.62317** | 0.62531 | 0.60371 | 0.62939 |
| | | SEN | **0.66397** | 0.65303 | **0.63646** | 0.61953 | **0.63133** | **0.60999** | **0.63403** |

Table 4.3: Experimental Results of Using Original Texts and Only Sentiment Words

The results are summarized below:

• Less noise occurred in the training data. In machine learning, the training data often contains significant amounts of noise, which lowers the precision of the prediction. Most data scientists seek solutions to minimize or eliminate data noise; the method used to select training data with less noise is an important issue.
Sentiment lexicon based models follow the standard methods, but use different textual information as features. To take advantage of expert knowledge in selecting appropriate features, only finance-specific sentiment words found in financial reports were used to represent soft information. A document is represented by the words occurring in both the financial reports and the finance-specific sentiment lexicon. Consequently, the number of textual features was reduced from 107,153 to 1,664.

• Finance-specific sentiment words are representative features. Sentiment analysis is extremely useful in big data as it allows us to gain an overview of a wider range of public opinions about certain topics. Since the sentiment lexicon is written by financial experts, it reflects the meaning of the words as they are used in the field of finance. Additionally, labeled data is beneficial in the training process, and data labeled with expert knowledge is still more reliable than data labeled automatically (i.e., by machines). In our case, the words were classified into 6 classes, including positive, negative, etc. The sentiment lexicon captures insights from financial experts, and these insights can then be correlated to shifts in the stock market.

Term Consistency of Regression and Ranking

In Figure (4.4), the top 10 learned words from both ranking (TFIDF+) and regression (LOG1P+) models trained on sentiment words are listed. The figure also lists the number of times these words occurred in the six corresponding regression or ranking models. The notation * denotes that, besides the term "concern", there are other terms that occur only once among the 6 ranking models. These terms include: breach, profit, violat, regain, uncomplet, accid, abl, integr, doubt, and grantor. Similarly, for the notation ^, the terms include: incorrectli, fault, nondisclosur, misus, breakag, defalc, excit, unclear, sentenc, overdu, omit, inforc, irrevoc, unencumb, further, variant, precipit, libel, and loss. From Figure (4.4), we can observe that amend, deficit, and forbear appear in all the 6 models. Moreover, 7 words from the ranking models occur in the top 10 weighted terms more than 4 times, while only 3 words from regression models occur in the same list.
The terms shown in Figure (4.4) provide a good interpretation of the results and also validate the effectiveness of the adopted ranking models in capturing the relationships between financial risk and textual information; these results suggest that ranking models may be a more suitable way of capturing financial information than regression models [24].

Financial Sentiment Terms Analysis

In Figure (4.5), the top 5 terms are marked by numbers from 1 to 5 according to their average weights. The top 5 average weighted terms of the models trained only with sentiment words (SEN) are amend, deficit, forbear, delist, and default, whereas those of the models trained with the original texts (ORG) are ceg, nasdaq, gnb, coven, and forbear. An interesting finding is that when the models are trained with original texts, some less informative terms like ceg (a company name, Co-Energy Group), nasdaq (an American stock exchange), and gnb (a company name, GNB Technologies) were highly ranked; however, the relationship between these terms and financial risk is considered to be weak. In

contrast, since only sentiment words were used to train the SEN models, it is more reasonable to consider their top terms to be highly related to financial risk.

[Figure 4.4: bar chart. Y-axis: number of occurrences of the top 10 weighted terms among the 6 models (0 to 7). Series: Ranking, Regression. X-axis terms: ^incorrectli, fide, noncompli, incompat, inasmuch, encumbr, unreimburs, brilliant, disappoint, cutback, malfunct, indefeas, *concern, sever, disput, benefici, unabl, wherebi, discontinu, sureti, delist, default, forbear, deficit, amend.]

Figure 4.4: Number of Occurrences of the Top 10 Weighted Terms Learned

| Task (Features) | Metric | Model | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | Micro-avg |
|---|---|---|---|---|---|---|---|---|---|
| Ranking | Kendall's Tau | SEN | 0.62537 | 0.62815 | 0.60436 | 0.58887 | 0.60226 | 0.58605 | 0.60584 |
| | | COST | **0.62557** | **0.63328** | **0.604504** | **0.58995** | **0.60950** | **0.58768** | **0.60841** |
| | Spearman's Rho | SEN | 0.62538 | 0.62815 | 0.60437 | 0.58887 | 0.60227 | 0.58605 | 0.60585 |
| | | COST | **0.65601** | **0.66352** | **0.63537** | **0.61922** | **0.63745** | **0.61369** | **0.63754** |

Table 4.4: Experimental Results of Sentiment Based and Cost-Sensitive Based Methods

4.3 Cost-Sensitive Based Model

The models that were trained on the sentiment lexicon are listed in the SEN rows. COST is the cost-sensitive approach based on SEN. Table (4.4) lists the results obtained with SEN and COST. Because of the restrictions of the collected data, the training data used in the cost-sensitive experiments differs from that used in the sentiment experiments. As a result, the SEN baseline presented in Table (4.3) differs from that in Table (4.4). However, all baselines were produced in the same way. The best results are presented in boldface in Table (4.4). All of the results in the COST rows are better than those in the SEN rows. This result shows a statistically significant improvement when using the proposed model.
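The rank correlations reported in the result tables follow Equations (4.2) and (4.3). A minimal sketch of both metrics is given below; it assumes that the two inputs hold the ranks of the same items in the same order and that there are no ties, and the function names are illustrative rather than taken from any evaluation tool used in the thesis.

```python
from typing import Sequence


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation (4.2): 1 - 6 * sum((x_i - y_i)^2) / (n * (n^2 - 1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    squared_diff = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return 1.0 - 6.0 * squared_diff / (n * (n * n - 1))


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation (4.3): (#concordant - #discordant) / (0.5 * n * (n - 1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Positive product: both lists order the pair the same way.
            agreement = (x[i] - x[j]) * (y[i] - y[j])
            if agreement > 0:
                concordant += 1
            elif agreement < 0:
                discordant += 1
            # A product of zero is a tie and counts as neither.
    return (concordant - discordant) / (0.5 * n * (n - 1))


# Toy example (hypothetical ranks): one adjacent pair is swapped.
true_ranks = [1, 2, 3, 4]
predicted_ranks = [1, 3, 2, 4]
toy_rho = spearman_rho(true_ranks, predicted_ranks)  # 1 - 12/60 = 0.8
toy_tau = kendall_tau(true_ranks, predicted_ranks)   # (5 - 1) / 6
```

The pairwise loop is quadratic in the number of companies, which is acceptable for a sketch; a production evaluation would normally rely on an existing statistics library instead.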

[Figure 4.5: diagram of the highly weighted terms learned by the ORG and SEN models, with the top 5 terms of each model numbered 1 to 5. Legend: Fin-Neg, Fin-Pos, Fin-Lit, Fin-Unc, Non. Stemmed terms are annotated with their word forms: default (default, defaulted, defaulting, defaults); delist (delist, delisted, delisting, delists); forbear (forbear, forbearance, forbearances, forbearing, forbears); deficit (deficit, deficits); amend (amend, amendable, amendatory, amended, amending, amendment, amendments, amends).]

Figure 4.5: Highly-Weighted Terms Learned from the 6 Ranking Models of Using Original Texts (ORG) and Only Sentiment Words (SEN)

Influence of Weight on Ranking Pairs

We used Equation (3.10) to calculate expected values and to rank the pairs based on these values. Thus, the impact of misclassifying high-risk level companies is greater than that of misclassifying low-risk level companies. For example, the chance of errors occurring at high-risk levels is greater than that of errors occurring at low-risk levels. We assume that the differences are 0.5 and 0.1 and that the number of pairs in the range of 0.5 would be greater than the number in the range of 0.1. The errors that occur at high-risk levels produce more discordant pairs. In cost-sensitive learning techniques, the weight of an instance represents its importance in the training data. As described in Equation (3.6), companies are assigned different risk levels according to the volatilities of their stock returns. Risky companies are assigned a higher weight based on the relationships described in Equation (3.8).

Analysis of Instance Weight

Based on Equations (3.8) and (3.9), the influence of positive and negative returns was evaluated.
To determine the significance of the result, permutation tests were conducted with different parameter settings. Negative and positive returns were assigned different multipliers to verify whether there were differences in their respective strengths. In these

experiments, the sum of $\alpha$ and $\beta$ is 200; there are a total of 201 tests: $(\alpha = 0, \beta = 200)$, $(\alpha = 1, \beta = 199)$, ..., $(\alpha = 200, \beta = 0)$. Figure (4.6) illustrates a part of the 201 tests (from $(\beta = 102, \alpha = 98)$ to $(\beta = 152, \alpha = 48)$).

[Figure 4.6: plot of P-values. Y-axis: P-value (0 to 0.14).]

Figure 4.6: Comparison of the P-values of the Cost-Sensitive Model under Different Weights and the Sentiment Model

By using Equations (3.8) and (3.9) to set the weights of instances, consistently good results were obtained, which outperform those trained with SEN. In this section, all of the COST results are better than the SEN results, and 60% of them passed the permutation tests at a significance level of 0.05. In general, these results demonstrate that using daily returns as cost weights may be an effective way to combine soft and hard information instead of placing both of them into a document vector. We also verified the observation in [1] that "low stock returns are associated with increased volatility". Furthermore, we observed that the number of results where the Cost-Sensitive model outperformed the Sentiment model was higher when $\beta > \alpha$ than when $\beta < \alpha$. Our results were consistent with those from the financial analysis.
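The thesis does not specify which permutation scheme was used, so the following is only a sketch of one common choice: an exact paired sign-flip permutation test on the differences between two sets of matched scores. The function name, the choice of the absolute sum of differences as the test statistic, and the toy scores are all assumptions made for illustration.

```python
from itertools import product
from typing import Sequence


def paired_permutation_p_value(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Two-sided exact sign-flip test on paired score differences.

    Under the null hypothesis the two systems are interchangeable, so the sign
    of each paired difference can be flipped freely. The p-value is the share
    of all sign assignments whose absolute summed difference is at least as
    large as the observed one. Enumeration is exponential in the number of
    pairs, so this is only practical for small inputs.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point noise in the sums.
        if permuted >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Toy scores (hypothetical): system A beats system B on all six pairs.
toy_a = [3, 4, 5, 6, 7, 8]
toy_b = [2, 2, 2, 2, 2, 2]
toy_p = paired_permutation_p_value(toy_a, toy_b)  # 2 of 64 sign patterns -> 0.03125
```

When one system wins on every pair, only the all-positive and all-negative sign patterns reach the observed statistic, so six pairs give a p-value of 2/64, which falls below the 0.05 level used in the text.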


Chapter 5

Conclusions

In conclusion, the first part of this study demonstrates the importance of the sentiment lexicon in financial reports associated with financial risks. The sentiment method uses far fewer words to train the SVM models (i.e., regression and ranking) and is capable of producing comparable results. That is, the finance-specific lexicon is able to better represent the occurrence of the terminology that is commonly found in financial reports. It also provides more meaningful information to investors. In the results of the models trained with ORG and SEN, ORG identified representative words like "ceg", "nasdaq", and "gnb", whereas SEN provided representative words like "amend", "deficit", and "profit". The terms identified with SEN are more intuitive than those identified with ORG in explaining why they are associated with stock risk.

The second part proposes a new approach to combine the soft and hard information. The cost-sensitive ranking task is transformed into a cost-sensitive multi-class classification problem, and the companies are then ranked by their expected values. In addition, we examined the influence of the learned weights, and we also found that "low stock returns are associated with increased volatility", which is consistent with the descriptions in [1]. The cost-sensitive ranking method can result in a significant improvement. Overall, these findings provide us with more ideas and a better understanding of the soft and hard information in financial reports. In addition, the results suggest that the hard information can be utilized as the cost weights of learning techniques. In future work, we plan to extend the set of sentiment words associated with stock risks. We will also explore how to apply deep-learning techniques to this problem in order to learn more fine-grained financial sentiment lexicons.


Bibliography

[1] J. Bae, C.-J. Kim, and C. R. Nelson. Why are stock returns and volatility negatively correlated? Journal of Empirical Finance, 14(1):41–58, 2007.

[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[3] K.-T. Chen, T.-J. Chen, and J.-C. Yen. Predicting future earnings change using numeric and textual information in financial reports. In Intelligence and Security Informatics, pages 54–63. Springer, 2009.

[4] N. Chen, A. S. Vieira, J. Duarte, B. Ribeiro, and J. C. Neves. Cost-sensitive learning vector quantization for financial distress prediction. In Progress in Artificial Intelligence, pages 374–385. Springer, 2009.

[5] R. Engle. Risk and volatility: Econometric models and financial practice. American Economic Review, pages 405–420, 2004.

[6] R. Feldman. Techniques and applications for sentiment analysis. Communications of the ACM, 56(4):82–89, 2013.

[7] G. P. C. Fung, J. X. Yu, and W. Lam. Stock prediction: Integrating text mining approach using real-time news. In Computational Intelligence for Financial Engineering, pages 395–402. IEEE, 2003.

[8] D. Garcia. Sentiment during recessions. The Journal of Finance, 68(3):1267–1300, 2013.

[9] T. Joachims. Making large scale SVM learning practical. Technical report, Universität Dortmund, 1999.

[10] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

[11] S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith. Predicting risk from financial reports with regression. In The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 272–280. ACL, 2009.

[12] A. J. Lee, M.-C. Lin, R.-T. Kao, and K.-T. Chen. An effective clustering approach to stock market prediction. In Pacific Asia Conference on Information Systems, pages 345–354, 2010.

[13] H.-T. Lin. A simple cost-sensitive multiclass classification algorithm using one-versus-one comparisons. National Taiwan University, Tech. Rep, 2010.

[14] T. Loughran and B. McDonald. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1):35–65, 2011.

[15] S. M. Mohammad and P. D. Turney. Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 26–34. ACL, 2010.

[16] R. Narayanan, B. Liu, and A. Choudhary. Sentiment analysis of conditional sentences. In Conference on Empirical Methods in Natural Language Processing, volume 1, pages 180–189. ACL, 2009.

[17] A. Nikfarjam, E. Emadzadeh, and S. Muthaiyah. Text mining approaches for stock market prediction. In International Conference on Computer and Automation Engineering, volume 4, pages 256–260. IEEE, 2010.

[18] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In Language Resources and Evaluation Conference, volume 10, pages 1320–1326, 2010.

[19] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.

[20] M. A. Petersen. Information: Hard and soft. Technical report, working paper, Northwestern University, 2004.

[21] R. P. Schumaker and H. Chen. Textual analysis of stock market prediction using breaking financial news: The AZFinText system. ACM Transactions on Information Systems, 27(2):1–29, 2009.

[22] A. Smola and V. Vapnik. Support vector regression machines. Advances in Neural Information Processing Systems, 9:155–161, 1997.

[23] S. Takahashi, M. Takahashi, H. Takahashi, and K. Tsuda. Analysis of stock price return using textual data and numerical data through text mining. In Knowledge-Based Intelligent Information and Engineering Systems, pages 310–316. Springer, 2006.

[24] M.-F. Tsai and C.-J. Wang. Risk ranking from financial reports. In Advances in Information Retrieval, pages 804–807. Springer, 2013.

[25] R. S. Tsay. Analysis of financial time series, volume 543. John Wiley & Sons, 2005.

[26] B. Wuthrich, V. Cho, S. Leung, D. Permunetilleke, K. Sankaran, and J. Zhang. Daily stock market forecast from textual web data. In International Conference on Systems, Man, and Cybernetics, volume 3, pages 2720–2725. IEEE, 1998.
