Introduction

National Chengchi University (國立政治大學)

CHAPTER 1 Introduction

1.1 Background

Since data mining was famously applied to uncover the relationship between diaper and beer purchases, algorithms of many kinds have discovered useful patterns in financial applications that had never been imagined. Traditional empirical research in Finance and Accounting deals only with numeric or nominal data. In contrast, our work proposes methods that combine quantitative financial data with textual content to examine the relationships between annual earnings and subjective expressions in U.S. financial statements. Today, far more textual information than numeric data is publicly available. Li concluded that textual information is much more costly to process than simple numeric data, and found that the annual financial statements of U.S. public companies contain thousands of words [14]. To retrieve relevant and important knowledge from this large body of text, we use information retrieval and knowledge discovery methods to extract useful textual patterns. Our work comprehensively integrates domain knowledge from Opinion Mining, Sentiment Analysis, Natural Language Processing, Machine Learning, Finance, and Accounting.


Opinion mining and sentiment analysis have been widely discussed not only in Computer Science but also in Finance. We propose a way to automatically expand Loughran and McDonald’s single-word lists [18], which are linked to positive and negative financial statuses, into subjective multi-word expressions (MWEs).

Loughran and McDonald developed positive and negative single-word lists that reflect the tone of U.S. financial statements (10-K filings) better than the words in traditional psychological dictionaries do. For example, Tetlock (2007) [34] and Tetlock et al. (2008) [35] used the weak and negative word categories of the Harvard-IV-4 psychological dictionary from the General Inquirer web site in their research. Loughran and McDonald also examined the linkage between the textual statements and financial figures, including 10-K filing returns, trading, and unexpected earnings.

Subjective assertions in financial statements influence the judgments of market participants when they evaluate the value and profitability of the reporting corporations.

Hence, corporate management may attempt to conceal negative statuses and emphasize positive ones with “prudent” wording. To uncover the truth from long reports and an overwhelming amount of numbers, we integrated artificial intelligence techniques to investigate the linkage between financial statuses and subjective multi-word expressions (MWEs).

1.2 Methodology overview

We defined opinion patterns to include opinion holders and subjective multi-word expressions (MWEs). For instance, the opinion patterns in the sentence “The Company believes the profits could be adversely affected” include the opinion holder “The Company” and two subjective expressions: “believe” and “could be adversely affected”.
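The example sentence can be written down as a labelled token sequence, which is the usual input and output format for sequence models. The sketch below is only an illustration: the BIO tag names (`HOLDER`, `SUBJ`) and the helper `decode_spans` are assumptions made here, not the annotation scheme used in this work.

```python
from typing import List, Optional, Tuple

# Hypothetical tag set: HOLDER marks opinion holders, SUBJ marks
# subjective expressions. B- opens a span, I- continues it, O is outside.
tokens = ["The", "Company", "believes", "the", "profits",
          "could", "be", "adversely", "affected"]
labels = ["B-HOLDER", "I-HOLDER", "B-SUBJ", "O", "O",
          "B-SUBJ", "I-SUBJ", "I-SUBJ", "I-SUBJ"]


def decode_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labelled tokens into (span type, span text) pairs."""
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_words: List[str] = []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new span starts; close any span that is still open.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_words.append(token)
        else:
            # "O" or an inconsistent I- tag closes the open span.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


patterns = decode_spans(tokens, labels)
```

Here `patterns` holds one holder span ("The Company") and two subjective spans ("believes" and "could be adversely affected"), matching the opinion patterns described for this sentence.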

Our work can be divided into two parts: experiments on opinion pattern identification, and empirical studies of the relationships between annual earnings and the textual content of U.S. financial statements.

First, we propose a computational procedure to model the text in financial statements, and employ conditional random field (CRF) models to identify opinion patterns and extract subjective opinion patterns. We used a variety of linguistic features, including morphology, orthography, predicate-argument structure, syntax, and simple semantics. To tune and evaluate the CRF models effectively, we trained and tested them on the annotated MPQA corpus [21]. The goals of the first part are to identify opinion holders and to extract subjective opinion patterns in the form of multi-word expressions (MWEs). We then employed the best-performing CRF model found in a sequence of experiments to identify opinion holders and extract subjective opinion patterns in U.S. financial statements.
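To make the idea of per-token linguistic features concrete, the following minimal sketch builds the kind of feature dictionaries that CRF toolkits commonly accept. It covers only simple orthographic and morphological cues plus neighbouring words; predicate-argument, syntactic, and semantic features would require a parser and are omitted. All feature names and the functions `token_features` and `sentence_features` are hypothetical, not the feature set actually used in this work.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for the token at position i (names are illustrative)."""
    word = tokens[i]
    features: Dict[str, object] = {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),           # crude morphological cue
        "is_capitalized": word[:1].isupper(),   # orthographic cue
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
    }
    # Neighbouring words give the model local context.
    features["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    features["next_lower"] = (
        tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    )
    return features


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


sentence = ["The", "Company", "believes", "the", "profits",
            "could", "be", "adversely", "affected"]
X = sentence_features(sentence)
```

A sequence labeller would be trained on lists like `X` paired with per-token labels. Keeping the features in plain dictionaries makes it easy to add or remove feature groups between experiments.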

Second, we examine whether the subjective MWEs in U.S. financial statements reflect a firm’s annual unexpected earnings. We used multinomial logistic regression, a discriminative model, to explain the relationships between annual earnings and subjective MWEs and to identify the discriminative subjective MWEs that carry special economic meaning.
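As a rough illustration of how multinomial logistic regression maps MWE features to earnings categories, the sketch below scores each class with a weighted sum of MWE counts and normalizes the scores with a softmax. The class names, the weights, and the feature values are all invented for this example; they are not estimates or results from this study.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x: Dict[str, float],
                  weights: Dict[str, Dict[str, float]],
                  classes: List[str]) -> Dict[str, float]:
    """Class probabilities for one document, given per-class feature weights."""
    scores = [
        sum(weights[c].get(feature, 0.0) * value for feature, value in x.items())
        for c in classes
    ]
    return dict(zip(classes, softmax(scores)))


# Hypothetical earnings categories and made-up weights, for illustration only.
classes = ["negative_surprise", "neutral", "positive_surprise"]
weights = {
    "negative_surprise": {"seriously harmed": 1.2, "might be impaired": 0.4},
    "neutral": {},
    "positive_surprise": {"seriously harmed": -0.8, "might be impaired": -0.1},
}
# Hypothetical MWE counts for one financial statement.
x = {"seriously harmed": 2.0, "might be impaired": 1.0}
probs = predict_proba(x, weights, classes)
```

In a fitted model, the per-class weight of an MWE indicates how strongly that expression separates one earnings outcome from the others, which is one way to read an MWE as discriminative.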


1.3 Contributions

The major contribution of our work is a way to automatically expand the lists of words linked to positive and negative financial statuses into subjective multi-word expressions (MWEs). Unlike traditional models that consider individual words (unigrams) and “bag of words” representations [20], our methods attempt to extract MWEs automatically from textual statements. MWEs could capture subjective evaluations of the financial statuses of the reporting corporations more precisely, so they are potentially more informative than individual words and could help creditors and investors make better decisions.

The other contribution of our work is economic evidence that helps interpret positive and negative financial statuses through subjective multi-word expressions. We found that companies were inclined to use weaker expressions when mentioning negative results. For example, “seriously harmed” and “might be impaired” are both negative expressions describing damage to something, but pragmatically “seriously harmed” conveys stronger destructiveness than “might be impaired”. In reality, corporate management has incentives to hide negative information and to promote positive information. Hence, analyzing the relationships between annual earnings and subjective expressions in financial statements is a difficult problem, because there is information asymmetry between the readers and the reporters of financial statements.

Last but most important, we propose a comprehensive system that might shed light on discovering unseen knowledge and patterns, and that might help decision makers evaluate the financial status of a company.


The paper is organized as follows. First, we propose a computational procedure to model the text in financial statements, which uses conditional random field models for opinion pattern identification. Second, we trained and tested the models on the annotated MPQA corpus [21] to tune and evaluate the CRF models effectively. Third, we employed the best-performing CRF model found in a sequence of experiments to extract subjective opinion patterns in U.S. financial statements. Fourth, we employed multinomial logistic regression to verify whether the opinion patterns are indicative of the earnings of the corporations, and also designed discriminative strategies to quantify the linkages between annual earnings and subjective MWEs. Finally, using the algorithmically identified MWEs, we examined whether companies indeed expressed different strengths of positivity and negativity for different earnings outcomes. The system diagram is presented in Figure 1.1.

[Figure 1.1 The system diagram. Recoverable box labels: MPQA corpus; Train and test opinion pattern; Verify the linkages between earnings and subjective MWEs; Discriminative subjective MWEs; Interpret the economic meanings of subjective MWEs in different earning outcomes.]

Chapter 2 reviews related literature in Finance and Computer Science; Chapter 3 briefly introduces the U.S. financial data and the corpora; Chapter 4 elaborates on the linguistic features and the CRF models that we used to mine the opinion patterns and subjective MWEs; Chapter 5 presents the method for examining the linkages between earnings and subjective MWEs; Chapter 6 discusses the experimental results of evaluating the opinion pattern identification models on the MPQA corpus; Chapter 7 interprets the empirical results on the linkages between earnings and subjective MWEs; and Chapter 8 concludes and reviews the paper.


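Source: “Exploring the Association between Subjective Expressions and Earnings in U.S. Financial Statements: An Application of Opinion Analysis” (title translated from Chinese), NCCU Academic Hub (政大學術集成), pp. 10-16.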
