Data Mining - An Analysis of Investors’ Trading Preferences

(1) Data Mining: A brief explanation

Many complex systems carry huge amounts of data and many parameters, and they often behave in unpredictable, chaotic ways. To reveal the interrelations among these data and parameters, a number of new methods of computer analysis have been developed. One useful tool for discovering the relationships among parameters is data mining.

Data mining has been defined in various ways. In summary, it can be explained as follows (Jones, 2001, p. 665):

“The search for relationships and global patterns that exist in large databases but are ‘hidden’ among the vast amounts of data.”

“The semi-automated process of extracting previously unknown, potentially useful and interesting knowledge from large and often disparate, historical data sources.”

The objective of data mining is to find valuable information in vast amounts of data, which helps managers make more useful and precise decisions. The main difference between statistical analysis and data mining is that the former is based on hypothesis and verification.

Statistical analysis tries to find relationships in comparatively small data sets or to analyze different targets statistically. Data mining, on the other hand, is based on discovery and focuses on pattern identification.

(2) Models of Data Mining

Data mining models fall mainly into four types: data classification, data association, data clustering, and sequential pattern mining. Data classification is a form of supervised learning; its function is to analyze the attributes of the data and then to classify and define the data in order to establish classes. One example of data classification is the analysis of disease causes. The objective of data association is to investigate the relationships between data items and to find the items that are highly correlated. Data association is useful in market basket analysis, where it aids in understanding the consumption behavior of consumers.

Data clustering, in contrast, is a form of unsupervised learning. It automatically divides the database into several clusters whose members share similar features. Its objective is to identify the differences between clusters and the similarities among the members of each cluster. The difference between clustering and classification is that in clustering we do not know in advance how many clusters the database will be divided into or which features the partition will be based on. It is therefore essential to interpret the meaning of the resulting clusters. Data classification, by contrast, maps data into predefined groups or classes; hence, it is a prediction of the data pattern.

Sequential pattern mining analyzes how a sequence changes and tries to predict future conditions from the related sequence. For example, from the quantities of notebooks that customers currently buy, we can estimate the sales of flash drives in order to prepare sufficient inventory. In this thesis, we use data clustering as the tool to analyze the potential preferences of various investors.

(3) Methods of Clustering

1. The Hierarchical Clustering

Agglomerative hierarchical algorithms are widely used as an exploratory statistical technique to determine the number of clusters in a data set (Anderberg, 1972). Generally speaking, they work in the following way. First, each of the n objects to be clustered is regarded as a cluster of its own. Next, the clusters are compared with one another using a measure of distance, and the two clusters with the smallest distance are combined. The same process is repeated until the desired number of clusters is obtained. At each stage, only two clusters can be combined, and they cannot be separated after they are joined. Popular hierarchical clustering algorithms include single linkage, complete linkage, average linkage, and Ward’s minimum variance method. The main advantage of hierarchical clustering is that decision makers can choose the appropriate number of clusters themselves according to their needs.
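The merge procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: it uses single linkage with Euclidean distance, and the function name `single_linkage` and the toy points are hypothetical, not taken from the thesis.

```python
from itertools import combinations
from math import dist


def single_linkage(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    # Step 1: every object starts as a cluster of its own.
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(x, y) for x in a for y in b)

    while len(clusters) > k:
        # Step 2: find the pair of clusters with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Step 3: combine them; a merge is never undone (i < j always holds).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data: two tight pairs and one isolated point.
toy_points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.0)]
result = single_linkage(toy_points, 3)
```

Replacing `min` with `max` inside `linkage` would give complete linkage instead; the rest of the loop stays the same.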

However, although hierarchical methods work well for data sets with compact, isolated clusters, they fail to define clusters accurately in messy data that depart from this ideal condition (Mangiameli et al., 1995). In addition, it has been shown that the performance of any single hierarchical clustering method depends on the data conditions, and that no single method is robust over a wide range of data conditions (Mangiameli et al., 1995). Furthermore, the objective of cluster analysis is to define regions in the multidimensional variable space rather than compact and isolated clusters (Helsen and Green, 1991). Because of these problems, we do not use hierarchical clustering as the data mining method in this thesis.

Figure 1: Illustration of a Hierarchical Clustering Algorithm (Vijaya et al., 2006)

2. The Nonhierarchical Clustering

In contrast to hierarchical clustering, nonhierarchical clustering requires the desired number of clusters k to be defined in advance. The nonhierarchical clustering algorithm then groups the n objects into k clusters. In this way, the data in the same cluster share similar characteristics, and the members of different clusters are heterogeneous.

The data set in this thesis is messy and large, and existing studies have shown that some nonhierarchical clustering methods perform better than hierarchical clustering methods. For example, Mangiameli et al. (1995) compare the cluster membership accuracy of the self organizing map (SOM) network with the accuracy of seven popular hierarchical clustering algorithms. Their results show that the SOM neural network clustering methodology is superior to the hierarchical clustering methods. Therefore, we use nonhierarchical clustering to analyze the trading preferences of various investors. Among the nonhierarchical clustering algorithms, we choose the self organizing map, which is the most widely applied unsupervised learning algorithm among artificial neural networks.

(4) Self Organizing Map

The self organizing map (SOM) proposed by Kohonen (1987) is an important artificial neural network. The SOM network is trained by an unsupervised competitive learning algorithm, a process of self organization (Mangiameli et al., 1996). It usually consists of an input layer and the Kohonen layer, which is designed as a two-dimensional arrangement of neurons that maps n-dimensional input to two dimensions. Kohonen’s SOM associates each input vector with a representative output. For each training case, the network finds the winning node, that is, the closest neuron (the neuron with the minimum distance to the case), and moves it toward the training case.
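The competitive learning step can be illustrated with a small pure-Python sketch of a 4×4 map. Several details here are assumptions that the text does not specify: the Gaussian neighbourhood on the grid, the linear decay of the learning rate and radius, the default parameter values, and the names `train_som` and `best_matching_unit` are all hypothetical choices for illustration.

```python
import random
from math import dist, exp


def best_matching_unit(weights, x):
    """Return the grid position of the neuron whose weights are closest to x."""
    return min(weights, key=lambda node: dist(weights[node], x))


def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per neuron on the rows x cols Kohonen layer.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        # Assumed schedule: learning rate and radius shrink linearly.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        for x in data:
            winner = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Assumed Gaussian neighbourhood: the winner moves most,
                # and neurons near it on the grid move a little as well.
                h = exp(-dist(node, winner) ** 2 / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical toy data: two well-separated groups in two dimensions.
toy_data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
som = train_som(toy_data)
bmu_low = best_matching_unit(som, (0.0, 0.0))
bmu_high = best_matching_unit(som, (1.0, 1.0))
```

After training, the two groups should be represented by different neurons, so `bmu_low` and `bmu_high` give the map positions that act as cluster labels.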

Figure 2: Illustration of the Network of 4×4 Kohonen’s Self Organizing Map (Kuo et al., 2002).

Chen et al. (1995) investigate the ability of specific neural network architectures that use unsupervised learning to recover cluster structure from multivariate data sets with various levels of relative cluster dispersion. Their results indicate that the SOM network is the superior clustering technique and that its relative advantage over conventional techniques, such as single linkage and Ward’s minimum variance method, increases with higher levels of relative cluster dispersion in the data.

Eklund et al. (2002) evaluate the performance of SOM for analyzing the financial performance of pulp and paper companies. They collect financial information on 77 companies in the pulp and paper industry and choose several financial ratios. Then, they use Kohonen’s SOM to perform a financial competitor benchmarking of these companies. To verify whether the companies are properly clustered, they confirm the discovered patterns by reading the annual reports. The results of their study suggest that SOM is a feasible and effective tool for analyzing huge amounts of financial data.

(5) The Decision Tree

The decision tree is a common data mining technique for knowledge discovery. It organizes and analyzes the data to extract valuable rules and make predictions, which provide a basis for model construction (Dale et al., 2007). The decision tree takes the form of a top-down tree structure, which splits the data to create “leaves.” Under this structure, one target class is dominant, and each record flows through the tree along a path determined by a series of tests until it reaches a terminal node (Lee et al., 2007).

Because the decision tree can extract rules of the variables, we use it to further analyze the result of SOM in order to obtain more reliable conclusions. There are various decision tree algorithms, such as CART, C5.0, and CHAID. The differences between these algorithms are as follows: how many splits are allowed at each level, how the splits are chosen when the tree is built, and how the tree growth is limited to prevent over-fitting (Berry and Linoff, 2000). In our thesis, we apply both CART and C5.0 to find the rules.
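To make the idea of "how the splits are chosen" concrete, the sketch below scores binary splits on a single numeric variable with Gini impurity, the criterion commonly associated with CART for classification. It is only an illustration: the function names, the toy values, and the labels (standing in for cluster labels) are hypothetical, and a full tree would apply this search recursively over all variables.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best test 'value <= threshold'."""
    best = (None, gini(labels))  # no split: impurity of the whole node
    n = len(values)
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical example: one stock characteristic and two cluster labels.
toy_values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
toy_labels = ["A", "A", "A", "B", "B", "B"]
threshold, impurity = best_split(toy_values, toy_labels)
```

In this toy case the rule "value <= 0.3" separates the two labels perfectly, so the weighted impurity of the split is zero.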

4. Related Work

Data mining techniques have been applied in many business areas. Wong and Selvi (1998) review the literature from 1990 to 1996 that applies neural networks in finance and find that the prediction of stock performance and stock selection is one of the most common application areas. Deboeck (1998) documents a list of financial, economic, and marketing applications of SOM (see Table III). The following are some existing studies that apply data mining methods in stock market research.

Table III

Financial, Economic, and Marketing Applications of SOM.

Financial:
  Initial financial analysis
  Analysis of financial statements
  Financial forecasting
  Prediction of failures
  Rating of financial instruments
  Analysis of investment opportunities
  Selection of investment managers
  Commercial credit risk analysis
  Country credit risk analysis
  Construction of benchmarks
  Strategic portfolio allocations
  Monitoring of financial performance

Economic:
  Analysis of economic trends
  Forecasting of macro-economic indicators
  Mapping socio-economic development
  Knowledge management in economics

Marketing:
  Customer profiling and scoring
  Analysis of customer preferences
  Prediction of customer behavior
  Segmentations of markets
  Direct support to strategic marketing

(1) Application in Forecasting Stock Market Returns

Enke and Thawornwong (2005) apply data mining to forecast stock market returns. They propose an information gain technique from machine learning to evaluate the predictive relationships of financial and economic variables. Their results indicate that trading strategies guided by the neural network models produce higher profits under the same risk exposure than the buy-and-hold strategy. Chen et al. (2003) use neural networks to model and predict the direction of returns in the Taiwan stock market. The performance of the neural network forecasts is compared with that of the buy-and-hold strategy as well as with investment strategies based on forecasts from the random walk model and the parametric generalized method of moments. The results show that the investment strategies based on neural networks obtain better returns than the other investment strategies examined.

(2) Application in Knowledge Discovery in Financial Investments

Kuo et al. (2004) apply SOM networks to identify the movement trends of stocks in the Taiwan stock market in order to improve the accuracy of uncovering trading signals and maximize trading profits. They provide a decision model to help investors make profitable decisions. Li and Kuo (2006) propose a decision support model to provide stock investors with systematic trading strategies. The knowledge discovery process they use is composed of feature representation by K-chart technical analysis, feature extraction by discrete wavelet transform, and modeling by SOM networks. They then verify the forecasting accuracy by performing experiments on the Taiwan Weighted Stock Index from 1991 to 2002. They find that SOM is capable of producing good forecasting accuracies and gain rates.

5. Summary

A large number of studies investigate the trading preferences of stock investors. Most of them focus on the stock preferences of institutional investors as revealed by the holdings in their security portfolios. Research on individual investors’ preferences for stock characteristics, however, is relatively scarce. One possible reason is the difficulty of obtaining data on individual investors (Ng and Wu, 2006). In addition, many studies have examined the stock characteristics that may have significant effects on investors, but only a few investigate the features of investors that differentiate their portfolio choices. More importantly, to the best of our knowledge, no study in Taiwan has explored the trading preferences of investors in the stock market together with the stock characteristics and investor features that affect their choices. Therefore, this thesis aims to identify the relationships between stock characteristics and stock investors along different dimensions in Taiwan.

Almost every existing study that examines the stock preferences of investors applies statistical regression analysis. With the progress of computer techniques, however, a number of new methods of computer analysis have been developed, and data mining is one example. Clustering is a useful data mining tool that partitions the data and assigns the members to the groups with the closest patterns. Among clustering methods, Kohonen’s self organizing map is widely applied and recommended by existing studies. Therefore, we apply SOM in this research and compare its results with those of regression analysis to find whether there are any similarities or differences.

Research Methodology

1. Data

The data set used in this thesis comes from two sources. The first is investors’ trading records provided by a security firm in Taiwan. This data set contains basic background information about each investor, such as gender, region, and wealth, as well as the investors’ trade records, including traded shares, prices, amounts, dates, and times. With the trade records, we can examine the trading behavior of the investors. The second source is the financial data set provided by the Taiwan Economic Journal (TEJ). The main data we use from the TEJ database are the TEJ Company DB, which includes all financial information contained in each firm’s annual financial statements, and the TEJ Equity DB, which records market-related summary figures for each listed firm, such as the P/E ratio and the market-to-book ratio.

In our trading dataset, there are 203,092 trading accounts with 18,481,001 trading records between 2004/1/1 and 2004/12/31. We apply the following criteria for data screening:

Investor accounts that exhibit any of the following features are deleted (93,377 accounts are left for this study):

1. Accounts with null values or errors.

2. Accounts with fewer than five trading records during our sample period (2004/1/1 ~ 2004/12/31).

Because this thesis aims to investigate the trading preferences of investors, we cannot infer much about an investor’s trading preference from only a few trading records. Hence, to make the results more reliable, we consider only the investor accounts with at least five trading records during our sample period.

Stocks that exhibit any of the following features are deleted (after screening, 501 stocks are left in our sample):

1. Stocks whose data contain null values or errors.

2. Stocks whose price fell below $10 during the sample period.

3. Stocks that were not listed on, or were delisted from, the Taiwan Stock Exchange during our sample period.
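The screening rules above can be sketched as a simple filter. The record layout (dictionaries with `account`, `stock`, `price`, and `shares` keys), the lookup `min_price_by_stock`, and the order in which the rules are applied are all assumptions made for illustration; the listing-status rule is omitted because it would need exchange data not shown here.

```python
from collections import Counter

MIN_TRADES = 5    # accounts need at least five trading records
MIN_PRICE = 10.0  # stocks must never trade below $10 in the sample period


def screen_trades(trades, min_price_by_stock):
    """Apply the account and stock screens; return the surviving trade records."""
    # Accounts with any null field are removed entirely.
    bad_accounts = {t["account"] for t in trades if None in t.values()}
    clean = [t for t in trades if t["account"] not in bad_accounts]
    # Accounts with fewer than MIN_TRADES records are removed
    # (assumption: records are counted before the stock screen).
    counts = Counter(t["account"] for t in clean)
    clean = [t for t in clean if counts[t["account"]] >= MIN_TRADES]
    # Stocks whose lowest price is below MIN_PRICE, or unknown, are removed.
    return [t for t in clean
            if min_price_by_stock.get(t["stock"], 0.0) >= MIN_PRICE]


def make_trades(account, stock, n, price=50.0):
    """Build n hypothetical trade records for one account and one stock."""
    return [{"account": account, "stock": stock, "price": price, "shares": 1000}
            for _ in range(n)]


sample = (
    make_trades("A", "S1", 5)                # kept
    + make_trades("B", "S1", 4)              # fewer than five records
    + make_trades("C", "S1", 4)
    + make_trades("C", "S1", 1, price=None)  # account C has a null value
    + make_trades("D", "S2", 5)              # stock S2 fell below $10
)
kept = screen_trades(sample, {"S1": 42.0, "S2": 8.5})
```

Only account A survives in this toy sample; the other three accounts each fail one of the rules.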

After filtering the data, 93,377 accounts with 6,993,052 trading records for 501 firms are left in our sample. Table IV summarizes the trade records of investors in the data set. The records indicate that the number of individual investors and their trade value are both much larger than those of institutional investors. Thus, individual investors play an important role in the stock market, which again shows that analyzing their trading preferences is essential for the Taiwan Stock Exchange.

< Table IV Inserted Here>

2. The Measure of Trading Preferences

The measure of “trading” preference used in this study is modified from the measure of stock preferences developed by Ng and Wu (2006). Their measure considers only the buy trades of investors because their study focuses on investors’ preference for “holding.” This study, in contrast, analyzes investors’ preference for “trading,” and therefore both buy and sell trades are considered in measuring trading preference.

The measure of trading preference ($TP_i^j$) applied by Ng and Wu (2006) is based on the market value of the stocks. In their model, the market capitalization of stock i on day t is its tradable shares multiplied by its closing price on day t. Ng and Wu’s measure of the preference for holding is obtained by dividing the capital inflow by the market capitalization of the firm. However, such a measure underestimates less-wealthy investors’ preference for stocks with high capitalization. When less-wealthy individual investors pour all their wealth into a high-capitalization stock, they clearly have a strong preference for this stock. Under Ng and Wu’s method, however, it is difficult for these investors to obtain a high preference measure.

Based on the above discussion, we develop another measure of the preference for “trading,” based on the number of trades the investor makes in one specific stock. The larger the number of trades, which indicates that the investor trades the stock more frequently, the more the investor prefers trading that stock. The measure of the trading preference of the jth investor for the ith stock ($TTP_i^j$) is defined from the number of trades in stock i made by investor j on day t and $SUM^j$, the total number of trades by investor j during our sample period.

The above measure is designed to capture the relative tendency of an investor to trade a specific stock. A large value of $TTP_i^j$ indicates that the jth investor is relatively more interested in trading the ith stock. For each account, there is one trading preference measure for every stock the investor trades. If the investor trades a stock more than once, the preference values calculated from each trading record are summed to give the investor’s preference for that stock. Finally, there are 1,125,383 records left in our sample.
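One way to read this measure is as the share of an investor’s trades that involve a given stock. The sketch below computes it under that assumption, namely that each trade contributes one unit divided by the investor’s total number of trades and that these contributions are summed per stock. The exact formula is not reproduced here, so the function name, the tuple layout, and the toy records are hypothetical.

```python
from collections import Counter


def trading_preferences(trades):
    """Map (investor, stock) to the share of the investor's trades in that stock.

    `trades` is an iterable of (investor, stock, day) tuples, one per trade.
    """
    trades = list(trades)
    per_pair = Counter((investor, stock) for investor, stock, _day in trades)
    per_investor = Counter(investor for investor, _stock, _day in trades)
    # Summing the per-trade contributions 1 / total equals count / total.
    return {pair: n / per_investor[pair[0]] for pair, n in per_pair.items()}


# Hypothetical records: investor j1 trades S1 three times and S2 once.
toy_trades = [
    ("j1", "S1", "2004-01-05"),
    ("j1", "S1", "2004-01-06"),
    ("j1", "S2", "2004-01-06"),
    ("j1", "S1", "2004-02-01"),
    ("j2", "S2", "2004-03-01"),
]
preferences = trading_preferences(toy_trades)
```

Under this reading the values for one investor sum to one, and the output holds one record per investor and stock pair.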

3. Stock Characteristics

Several characteristics of stocks that may affect investors’ trading preferences are investigated. These stock characteristics are defined as follows:

Age: The number of years since the stock was listed. Through this variable, we can examine whether investors prefer “seasoned” firms or “young” firms.

Beta¹: Beta is calculated from the market model and can be obtained directly from the TEJ equity module. Beta represents the market risk of the stock, and it can be used to investigate whether investors tend to trade risky stocks more.

Dividend yield: Annual dividends per share divided by the stock price per share at the end of 2003. Depending on their wealth (tax bracket) or background, investors may trade stocks with higher dividend yields more heavily.

1 The market model used to calculate Beta is as follows: $R_{it} = \alpha_i + \beta_i R_{mt} + \varepsilon_i$, where $R_{it}$ is the rate of return of stock i, $R_{mt}$ is the rate of return of the market of the Taiwan Stock Exchange Corporation, $\alpha_i$ denotes the intercept, $\beta_i$ is the systematic risk, which reflects the common factors that affect the stock prices of most firms in the market, and $\varepsilon_i$ denotes the unsystematic risk, which is the risk specific to each firm. Beta measures the extent to which stock i is affected by systematic risk; namely, the value of $\beta_i$.
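Although Beta is taken directly from the TEJ equity module in this study, the market model in footnote 1 can be illustrated with an ordinary least squares fit of stock returns on market returns. The sketch below is only a minimal example: the function name `market_model` and the return series are hypothetical, and the estimation window and return frequency used by TEJ are not assumed here.

```python
def market_model(stock_returns, market_returns):
    """OLS estimates (alpha, beta) for R_it = alpha_i + beta_i * R_mt + error."""
    n = len(stock_returns)
    mean_s = sum(stock_returns) / n
    mean_m = sum(market_returns) / n
    # beta = Cov(R_i, R_m) / Var(R_m); the 1/n factors cancel.
    cov = sum((m - mean_m) * (s - mean_s)
              for s, m in zip(stock_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    beta = cov / var
    alpha = mean_s - beta * mean_m
    return alpha, beta


# Hypothetical returns: the stock moves 1.5 times as much as the market.
toy_market = [0.01, -0.02, 0.03, 0.00]
toy_stock = [0.001 + 1.5 * m for m in toy_market]
alpha, beta = market_model(toy_stock, toy_market)
```

A beta above one, as in this toy series, marks a stock that amplifies market movements, which is the sense in which the text calls such stocks risky.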

EPS: The earnings per share of the firm at the end of 2003. EPS is the
