Data Access Tier (DAT) - Design of a Distributed Computing System and Big Data Processing Architecture Based on YARN, Storm, and Spark

1. Plan Center: The Plan Center is responsible for building trading strategy models.

In the Plan Center, users can build trading strategy models with the SVM or LR algorithm.

2. R Academy: This service offers an R program development platform. After creating R models, users can deploy the strategies to the Trade Center to test them in real time.

VI. Data Access Tier (DAT)

The DAT provides access to all data from the databases and from external data sources. All processed data are sent to the DAT, which stores the big data persistently with low latency.

The DAT also offers an order interface that connects to external securities dealers to transmit user orders. In addition, the DAT aggregates the market quotes of individual ticks into one K-Bar every second, and it also sends K-Bars at several other time granularities.
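The tick-to-K-Bar aggregation can be sketched as follows. This is a minimal illustration, not the DAT implementation: it assumes a K-Bar is an open/high/low/close/volume candlestick and that ticks arrive in time order, and the names `KBar` and `ticks_to_kbars` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class KBar:
    second: int    # start of the bar, in whole seconds
    open: float
    high: float
    low: float
    close: float
    volume: int


def ticks_to_kbars(ticks: Iterable[Tuple[float, float, int]],
                   granularity: int = 1) -> List[KBar]:
    """Group (timestamp_seconds, price, volume) ticks into bars.

    Assumption: ticks are already sorted by timestamp.
    """
    bars: List[KBar] = []
    for ts, price, volume in ticks:
        # Start of the time bucket this tick belongs to.
        bucket = int(ts // granularity) * granularity
        if bars and bars[-1].second == bucket:
            bar = bars[-1]
            bar.high = max(bar.high, price)
            bar.low = min(bar.low, price)
            bar.close = price
            bar.volume += volume
        else:
            bars.append(KBar(bucket, price, price, price, price, volume))
    return bars


# Illustrative ticks: three in the first second, one in the next.
ticks = [(0.10, 100.0, 2), (0.55, 101.5, 1), (0.90, 99.5, 3), (1.20, 100.5, 1)]
one_second_bars = ticks_to_kbars(ticks, granularity=1)
five_second_bars = ticks_to_kbars(ticks, granularity=5)
```

The same function serves several time granularities by changing `granularity`, which mirrors the idea of sending K-Bars at different granularities from one tick stream.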

The DAT has three data stores for different types of data:

1. MongoDB: MongoDB stores user-related data, such as user account data, the trading strategies created by users, and dynamic website content. Most actions of online users receive a quick response because MongoDB access times are very low.

2. HBase: The State Center saves market state data in HBase, which lets us write large volumes of data in a very short period of time. If the HFT system supports calculation for 1,500 futures, the DAT needs to store 1,977,000 market states per second (1,500 futures * 1,318 market states for every one-second K-Bar) and 35,586,000,000 market states per day (1,977,000 market states per second * 18,000 seconds per day). That is big data.


3. HDFS: Before the Plan Center builds a model, it reads a large volume of market state data from HBase and writes it into HDFS. The Plan Center then accesses the data through Spark RDDs to build strategy models.
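The storage load quoted for HBase can be reproduced with a short calculation. The three constants come from the text above; the function name `market_state_load` is only illustrative.

```python
STATES_PER_KBAR = 1318            # market states per one-second K-Bar (from the text)
TRADING_SECONDS_PER_DAY = 18_000  # seconds per trading day (from the text)


def market_state_load(futures: int) -> tuple:
    """Return (market states per second, market states per day)."""
    per_second = futures * STATES_PER_KBAR
    per_day = per_second * TRADING_SECONDS_PER_DAY
    return per_second, per_day


per_second, per_day = market_state_load(1500)
print(f"{per_second:,} market states per second")  # 1,977,000
print(f"{per_day:,} market states per day")        # 35,586,000,000
```

Because the load grows linearly with the number of futures, the same function gives the expected write volume for any other symbol count.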

In addition to the four tiers above, the HFT system includes a unified messaging middleware implemented with Kafka. The messaging system is critical to the HFT system because messages are transmitted at high frequency. The architecture is complicated and delicate. We use many cloud computing technologies, which we describe in the following sections.

Lightning Calculation for Market States and Low Latency Storage──State Center

The State Center is in charge of computing market states from real-time quotes and saving the resulting market states into the database. To leave a span of at least 500 milliseconds for the subsequent machine learning model execution, the calculation must finish within tens of milliseconds (excluding transmission overhead).

Originally, we implemented the State Center in Storm as a single topology in order to calculate market states quickly. Figure 5 shows the initial State Center topology.

In the topology, the KafkaSpout receives real-time quotes from the external RealtimeDataPublisher service, which subscribes to real-time quotes from the live market and continually sends the tick prices through the distributed messaging system, Kafka.

The KafkaSpout then passes the prices to the following 18 ComputeStateBolts. Each ComputeStateBolt carries a different ComputeStateLogic and uses it to calculate the market states defined by a specific technical analysis (TA) logic. The 18 ComputeStateBolts then send the resulting market states to the WriteDataBolt for their TA, and every WriteDataBolt writes the corresponding TA data into HBase. For example, the MA (Moving Average, a technical analysis tool) ComputeStateBolt sends its market states to the MAWriteDataBolt, which stores MA market states exclusively. Outside the topology, all black lines in the figure represent messages transmitted through the Kafka distributed messaging system, while Netty provides the messaging channels inside the Storm topology.

National Chengchi University (國立政治大學)

Figure 5: State Center – A Topology
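As an illustration of what one ComputeStateLogic might compute, the sketch below maintains a simple moving average over closing prices. It is an assumption-laden stand-in: the class name `MovingAverageLogic`, the window size, and the choice of a simple (unweighted) average are hypothetical, and the actual TA definitions used by the State Center are not given here.

```python
from collections import deque
from typing import Deque, Optional


class MovingAverageLogic:
    """Hypothetical stand-in for one ComputeStateLogic: a simple moving average."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.prices: Deque[float] = deque(maxlen=window)
        self.total = 0.0

    def update(self, close: float) -> Optional[float]:
        """Feed one K-Bar closing price; return the MA once the window is full."""
        if len(self.prices) == self.window:
            # The oldest price is about to be evicted by the bounded deque.
            self.total -= self.prices[0]
        self.prices.append(close)
        self.total += close
        if len(self.prices) < self.window:
            return None
        return self.total / self.window


ma = MovingAverageLogic(window=3)
ma_states = [ma.update(price) for price in [1.0, 2.0, 3.0, 4.0]]
```

Keeping a running total makes each update constant-time, which matters when the state must be refreshed for every one-second K-Bar.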

After our improvement, we combine all ComputeStateBolts into one SymbolComputeStateBolt, which is responsible for all TA calculations for 1 or N futures. We call this design “State Center – B”. Figure 6 shows the new topology.


Figure 6: State Center – B Topology

The performance of “State Center – B” is better than that of the previous “State Center – A”.

We improved and simplified the message transmission between the KafkaSpout and the SymbolComputeStateBolt, which greatly reduces data transmission time. The Experiments chapter shows the performance results of State Center – A and State Center – B.
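The difference between the two designs can be sketched by counting spout-to-bolt transfers per K-Bar. This is a toy model under stated assumptions: the three TA logics are hypothetical placeholders for the 18 real ones, and we assume that in State Center – A each bolt receives its own copy of the data, which the text implies but does not quantify.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical TA logics: each maps a list of closing prices to one state value.
StateLogic = Callable[[List[float]], float]

TA_LOGICS: Dict[str, StateLogic] = {
    "MA3": lambda closes: sum(closes[-3:]) / len(closes[-3:]),
    "HIGH": lambda closes: max(closes),
    "LOW": lambda closes: min(closes),
}


def state_center_a(closes: List[float],
                   logics: Dict[str, StateLogic]) -> Tuple[Dict[str, float], int]:
    """One bolt per TA logic: the spout sends a copy of the data to every bolt."""
    transfers = 0
    states: Dict[str, float] = {}
    for name, logic in logics.items():
        transfers += 1                # one spout-to-bolt message per logic
        states[name] = logic(closes)
    return states, transfers


def state_center_b(closes: List[float],
                   logics: Dict[str, StateLogic]) -> Tuple[Dict[str, float], int]:
    """One combined bolt: the data is sent once and every logic runs locally."""
    transfers = 1
    states = {name: logic(closes) for name, logic in logics.items()}
    return states, transfers


closes = [1.0, 2.0, 3.0, 4.0]
states_a, transfers_a = state_center_a(closes, TA_LOGICS)
states_b, transfers_b = state_center_b(closes, TA_LOGICS)
```

Both variants produce the same market states; only the number of messages differs, which is the effect the combined bolt is meant to exploit.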

Build Model Fast Over Big Data Sets──Plan Center

Investment strategies, also called investment models, are critical for an HFT system. The Plan Center takes charge of creating investment strategies. Using machine learning algorithms, it builds investment strategies from historical market data, and it exposes many customizable algorithm parameters to users. In the HFT system, strategy creation must satisfy two requirements:

1. Fast: We need to create the models before the trend changes, because trends in high-frequency trading are transient and easy to miss.

2. Big data loading in a short time: When the Plan Center runs a machine learning algorithm to create investment strategies, it needs to load large historical data sets from HBase and analyze them in a short time.


To meet both the speed and the data-volume requirements, we implement the Plan Center on the Apache Spark framework. With Spark RDDs, the Plan Center can load hundreds of gigabytes of market state data into memory and analyze them across many nodes in the cluster [6] [7].

Plan Center offers several machine learning algorithms for the creation of investment strategies, such as Support Vector Machine (SVM), Logistic Regression (LR) and Classification.
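To make the idea of an LR-based strategy model concrete, the sketch below trains a logistic regression with batch gradient descent on a toy data set. It is a single-machine, plain-Python illustration and not the Spark-based Plan Center: the feature meanings, the labels (1 for "price rises in the next bar"), and all function names are assumptions made for the example.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_up_probability(weights: Sequence[float], x: Sequence[float]) -> float:
    """Probability of an upward move; the last weight is the bias term."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + weights[-1])


def train_logistic_regression(features: List[List[float]], labels: List[int],
                              learning_rate: float = 0.5,
                              epochs: int = 500) -> List[float]:
    """Fit weights by batch gradient descent on the average log loss."""
    n_features = len(features[0])
    weights = [0.0] * (n_features + 1)
    n = len(features)
    for _ in range(epochs):
        grads = [0.0] * (n_features + 1)
        for x, y in zip(features, labels):
            err = predict_up_probability(weights, x) - y
            for j, v in enumerate(x):
                grads[j] += err * v
            grads[-1] += err          # gradient of the bias term
        weights = [w - learning_rate * g / n for w, g in zip(weights, grads)]
    return weights


# Toy market states (hypothetical): [price minus moving average, momentum].
features = [[-1.0, -0.5], [-0.8, -0.2], [-0.3, -0.4],
            [0.4, 0.3], [0.9, 0.1], [0.7, 0.6]]
labels = [0, 0, 0, 1, 1, 1]
model = train_logistic_regression(features, labels)
p_up = predict_up_probability(model, [1.0, 0.5])
p_down = predict_up_probability(model, [-1.0, -0.5])
```

In a distributed setting the per-sample gradient sum is the part that parallelizes across partitions, which is why this kind of iterative algorithm benefits from keeping the data in memory.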

Forecast Market Trends Rapidly and Accurately──Trade Center

After the investment strategies are produced, users can choose the ones they want to use on the web pages. Figure 7 shows the original architecture of the Trade Center.

Figure 7: The Original Design for Trade Center

The Integration of State Center and Trade Center

Pulling data from Kafka queues takes very little time, commonly less than tens of milliseconds. However, the Netty messaging system inside Storm is much faster than Kafka, usually taking only several milliseconds to pass a message from one vertex to another. To reduce the transmission overhead for market states, we combine the State Center topology and the Trade Center topology into one large topology, named the HFT system.

The HFT system is in fact one large topology that pulls data from Kafka queues and writes data into HBase and MongoDB. Figure 8 shows the architecture of the whole HFT topology. By combining the State Center and the Trade Center, we reduce the message transmission time to several milliseconds. The complete experimental data are described in detail in the Experiments chapter.

Figure 8: The architecture of the HFT topology

Cluster Resources Management

Because of the complexity of the HFT architecture, we use YARN to launch most services and manage all resources on the cluster. We also monitor the health of each node's services, monitor the status of our services on web pages, and operate the cluster services easily [8].

EXPERIMENTS

In the previous section, we introduced the requirements of high-frequency trading and the details of the n-tier architecture used for it. In the HFT system, the State Center computes market states and the Trade Center then uses them to forecast market trends. The whole process must finish within 1 second because the HFT system forecasts the market trend 1 second ahead. As a result, the State Center must complete all of its calculations within 500 milliseconds, and so must the Trade Center. Because the State Center needs to calculate a large number of market states, we mainly examine its performance. In this section, we present experimental results to show that the architecture's performance is good enough for high-frequency trading.
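The timing constraint described above amounts to a simple budget check, sketched here. The 500-millisecond and 1-second limits come from the text; the function name and the sample measurements are illustrative only and are not experimental results.

```python
STATE_CENTER_BUDGET_MS = 500   # from the text
TRADE_CENTER_BUDGET_MS = 500   # from the text
END_TO_END_BUDGET_MS = 1000    # one-second forecasting cycle


def check_latency(state_center_ms: float, trade_center_ms: float) -> dict:
    """Report whether each stage and the whole cycle stay within budget."""
    return {
        "state_center_ok": state_center_ms <= STATE_CENTER_BUDGET_MS,
        "trade_center_ok": trade_center_ms <= TRADE_CENTER_BUDGET_MS,
        "end_to_end_ok": state_center_ms + trade_center_ms <= END_TO_END_BUDGET_MS,
    }


# Made-up measurements, used only to exercise the check.
within_budget = check_latency(state_center_ms=40, trade_center_ms=300)
over_budget = check_latency(state_center_ms=650, trade_center_ms=300)
```

The second call shows that a cycle can fit within 1 second overall while one stage still exceeds its own 500-millisecond share.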

Experimental Environment

To evaluate the architecture's performance, we designed several experiments. Because of the high-frequency and real-time requirements, the system must be able to handle a large number of requests in a very short time. In particular, the State Center needs to calculate millions of market states within one second, so the experiments focus on the State Center. We compare the results for different numbers of futures to find out how much load the architecture can handle. For the experiments, we prepared a cluster of 8 computers, 6 of which act as supervisors running the Storm topology. Table 1 gives more information about the execution environment.


Table 1: The detailed information of the cluster

To test the limits of this architecture and find the most appropriate configuration for the cluster, we compare the average computing time of all market states in each experiment. We add a bolt to the architecture to collect metric data on the market state computation.

Implementation of the Experiments

To carry out this analysis, we add a new bolt, named ExpStateReceiverBolt, to our original topology to collect all computing metrics for market states. Once all market states of one K-Bar have arrived, this bolt adds the K-Bar's cost time to the accumulated total and calculates the average value. Figure 9 shows the performance of State Center – A, Figure 10 shows the performance of State Center – B, and Figure 11 shows the number of market states for N futures. Table 2 compares the experimental results of State Center – A and State Center – B.
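The metric collection performed by ExpStateReceiverBolt can be sketched as a small accumulator. This is not the Storm bolt itself: the class `ExpStateReceiver` is a hypothetical plain-Python stand-in, and it assumes that the cost time of a K-Bar is the elapsed time of its slowest market state, which the text does not specify.

```python
from typing import Dict, Optional


class ExpStateReceiver:
    """Hypothetical stand-in for the metric logic of ExpStateReceiverBolt."""

    def __init__(self, expected_states_per_kbar: int) -> None:
        self.expected = expected_states_per_kbar
        self.received: Dict[str, int] = {}    # K-Bar id -> states seen so far
        self.slowest: Dict[str, float] = {}   # K-Bar id -> largest cost seen (ms)
        self.total_cost_ms = 0.0
        self.completed_kbars = 0

    def on_state(self, kbar_id: str, cost_ms: float) -> Optional[float]:
        """Record one market state.

        Returns the running average cost once the K-Bar is complete,
        otherwise None.
        """
        self.received[kbar_id] = self.received.get(kbar_id, 0) + 1
        self.slowest[kbar_id] = max(self.slowest.get(kbar_id, 0.0), cost_ms)
        if self.received[kbar_id] < self.expected:
            return None
        # All states of this K-Bar have arrived: fold its cost into the total.
        self.total_cost_ms += self.slowest.pop(kbar_id)
        del self.received[kbar_id]
        self.completed_kbars += 1
        return self.total_cost_ms / self.completed_kbars


receiver = ExpStateReceiver(expected_states_per_kbar=2)
averages = [
    receiver.on_state("kbar-1", 10.0),
    receiver.on_state("kbar-1", 30.0),
    receiver.on_state("kbar-2", 5.0),
    receiver.on_state("kbar-2", 10.0),
]
```

Completed K-Bars are removed from the bookkeeping dictionaries, so memory use stays bounded by the number of K-Bars still in flight.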

Host CPU RAM HDD OS Storm

nccu-n01 Intel(R) Core(TM)2 Quad