Chapter 1 Introduction
1.4 Thesis Organization
The rest of this thesis is organized as follows. Chapter 2 introduces the background of the FNNs used in this thesis. Chapter 3 introduces the proposed methodology. Chapter 4 presents our experimental results, and Chapter 5 concludes the thesis and summarizes its contributions.
Chapter 2
Background & Preliminary
This chapter introduces the background and motivation of this work. Section 2.1 introduces the GPGPU computing platform, the NVIDIA Fermi architecture, and its programming environment, CUDA. Section 2.2 introduces SONFIN, a classical type of FNN that is widely used in many domains.
2.1 GPU Computing
Fig. 2. NVIDIA Fermi architecture.
The Graphics Processing Unit (GPU) was originally designed for computer graphics only. The computations in graphics applications are often independent, massive, and regular, so GPU architectures have always focused on computational throughput. The CPU, on the other hand, needs to handle more complicated control flow. The major difference between the two is therefore that a GPU consists of many simple processing elements, whereas a CPU consists of a few complex processing units. As general applications grow more complex, their runtime on the CPU keeps getting longer. For this reason, GPUs have been applied to various algorithms in many areas, and a GPU used in this way is called a general-purpose GPU (GPGPU).
2.1.1 Fermi Architecture
NVIDIA is one of the companies that focus on GPGPU design, and Fermi is an architecture announced by NVIDIA Corporation [18]. The Fermi architecture is a single-instruction-multiple-thread (SIMT) system, as shown in Fig. 2, and it contains several streaming multiprocessors (SMs). All the SMs share a unified L2 cache and DRAM. Each SM contains 32 CUDA cores, four special function units, and a 64 KB local memory that is shared only among the CUDA cores within that SM. These cores can execute a large number of threads in parallel. For example, the NVIDIA Tesla C2050 can launch up to 1536 threads per SM. NVIDIA also provides the CUDA programming model [19], which lets programmers control thousands of threads on a GPGPU by defining a thread hierarchy.
2.1.2 CUDA Programming Model
CUDA is a parallel programming model whose programs scale across GPUs with any number of processors without recompiling. As shown in Fig. 3, the parallel portions of an application are executed on the device as CUDA kernels. For each CUDA kernel, programmers have to define the CUDA thread hierarchy. The right-hand side of Fig. 3 shows this hierarchy, which contains three levels: grid, block, and thread. A CUDA kernel is executed by an array of threads, and all the threads run the same code. Each thread has its own ID, which is used to compute memory addresses and make control decisions. Fig. 4 shows an example of device code, which is executed by every thread, and the device function call, which is used in the main function to launch the CUDA kernel. In a device function call, the configuration of the CUDA kernel is defined inside the “<<< >>>”. CUDA supports several standard languages and APIs, such as C, OpenCL, Fortran, and DirectCompute, and we use CUDA C to implement our program in this thesis. CUDA is also supported on common operating systems, such as Windows, Mac OS, and Linux.
Fig. 3. CUDA programming model.
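The role of the thread ID can be illustrated with a small host-side emulation. The sketch below is plain Python, not CUDA C, and assumes a 1-D grid of 1-D blocks; the function names `global_thread_ids` and `run_kernel` are hypothetical and only mimic how each thread would derive a global index from its block index, block size, and thread index.

```python
def global_thread_ids(grid_dim, block_dim):
    """Enumerate (block index, thread index, global ID) for a 1-D grid of 1-D blocks."""
    ids = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Same arithmetic a CUDA thread would typically use:
            # blockIdx.x * blockDim.x + threadIdx.x
            global_id = block_idx * block_dim + thread_idx
            ids.append((block_idx, thread_idx, global_id))
    return ids


def run_kernel(data, grid_dim, block_dim):
    """Emulate a kernel in which every thread doubles one element of `data`."""
    out = list(data)
    for _, _, gid in global_thread_ids(grid_dim, block_dim):
        # Control decision based on the thread ID: surplus threads do nothing.
        if gid < len(data):
            out[gid] = 2 * data[gid]
    return out


doubled = run_kernel([1, 2, 3, 4, 5], grid_dim=2, block_dim=4)
```

Here two blocks of four threads cover five elements, so the last three threads fail the bounds check and stay idle, which is the usual pattern when the data size is not a multiple of the block size.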
During the execution of a CUDA kernel, the block scheduler issues several thread blocks to each SM, and each SM further divides the thread blocks into warps of 32 threads, which are executed in parallel on the cores. In the Fermi architecture, each SM has limits on the maximum number of resident blocks, warps, and threads, and these limits vary with the compute capability. For example, on an NVIDIA Tesla C2050 graphics card, the maximum numbers of thread blocks, warps, and threads per SM are 8, 48, and 1536, respectively.
According to the official CUDA programming guide [20], occupancy shows how effectively the hardware is kept busy. It is the ratio of active warps to limit warps, where limit warps is the maximum number of warps on an SM. The definition of occupancy is
$$\text{occupancy} = \frac{\text{active warps}}{\text{limit warps}}$$
When defining the CUDA thread hierarchy, the size of the thread blocks strongly influences the occupancy. For example, if limit warps is 48, the maximum number of thread blocks is 8, and the block size is 32, then 8 thread blocks and 8 warps are issued on an SM, so the occupancy is 8/48 ≈ 0.167. As another example, if limit warps is 48 and the block size is 192, then 8 blocks and 48 warps are issued on an SM, so the occupancy is 1. Although higher occupancy does not always equate to higher performance, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation [21].
Fig. 4. Device code and device function call.
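The relation between block size and occupancy can be checked with a short calculation. The following Python sketch is a simplified model that uses only the per-SM limits quoted in this section (8 blocks, 48 warps, 1536 threads, 32 threads per warp); it ignores register and shared-memory limits, which also constrain occupancy on real hardware. The function name `occupancy` and its parameters are hypothetical.

```python
import math


def occupancy(block_size, limit_warps=48, max_blocks=8,
              max_threads=1536, warp_size=32):
    """Ratio of active warps to limit warps on one SM (simplified model)."""
    warps_per_block = math.ceil(block_size / warp_size)
    # The number of resident blocks is bounded by all three per-SM limits.
    blocks = min(max_blocks,
                 limit_warps // warps_per_block,
                 max_threads // block_size)
    active_warps = blocks * warps_per_block
    return active_warps / limit_warps


occ_small = occupancy(32)    # 8 blocks x 1 warp  = 8 active warps
occ_full = occupancy(192)    # 8 blocks x 6 warps = 48 active warps
occ_large = occupancy(1024)  # 1 block  x 32 warps = 32 active warps
```

Under this model, very small blocks are limited by the block count, while very large blocks are limited by the warp and thread counts, so full occupancy is reached only for intermediate block sizes such as 192.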
2.2 The self-constructing neural fuzzy network (SONFIN)
Fig. 5 shows the modified structure of the SONFIN. The original SONFIN structure has six layers. To make the comparison between the serial and parallel versions easier, we reduce the number of layers from six to four. Each fuzzy rule in the SONFIN has the form:
Rule $R_k$: IF $x_1$ is $A_{k1}$ AND … AND $x_n$ is $A_{kn}$ THEN $y_l$ is $w_{kl}$, $k = 1, \dots, r$,
where $A_{kj}$ is a fuzzy set, $w_{kl}$ is a real number, and $r$ is the total number of rules.
The SONFIN is a general connectionist model of a fuzzy logic system, which can find its optimal structure and parameters automatically. The function of each layer is described below.
Fig. 5. Modified structure of the SONFIN.
Layer 1:
Each node corresponds to one input dimension (dim). No computation is done in this layer; each node simply transmits its input value to the next layer.
Layer 2:
Each node corresponds to one fuzzy set and calculates a membership value, which specifies the degree to which an input value belongs to that fuzzy set. The fuzzy set $A_{kj}$ uses the Gaussian membership function:
where $m_{kj}$ is the center of the fuzzy set and $\sigma_{kj}$ denotes its width. The number of fuzzy sets in each dim is therefore equal to the number of fuzzy rules.
Layer 3:
A node in this layer represents one fuzzy rule and performs antecedent matching of a rule. The following AND operation is performed for each node in layer 3:
where L is the number of output dimensions.
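To make the data flow through the four layers concrete, the following Python sketch runs one forward pass. It is a sketch under assumptions and not the exact formulation of this thesis: the Gaussian membership is taken as exp(-(x - m)^2 / σ^2), the AND operation as an algebraic product, and the output as a firing-strength-weighted average of the consequent values. The function name `sonfin_forward` and its argument names are hypothetical.

```python
import math


def sonfin_forward(x, centers, widths, weights):
    """One forward pass of a four-layer fuzzy network (assumed formulation).

    x       : list of n input values (layer 1 simply passes them on)
    centers : r x n list of Gaussian centers, one row per rule
    widths  : r x n list of Gaussian widths, one row per rule
    weights : r x L list of consequent values, one row per rule
    """
    r = len(centers)
    n_out = len(weights[0])
    # Layer 2: membership value of every input dim in every rule's fuzzy set.
    mu = [[math.exp(-((x[j] - centers[k][j]) ** 2) / widths[k][j] ** 2)
           for j in range(len(x))]
          for k in range(r)]
    # Layer 3: firing strength of each rule (AND assumed to be a product).
    phi = [math.prod(row) for row in mu]
    # Layer 4: one output per output dimension (weighted average assumed).
    total = sum(phi)
    y = [sum(phi[k] * weights[k][l] for k in range(r)) / total
         for l in range(n_out)]
    return mu, phi, y


mu, phi, y = sonfin_forward(
    x=[0.0, 1.0],
    centers=[[0.0, 1.0], [1.0, 1.0]],
    widths=[[1.0, 1.0], [1.0, 1.0]],
    weights=[[2.0], [4.0]],
)
```

In this example the first rule matches the input exactly, so its firing strength is 1, while the second rule is one width away in the first dimension and fires with strength exp(-1).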
Fig. 6 shows the flow chart of the SONFIN, which includes structure learning and parameter learning. Initially there are no rules; rules are constructed during structure learning. The firing strength in layer 3 is used to decide whether a new fuzzy rule is generated. If , a new rule
Fig. 6. Flow chart of the SONFIN
The parameters are updated using the equation below:
where the learning constant influences the convergence speed of the gradient descent algorithm.
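The influence of the learning constant on convergence speed can be seen on a toy problem. The Python sketch below applies plain gradient descent to the one-dimensional error function E(w) = (w - target)^2, which is an illustrative stand-in and not the SONFIN error function; the name `gradient_descent_steps` and the symbol `eta` for the learning constant are assumptions.

```python
def gradient_descent_steps(eta, w0=0.0, target=3.0, tol=1e-6, max_steps=100000):
    """Count the update steps needed until the gradient of (w - target)^2 vanishes."""
    w = w0
    for step in range(max_steps):
        grad = 2.0 * (w - target)
        if abs(grad) < tol:
            return step, w
        # Gradient descent update: move against the gradient, scaled by eta.
        w -= eta * grad
    return max_steps, w


steps_fast, w_fast = gradient_descent_steps(eta=0.1)
steps_slow, w_slow = gradient_descent_steps(eta=0.01)
```

On this problem the smaller learning constant needs considerably more steps to reach the same tolerance, although both runs converge to the same minimum.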
2.3 Related Work
FNNs have been studied for decades, and many FNNs with different properties have been proposed. ANFIS (Adaptive Neuro-Fuzzy Inference Systems) was proposed in 1993 [2]. It is a widely used FNN that performs only parameter learning. ANFIS uses a hybrid learning procedure to build the connection between training samples and network output based on both human knowledge and stipulated input-output data pairs. However, the number of fuzzy rules in FNNs with only parameter learning increases exponentially with the dimension of the input space. FNNs with structure learning ability [3]-[16] have been proposed to reduce the number of generated rules.
Recently, several studies have focused on parallel neural networks [22][23] and parallel fuzzy neural networks [17][24]. The work in [22] implemented a parallel neural network by mapping the inner-product operation onto a matrix multiplication operation. The work in [23] provided an implementation of the back-propagation algorithm on CUDA, and its author claimed that the number of threads should be as large as possible so that the CUDA scheduler can better utilize the available computational power. The first adaptation of the Fuzzy ARTMAP neural network to a GPGPU was proposed in [24]. Juang and Chen [17] proposed an implementation of a zero-order Takagi-Sugeno-Kang-type fuzzy neural network on a GPU. To the best of our knowledge, [17] is the first work that gives a detailed SONFIN design on a GPGPU. This thesis uses it as the baseline design for comparing experimental results.
However, the performance of a parallel application on a GPGPU depends heavily on how the developer manages blocks of threads and how effectively the GPGPU hardware resources are used. The thread mapping of an NVIDIA CUDA kernel determines how much parallelism can be exploited by a GPGPU. The thread mapping of the GPU-FNN [17] is partitioned based on fuzzy rules, so that each fuzzy rule in an FNN is mapped onto one thread block. GPU-FNN can make good use of the parallel fuzzy rules in some cases, for example, 192 to 768 dim on an NVIDIA Tesla C2050. However, the range of dim can vary significantly across applications. For example, an artificial detection application might have tens of dim [25], while the protein mutant data set could involve more than five thousand dim [26]. The design of a parallel FNN needs to cover the whole possible range of dim from different applications.
Moreover, the current version of the CUDA programming model limits a thread block to at most 1,024 threads. Therefore, the mapping method proposed in [17] cannot support applications whose dim is too high, such as the protein mutant application [26].
architecture into account. In [17], blocks of threads are partitioned based on fuzzy rules, so that the hardware is not used efficiently for some training samples. In summary, the thread mapping mechanisms of these works cannot adapt to training samples with different characteristics or to architectures with different features.
The performance of a parallel application on a GPGPU is highly dependent on how effectively the created threads exploit the GPGPU architecture. The decisive factor is the thread mapping mechanism, which connects a multi-threaded application to the underlying many-core system. This becomes a non-trivial problem when implementing an FNN on a GPGPU. The main design concerns can be characterized as follows:
(1) Parallelism and coordination. The way an application is parallelized and the way the concurrent computations are coordinated play important roles in GPGPU computing. A well-parallelized application can reduce the computational burden on the GPGPU and achieve a superior performance improvement. An inappropriate design, however, may overlook important issues, such as insufficient parallelism or severe resource contention, and cause degraded performance.
(2) Thread mapping. Although FNNs have massive computational parallelism, the thread mapping must be well designed because it can significantly influence the efficiency of hardware utilization. Designing an efficient thread mapping for a parallel FNN is not straightforward: it must take many factors into account, such as the compute capability of the GPGPU used, the version of CUDA, and the dim and number of rules of the FNN being trained.
(3) Adaptability and scalability. The number of cores in a GPGPU is predicted to scale with advances in semiconductor technology. The performance of a multi-threaded FNN should scale automatically with the large number of cores provided by future GPGPUs, without redesign overhead. Moreover, the design should also adapt to the changing number of fuzzy rules and dim of an FNN.
Because of the design concerns mentioned above, we propose an architecture-aware thread mapping methodology for FNNs on GPGPUs. It creates efficient coordination between the concurrent computations and the GPGPU hardware, based on training samples with different characteristics and GPGPU architectures with different features.
Chapter 3
Our Proposed Approach
Fig. 7 shows a design flow for parallelizing and optimizing FNNs on a GPGPU. This thesis uses the modified SONFIN as the main application to demonstrate the effectiveness of the proposed design flow, which considers several important issues of FNNs in GPGPU computing. The design flow starts from a sequential FNN application. The first stage shown in Fig. 7 is necessary in every parallel design: it decides which parts of the program should be parallelized and executed on the GPGPU. The shaded block on the right-hand side of Fig. 7 is the Architecture-Aware Thread Mapping (ATM). The ATM performs optimizations for each CUDA kernel and contains four stages: 1) fine-grain task decomposition, 2) special function transformation and memory coalescing, 3) task coarsening, and 4) task to thread binding. The details of the design flow are discussed in the following sections.
3.1 Bottleneck Analysis
The first stage of the design flow is the bottleneck analysis. This is arguably the most important and essential stage in designing a parallel program. The purpose of bottleneck analysis is to find the bottlenecks that dominate the runtime of the whole program. According to Amdahl’s law, if a fraction f of a program is accelerated by a factor of S, the overall speedup is:
$$\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{S}}$$
In parallel computing, f is the fraction of a sequential program that can be parallelized, and this fraction is accelerated S times after parallelization. The factor S is determined by how much parallelism the f part contains and how well the GPGPU architecture can be exploited. A larger f increases the impact of S on the overall performance, so it is important to find the parts with the greatest f through bottleneck analysis.
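The effect of f and S on the overall speedup can be checked numerically. The following Python sketch evaluates Amdahl's law in its standard form; the function name `amdahl_speedup` and the sample values of f and S are hypothetical and chosen only for illustration.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the runtime is accelerated s times."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


# The same acceleration factor helps far more when f is large.
speedup_half = amdahl_speedup(0.50, 10.0)
speedup_most = amdahl_speedup(0.95, 10.0)
# Even a practically unlimited S cannot exceed 1 / (1 - f).
speedup_bound = amdahl_speedup(0.95, 1e12)
```

With S fixed at 10, raising f from 0.5 to 0.95 lifts the overall speedup from below 2 to almost 7, and with f = 0.95 the speedup stays below 20 no matter how large S becomes.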
Fig. 7. Design flow of a parallel SONFIN on a GPGPU
Because the structure of an FNN is fixed regardless of the dim and the number of rules, the easiest and most efficient way to capture its computational behavior is to profile the timing of the FNN on a small benchmark with a small dim and a small number of rules. Fig. 8 shows the flow chart of the SONFIN. Through bottleneck analysis we found two bottlenecks, the Gaussian membership function and the update parameter, whose pseudo code is shown in Fig. 9. According to our profiler, the Gaussian membership function takes about 80% and the update parameter takes about 15% of the total runtime.
Fig. 8. Flow chart of the SONFIN
3.2 GPGPU Partitioning
Using the result of the bottleneck analysis, we can obtain an initial partitioning. The Gaussian membership function and the update parameter have been identified as the two bottlenecks that should be parallelized as CUDA kernels on the GPGPU. However, in addition to the execution time of each individual function block, the partitioning between GPGPU and CPU should also consider issues such as the effectiveness of the parallel part of the program and the data transfer between the device and the host. In this work, we perform the partitioning based on a rule of thumb. Besides the two bottlenecks of the SONFIN, the rule firing strength calculation is also moved to the GPGPU in our partitioning, because the amount of data transferred between the Gaussian membership function and the rule firing strength calculation is larger than the amount transferred between the rule firing strength calculation and the new rule decision. Based on the above analysis, the final partitioning is shown in Fig. 10. The following sections use this partitioning scheme to perform the optimizations of the ATM methodology.
Fig. 9. Pseudo code of (a) Gaussian membership function and (b) update parameter
Fig. 10. GPGPU partitioning of the SONFIN
3.3 Fine-Grain Task Decomposition
The first stage of the ATM is fine-grain task decomposition. It defines how the computations are executed simultaneously on the GPGPU. Recall that the pseudo code of the two bottlenecks is shown in Fig. 9; both consist of two nested for loops. In each CUDA kernel, we therefore use a 2-D matrix, the Task Matrix (TM), to represent the overall computation. The definition of the TM is:
Definition 1 (Task Matrix) A Task Matrix is a 2-D matrix that represents the overall computation of a CUDA kernel. It is an $r \times dim$ matrix, where r is the total number of rules and dim is the total number of input attributes. Each element of a TM is called a Task, which is defined in Definition 2.
Definition 2 (Task) A task is the computation of one input dimension of one rule. Therefore, $T_{ij}$ is the computation of the rule and input attribute pair (i, j). For example, $T_{23}$ is the computation of the 2nd rule on the 3rd input attribute.
We decompose the parallelism of each CUDA kernel in the most fine-grained way by defining the TM and the Tasks. The reason is that fine-grained decomposition enlarges the configuration space of each CUDA kernel. We later use task coarsening to search this configuration space so that a configuration with better performance can be found easily. The search procedure is discussed in Section 3.5. For the parallel SONFIN, Fig. 11 shows an example of a Task Matrix with 3 dim and 3 rules, which is a 3x3 matrix.
Fig. 11. Task Matrix of the SONFIN with 3 dim and 3 rules
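The Task Matrix can be sketched in a few lines. The Python code below builds the matrix of task indices and shows one possible way to recover the task (i, j) from a flat thread index; the row-major mapping is an assumption made for illustration and is not the task to thread binding of the ATM, and the names `build_task_matrix` and `task_of_thread` are hypothetical.

```python
def build_task_matrix(r, dim):
    """Return an r x dim matrix whose element [i-1][j-1] is the task index (i, j)."""
    return [[(i, j) for j in range(1, dim + 1)] for i in range(1, r + 1)]


def task_of_thread(thread_id, dim):
    """Map a flat, zero-based thread index to a one-based task (rule, input dim).

    Assumes a row-major enumeration with one task per thread.
    """
    i, j = divmod(thread_id, dim)
    return (i + 1, j + 1)


tm = build_task_matrix(r=3, dim=3)
task = task_of_thread(thread_id=5, dim=3)
```

For the 3x3 example, thread 5 is assigned the task of the 2nd rule and the 3rd input attribute, which is the element in the second row and third column of the matrix.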
3.4 Special Function Transformation & Memory Coalescing
The second stage of the ATM contains two optimizations: special function transformation and memory coalescing. Because these two optimizations are independent, they are placed in the same stage and can be performed simultaneously. The purpose of special function transformation is to speed up mathematical operations by using the special function hardware, which is faster than the compiled PTX code. The transformation uses the library functions supported by CUDA for addition, subtraction, multiplication, division, and other mathematical operations. The special functions are faster than the standard functions because they directly use the special function units on the GPGPU. However, the number of special function units on the GPGPU is limited, so the number of standard functions that can be replaced by special functions is also limited. This problem resembles a simplified bin packing problem, so we can use the first-fit algorithm, a straightforward approach, to select the most effective special function transformations.
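For reference, the generic first-fit heuristic for bin packing can be sketched as follows. The text does not specify how candidate transformations and special function units map onto items and bins, so the Python code below shows only the textbook algorithm on hypothetical item sizes; the function name `first_fit` is also hypothetical.

```python
def first_fit(sizes, capacity):
    """Place each item into the first bin with enough room; open a new bin if none fits.

    Returns a list of bins, each a list of item indices in placement order.
    """
    bins = []  # each entry is [remaining capacity, [item indices]]
    for idx, size in enumerate(sizes):
        if size > capacity:
            raise ValueError("item %d does not fit into an empty bin" % idx)
        for b in bins:
            if b[0] >= size:
                b[0] -= size
                b[1].append(idx)
                break
        else:
            # No existing bin had room, so a new bin is opened.
            bins.append([capacity - size, [idx]])
    return [items for _, items in bins]


packing = first_fit([4, 8, 1, 4, 2, 1], capacity=10)
```

First fit examines the items in their given order and never revisits a decision, which keeps it simple at the cost of not guaranteeing an optimal packing.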
memory transaction. As shown in the example of Fig. 12(a), assume that the data needed by a half warp are scattered in memory, so a total of sixteen memory accesses are required. Fig. 12(b) illustrates the case in which the data needed by the half warp are stored adjacently in memory, so all the memory accesses are packed into a single memory transaction.
Given the regular data structure of an FNN, memory coalescing can be achieved by designing the data layout. Generally, two styles of data layout are used for regular data: row-wise and column-wise. As introduced above, as long as the data of adjacent threads are stored at adjacent memory addresses, the memory access is optimized by memory coalescing. However, the number of rules changes during learning, so the data layout is constrained by the direction of dim. Fig. 13 shows our data layout of the parallel SONFIN.
Fig. 12. Coalesced memory access. (a) Non-coalesced memory access. (b) Coalesced memory access.
Fig. 13. Data layout of parallel SONFIN
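The effect of the data layout on memory transactions can be estimated with a simplified model. The Python sketch below assumes 4-byte words and aligned 64-byte segments, and counts one transaction per distinct segment touched by a half warp of 16 threads; real hardware rules are more detailed. The two index functions compare a layout in which the dims of one rule are contiguous with one in which they are strided by the number of rules. All names are hypothetical.

```python
HALF_WARP = 16


def count_transactions(word_indices, word_bytes=4, segment_bytes=64):
    """Count the distinct aligned memory segments touched by a set of word accesses."""
    segments = {(idx * word_bytes) // segment_bytes for idx in word_indices}
    return len(segments)


def dim_contiguous_index(rule, d, dim):
    """Layout in which the dims of one rule are stored next to each other."""
    return rule * dim + d


def rule_contiguous_index(rule, d, r):
    """Layout in which the same dim of all rules is stored next to each other."""
    return d * r + rule


# Thread t of the half warp processes input dim t of rule 0.
coalesced = count_transactions(
    [dim_contiguous_index(0, t, dim=64) for t in range(HALF_WARP)])
scattered = count_transactions(
    [rule_contiguous_index(0, t, r=16) for t in range(HALF_WARP)])
```

Under these assumptions the dim-contiguous layout needs a single transaction for the half warp, whereas the strided layout needs sixteen. The dim-contiguous layout also lets a new rule be appended as one contiguous row when the number of rules grows during learning.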
3.5 Task Coarsening
The third stage of the ATM is task coarsening, an optimization technique that fine-tunes the parallelism decomposition. Fig. 14 shows the concept of task coarsening. A single block represents one computation, and a task is composed of three common computations and one unique computation. There are four tasks that are executed simultaneously, so the total number of computations is 16 after three time stamps. If we coarsen two tasks into one task, the common computations are executed only once and stored in registers, so a task becomes a task’ that consists of three common computations and two unique computations. The total number of computations is therefore reduced to 10 after task coarsening. Moreover, because the number of common computations decreases, the memory accesses in the common computations also decrease. However, task coarsening has some overhead: the indexing costs extra computation in our last stage, task to thread binding.
Fig. 14. Task coarsening.
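The saving from task coarsening can be expressed as a simple count. The Python sketch below models only the number of computations described in this section (common computations executed once per coarsened task, unique computations executed once per original task) and ignores the indexing overhead; the function name `total_computations` and its parameters are hypothetical.

```python
def total_computations(num_tasks, common, unique, factor):
    """Total computations when `factor` original tasks are merged into one task."""
    if num_tasks % factor != 0:
        raise ValueError("factor must divide the number of tasks")
    coarse_tasks = num_tasks // factor
    # Each coarsened task runs the common part once and every unique part.
    return coarse_tasks * (common + factor * unique)


before = total_computations(num_tasks=4, common=3, unique=1, factor=1)
after_pairs = total_computations(num_tasks=4, common=3, unique=1, factor=2)
after_all = total_computations(num_tasks=4, common=3, unique=1, factor=4)
```

For four tasks with three common computations and one unique computation each, merging tasks in pairs lowers the count from 16 to 10. Merging all four lowers it further, but it also reduces the number of tasks that can run in parallel, which is why the coarsening factor has to be searched rather than simply maximized.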
In summary, there are two benefits of the task coarsening, 1) reduce the total amount of