Chung Hua University


Chung Hua University Doctoral Dissertation

GEN_BLOCK Data Redistribution and Optimization Techniques on Parallel, Distributed and Multi-Core Systems

Developing GEN_BLOCK Redistribution Algorithms and Optimization Techniques on Parallel, Distributed and Multi-Core Systems

Program: Ph.D. Program in Engineering Science
Student: D09424005 Shih-Chang Chen

Advisor: Dr. Ching-Hsien Hsu

September 2010


Developing GEN_BLOCK Redistribution Algorithms and Optimization Techniques on Parallel, Distributed and Multi-Core Systems

By

Shih-Chang Chen

Advisor: Prof. Ching-Hsien Hsu

Ph.D. Program in Engineering Science
Chung-Hua University

707, Sec.2, WuFu Rd., HsinChu, Taiwan 300, R.O.C.

September 2010


Chinese Abstract

Parallel systems have been used to process massive data and complex scientific computing problems, improving on the slow execution speed of single-machine programs. With advances in computer hardware architectures, a single cluster, multiple clusters, or a multi-cluster environment composed of multi-core machines can all serve as a parallel computing system. For a parallel program with multiple computation phases, appropriate data distribution is the most important factor affecting the program's performance on a parallel system. Because the computation phases differ from one another, the optimal data distribution required by each phase also differs. During data redistribution, scheduling algorithms play an important role in parallel computing, serving to achieve optimal load balance, optimal data placement and reduced communication cost at runtime. Therefore, in this study we propose a set of effective formulas for resolving communication messages, three communication scheduling algorithms for single-cluster systems, multiple-cluster systems and multi-cluster systems composed of multi-core machines, and a power saving technique, to solve the problems faced by irregular data redistribution. The message resolution formulas provide much information before scheduling, such as the source and destination of each message, and every node in the parallel system can use them to obtain the needed information simply, effectively and independently. On a single-cluster system, the efficient scheduling algorithm provides schedules with minimal scheduling steps and lower communication cost for heterogeneous environments. Multiple-cluster systems have complex networks and computing devices; to adapt to this architecture and remedy the shortcomings of existing scheduling methods, we also propose a new scheduling method. This method classifies communications into three types, namely those occurring within a single node, between two nodes of the same cluster, and between nodes of two different clusters, and it also alleviates the communication synchronization problem of data redistribution. In addition, no previous research on irregular data redistribution has considered multi-cluster systems composed of multi-core machines or power saving. Therefore, communication types are redefined in this study; while the new scheduling algorithm derives schedules, the power saving method provides timely voltage control for each core of this complicated system to achieve better energy savings.

Keywords: cluster and multi-core parallel computing systems, irregular data redistribution, load balancing, scheduling algorithms, power saving


ABSTRACT

Parallel computing systems have been used to solve complex scientific problems with aggregates of data such as arrays to extend sequential programming languages. With the improvement of hardware architectures, a parallel system can be a cluster, multiple clusters or a multi-cluster with multi-core machines. Under this paradigm, appropriate data distribution is critical to the performance of each phase in a multi-phase program. Because the phases of a program are different from one another, the optimal distribution changes with the characteristics of each phase, as well as with those of the following phase. In order to achieve good load balancing, improved data locality and reduced inter-processor communication at runtime, data redistribution is critical during operation. In this study, formulas for message generation, three scheduling algorithms for single clusters, multiple clusters and multi-cluster systems with multi-core machines, and a power saving technique are proposed to solve the problems of GEN_BLOCK redistribution. The formulas for message generation provide much of the information about sources, destinations and data that is needed before scheduling algorithms can give effective results; each node can use the formulas to obtain this information simply, effectively and independently. An effective scheduling algorithm for cluster systems is proposed for heterogeneous environments. It not only guarantees minimal schedule steps but also shortens communication cost. Multi-cluster computing provides complex networks and heterogeneous processors for performing GEN_BLOCK redistribution. To adapt to this architecture, a new scheduling algorithm is proposed to provide better results in terms of communication cost. This technique classifies transmissions among clusters into three types and schedules the transmissions inside a node together to avoid synchronization delay. When multi-core machines are employed as part of parallel systems, it is doubtful whether present scheduling algorithms deliver good performance. In addition, efficient power saving techniques have not been considered by any scheduling algorithm. Therefore, four kinds of transmission time are designed for messages to increase scheduling efficiency. While the proposed scheduling algorithm is performed, the power saving technique is also executed to evaluate the voltage value to save energy for each core on the complicated system.

Keywords: Multiple cluster and multi-core parallel computing system, GEN_BLOCK redistribution, Load balancing, Scheduling algorithm, Power saving.


Acknowledgements

First of all, I would like to thank my advisor, Prof. Ching-Hsien Hsu. Prof. Hsu is a conscientious and careful scholar, and he provided many helpful suggestions not only for the thesis but also for my life. I am fortunate to be Prof. Hsu's Ph.D. student.

I wish to express my sincere appreciation to the oral examination committee members for devoting great patience and time to my oral presentation.

My cordial appreciation goes to my lab members, who always gave me support while I worked on this thesis.

Finally, I would like to thank my girlfriend, Chien-Hui Lin, and my family for giving me great support, showing tolerance for me, and giving me the strength to get through these five years.


Table of Contents

Chinese Abstract ... i

English Abstract ... ii

Acknowledgements ... iv

Table of Contents ... v

List of Tables ... vii

List of Figures ... viii

1 Introduction ... 1

1.1 Motivations ... 1

1.2 Related Works ... 3

1.3 Achievements and Contribution of the Thesis ... 8

1.4 Organization of the Thesis ... 9

2 Preliminaries ... 10

3 Communication Sets Generation ... 13

3.1 Introduction ... 13

3.2 Related Works ... 15

3.3 Communication Sets Generation for Single Dimensional Array... 15

3.4 Communication Sets Generation for Multi-Dimensional Array... 20

3.5 Performance Analysis ... 23

3.6 Summary ... 25

4 Communication Scheduling for Cluster ... 26

4.1 Introduction ... 26

4.2 Related Works ... 27

4.3 Two-Phase Degree-Reduction Scheduling Algorithm ... 27

4.4 Lemmas and Analysis ... 34

4.5 Performance Evaluation ... 38

4.5.1 Simulation Results ... 38

4.5.2 Experimental Results ... 41

4.6 Summary ... 44

5 Communication Scheduling for Grid ... 45

5.1 Introduction ... 45

5.2 Related Works ... 46

5.3 Local Message Reduction and Inter Message Amplification ... 47

5.4 Multiple Level Scheduling Algorithm ... 51

5.5 Theoretical Analysis ... 54

5.6 Performance Evaluation ... 57


5.7 Summary ... 64

6 Energy Efficient Scheduling for Multi-Core System ... 65

6.1 Introduction ... 65

6.2 Related Works ... 66

6.3 Quad-Core CPU System Structure ... 66

6.4 Dynamic Voltage Communication Scheduling Technique ... 68

6.5 Performance Evaluation ... 73

6.6 Summary ... 81

7 Conclusions and Future Work ... 82

Current Publication Lists ... 84

References... 88


List of Tables

Table 5.1: Configurations of data sets ... 57

Table 6.1: Configurations of data sets ... 74


List of Figures

Figure 2.1: A bipartite graph representing communication patterns ... 10

Figure 2.2: A bipartite graph with Degreemax=3 ... 11

Figure 2.3: The global index of array for given schemes ... 12

Figure 3.1: Data distribution schemes for source and destination processors ... 16

Figure 3.2: 1-D array decomposition ... 17

Figure 3.3: 2-D array decomposition with source distribution scheme {2, 6}*{2, 4} ... 20

Figure 3.4: 2-D array decomposition ... 21

Figure 3.5: Area indicated by equation (3.9) on a two dimensional array ... 22

Figure 3.6: In centralized process, n*n pairs of communications are obtained by one node ... 23

Figure 3.7: In decentralized process, SP0 obtains at most n pairs of communications ... 24

Figure 3.8: In decentralized process, DP0 obtains at most n pairs of communications... 24

Figure 3.9: The time complexity of four conditions ... 25

Figure 4.1: A directed bipartite graph represents the relation between SP and DP ... 29

Figure 4.2: First phase of TPDR ... 30

Figure 4.3: Second phase of TPDR ... 31

Figure 4.4: Consecutive bipartite graphs ... 35

Figure 4.5: Statistics of comparisons ... 39

Figure 4.6: The summation of total cases ... 40

Figure 4.7: Results with different data size range of each node ... 43

Figure 5.1: Transmissions with LAT, RAT and DAT in the grid system ... 48

Figure 5.2: The transmitting rate of LAT, RAT and DAT in environment I ... 49

Figure 5.3: The transmitting rate of LAT, RAT and DAT in environment II ... 49

Figure 5.4: Communication patterns with modified cost ... 51

Figure 5.5: A bipartite graph representing communications between source and destination processors ... 53

Figure 5.6: A schedule of TPDR ... 53

Figure 5.7: A schedule of TPDR with modified cost ... 54

Figure 5.8: A schedule which is given by MLS ... 54

Figure 5.9: A bipartite graph with Degreemax = 4 ... 56

Figure 5.10: Comparisons of MLS and TPDR with NC = 3 ... 59

Figure 5.11: Comparisons of MLS and TPDR with NC = 4 ... 61

Figure 5.12: Comparisons of MLS and TPDR with NC = 5 ... 63

Figure 6.1: Intel Quad-Core CPU system structure ... 67


Figure 6.2: AMD Quad-Core CPU system structure ... 68

Figure 6.3: The architecture of multi-core machines in two clusters in the grid system ... 71

Figure 6.4: A schedule of DVC ... 72

Figure 6.5: Values of Rm in each step ... 73

Figure 6.6: Comparisons of DVC and TPDR with NC=8. ... 76

Figure 6.7: Comparisons of DVC and TPDR with NC=16 ... 77

Figure 6.8: Improvement of DVC on power consumption with NC = 8 ... 79

Figure 6.9: Improvement of DVC on power consumption with NC = 16 ... 80


Chapter 1 Introduction

Parallel computing systems such as PC clusters provide powerful computational ability to solve various kinds of scientific problems. A complex scientific problem may consist of several computation phases; each phase executes a different algorithm and may require a different set of data. In order to achieve better load balance and communication locality, it is necessary to redistribute data according to the desired data distribution scheme. Generally, data distribution can be classified into regular distribution and irregular distribution. Regular distribution employs BLOCK, CYCLIC and BLOCK-CYCLIC(c) to specify array decomposition. Irregular distribution uses a user-defined function, such as GEN_BLOCK, to specify uneven array decomposition.

1.1 Motivations

Research on GEN_BLOCK redistribution is generally classified into index generation and scheduling algorithms. Much information about sources, destinations and data is needed before scheduling algorithms can give efficient results. This thesis provides multi-dimensional index generation formulas for GEN_BLOCK redistribution, because little previous research has addressed them. Using the proposed formulas, each node can obtain the information of the messages it exchanges with others, including source node, destination node and data size. The formulas are simple, and nodes can derive this information independently.

Efficient scheduling algorithms keep an eye on load balancing and provide schedules with low communication cost. They help shorten the total communication time and avoid synchronization delay among processors. The estimated cost of a schedule determines the performance of the algorithm. In order to reduce this cost, many scheduling algorithms have been proposed to give better schedules for GEN_BLOCK redistribution. In this thesis, an efficient and simple method is proposed for single-cluster computing platforms, since previous methods have their own shortcomings. The proposed method can be applied in heterogeneous environments, guarantees minimal schedule steps and shortens communication cost effectively.

The properties of the network and the heterogeneity of processors deserve close attention as multi-cluster computing becomes popular. It is hard for present scheduling algorithms to derive effective results in such environments because they were not designed for this architecture, so their performance on multi-cluster systems is in doubt. Therefore, a new scheduling technique is proposed to adapt to these complex parallel systems. This technique classifies transmissions into three types: transmissions inside a node, across nodes, and across clusters. Messages of the first type are relatively small in practice, but not at the algorithm level; scheduling them naively causes synchronization delay, because such a transmission finishes while the largest message of the same step is still on the way, leaving processors idle. The proposed technique schedules the transmissions inside a node together to avoid this synchronization delay.

With the improvement of hardware architectures, multi-core computing nodes have recently become popular in parallel systems. Plenty of research has designed scheduling algorithms for GEN_BLOCK redistribution on single-core parallel systems, but it is doubtful whether algorithms designed for that scenario adapt to multi-core parallel systems. In addition, efficient power saving techniques have not been considered together with any scheduling algorithm. Applying such algorithms on a huge parallel system (e.g. a grid system) therefore wastes a great deal of time as well as energy.

Therefore, an energy efficient scheduling technique is proposed for performing GEN_BLOCK redistribution on multi-core parallel systems. The proposed method designs four kinds of transmission time and a scheduling policy for messages to increase scheduling efficiency. With the derived schedules, evaluated voltage values are given to reduce power consumption for each core of the complicated system.

1.2 Related Works

Many methods for performing array redistribution, transmission management and power saving have been presented in the literature. Array redistribution research has mostly been developed for regular or irregular problems [15] in multi-computer compiler techniques or runtime support techniques. Power saving research has mostly discussed working time, idle time and power waste in various architectures.

Below is a description of the related works according to these three aspects.

Techniques for regular array redistribution can, in general, be classified into three approaches: communication sets identification techniques, message packing and unpacking techniques, and communication optimizations. The first includes the PITFALLS [38] and ScaLAPACK [37] methods for index sets generation. Park et al. [35] devised algorithms for BLOCK-CYCLIC data redistribution between processor sets; Dongarra et al. [36] proposed algorithmic redistribution methods for BLOCK-CYCLIC decompositions; Zapata et al. [4] proposed parallel sparse redistribution code for BLOCK-CYCLIC data redistribution based on the CRS structure. Lin and Chung [31] proposed CFS and ED for sparse array distribution. The Generalized Basic-Cycle Calculation method was presented in [18]. Bai et al. [3] also constructed rules for determining the communication sets of general data redistribution instances. In [2], they also proposed the ECC method for a processor to pack/unpack array elements efficiently. Jeannot et al. [24] presented a comparison between different existing scheduling methods. Reddy et al. [40] presented a tool which converts ScaLAPACK programs to HeteroMPI programs to run on heterogeneous networks of computers with good performance improvements.

Techniques for communication optimization, in general, use different approaches to reduce the communication overheads in a redistribution operation. Examples are the processor mapping techniques [16, 21, 22, 25] for minimizing data transmission overheads, the multiphase redistribution strategy [28] for reducing message startup cost, the communication scheduling approaches [13, 30] for avoiding node contention and the strip mining approach [46] for overlapping communication and computational overheads.

Huang and Chu [23] proposed an efficient communication scheduling method for block-cyclic redistribution. Souravlas et al. [43] presented a pipeline-based message passing strategy on torus networks. Karwande et al. [27] presented CC-MPI to support the compiled communication technique. Sudarsan et al. [44] derived a contention-free algorithm for redistributing two-dimensional block-cyclic arrays over 2-D processor grids on a prototype framework, ReSHAPE. Data distributed in symmetrical matrix format requires efficient decomposition techniques; Hsu et al. [19] proposed a node replacement scheme to reduce the overall cost.

In irregular array redistribution, some studies have concentrated on indexing and message generation while others have addressed communication efficiency. Guo et al. [17] presented a symbolic analysis method for communication set generation and for reducing the communication cost of irregular array redistribution. In communication efficiency, Lee et al. [29] presented a logical processor reordering algorithm for GEN_BLOCK; six algorithms for reducing communication cost are discussed. Guo et al. [47] proposed a divide-and-conquer algorithm for performing GEN_BLOCK redistribution. In this method, communication messages are first divided into groups using the Neighbor Message Set (NMS); messages with the same sender or receiver are grouped into one NMS. The communication steps are scheduled after merging the NMSs with no contention. In [52], a relocation algorithm is proposed by Yook and Park. The relocation algorithm consists of two scheduling phases, the list scheduling phase and the relocation phase. The list scheduling phase sorts global messages and arranges them into communication steps in decreasing order. Because it is a conventional sorting operation, list scheduling is not a complex algorithm. In the case of a contention, the relocation phase performs rescheduling for the conflicting messages: while the algorithm is in the relocation phase, an appropriate location is allocated for the messages that could not be scheduled in the list scheduling phase. This leads to high scheduling overheads and degrades the performance of a redistribution algorithm. Wang et al. [48] proposed an improved algorithm based on the relocation algorithm. Similar to the relocation algorithm, their approach also has two phases, but the second phase is not activated for every conflicting message; instead, it allocates conflicting messages after the first phase is finished.

In other words, the first phase allocates the messages that will not result in contention, and then the algorithm flows to the second phase. By combining the divide-and-conquer and relocation algorithms, this algorithm provides more choices for redirecting messages and reduces the relocation cost. The authors especially mentioned that obtaining the optimal schedule is an NP-complete problem. The multiple phases scheduling algorithm was proposed by Hsu et al. [20] to manage data segment assignment and communications to enhance performance. Yu [53] defined a 2-approximation algorithm for MECP, CMECP and SCMECP, which can be used to model data redistribution problems. Cohen et al. [12] studied the KPBS problem, which is proven to be NP-hard; in this research, at most k communications may be performed at the same time between clusters interconnected by a backbone. Chang et al. [10] proposed HCS and HRS to improve the solutions of data access on data grids. Rauber et al. [39] presented a data redistribution library for multiprocessor task programming to split processor groups and to map tasks to processor groups.

For larger-scale systems such as grids and clouds, Assuncao et al. [1] evaluated six scheduling strategies by investigating them with the use of Cloud resources. Grounds et al. [14] tried to satisfy the requests of numerous clients who submit workflows with deadlines; the authors proposed a cost-minimizing scheduling framework to compact the cost of workflows in a Cloud. Liu et al. [33] proposed GridBatch to solve large-scale data-intensive batch problems in the Cloud; GridBatch achieved good performance in Amazon's EC2 computing Cloud, which represented a real client. Wee et al. [49] proposed a client-side load balancing architecture to improve existing techniques in the Cloud; the proposed architecture met QoS requirements and handled failures immediately. Yang et al. [51] proposed a six-step scheduling algorithm for transaction-intensive cost-constrained Cloud workflows, which provided lower execution cost under various workflow deadlines. Chang et al. [9] proposed a Balanced Ant Colony Optimization algorithm to minimize the makespan of jobs for computing grids; the proposed method simulates the behavior of ants to keep the whole system load-balanced. The desktop grid computing environment has a natural limitation in that volunteers can leave at any time, which results in unexpected job suspension and unacceptable reliability; Byun et al. [7] proposed the Markov Job Scheduler Based on Availability (MJSA) to address these limitations. Brucker et al. [5, 6] proposed an algorithm for preemptive scheduling of jobs with given identical release times on identical parallel machines; the authors show that the problem can be solved in polynomial time, but that it is unary NP-hard when processing times are not identical in the general case.

To efficiently utilize system resources and overcome the lack of scalability and adaptability of existing mechanisms, Castillo et al. [8] considered two aspects of advance reservations: investigating their influence on the resource management system and developing a heterogeneity-aware scheduling algorithm. Temporal fragmentation may occur when assuring the availability of resources at a future time. To utilize the non-continuous time slots of processors and schedule jobs with chain dependencies, Lin et al. [32] proposed a greedy algorithm and a dynamic programming algorithm to find near-optimal schedules for jobs with single-chain and multiple-chain dependencies, respectively. Although much research has addressed replica placement in parallel and distributed systems, most of it gave little or no consideration to quality-of-service [26, 45]. Cheng et al. [11] proposed a more realistic model for this issue, in which various kinds of costs for data replication are considered and the capacity of each replica server is limited. Because QoS-aware replica placement is NP-complete, two heuristic algorithms are proposed to approximate the optimal solution and adapt to various parallel and distributed systems. Wu et al. [50] proposed two algorithms and a new model for three issues concerning data replica placement in hierarchical data grids that can be represented as tree structures. The first algorithm ensures load balance among replicas by finding the optimal locations to balance the workload; the second minimizes the number of replicas when the maximum workload capacity of each replica server is known. In the new model, a quality-of-service guarantee must be given for each request.

Genetic Algorithms have been applied to make dynamic and continuous improvements in power saving methodology [41]: through a software-based methodology, routing paths are modified while link speed and working voltage are monitored and adjusted at the same time to reduce power consumption. Zhou et al. [54] proposed a temperature-aware register reallocation method to balance the processor workload by node remapping for thermal control. Peripheral devices of a computer, such as the disk subsystem, are high-energy-consumption hardware. Current operating systems provide hardware power saving techniques [56, 60] to utilize and control CPU voltage and frequency. Son et al. [42] studied how to implement disk energy optimization at the compiler level. They developed a benchmark system and a real test environment to verify physical results; disk start time, idle time, spindle speed, the disk access frequency of the program and the number of CPUs/cores of each host were taken into consideration. Some groups have studied power saving strategies in data centers such as database or search engine servers.

Meisner et al. [34] provided a hardware-based solution that detects power waste during idle time and designed a novel power supply operation method. This approach was applied in enterprise-scale commercial deployments and saved 74% of power consumption.

1.3 Achievements and Contribution of the Thesis

In this thesis, a series of research results is provided to improve the performance of GEN_BLOCK redistribution: a series of formulas for array decomposition, three scheduling methods for various computing platforms, and a power saving technique. At first, a series of formulas is provided to decompose multi-dimensional arrays; thus, nodes in the parallel system can obtain data exchanging information, namely sizes, source nodes and destination nodes, simply and independently. Second, a communication scheduling heuristic that minimizes communication cost is provided to improve scheduling heuristics for GEN_BLOCK redistribution. With the advancement of networks, joining clusters has become a trend for performing GEN_BLOCK redistribution; complex network attributes and node heterogeneity require new scheduling algorithms instead of present methods.

Therefore, an optimization technique and a scheduling method for multiple-cluster systems are proposed with attention to load balancing and better communication cost.

With the improvement of computer architectures, novel machines with multi-core CPUs have become new elements of parallel systems. This thesis provides an energy efficient scheduling method to help perform GEN_BLOCK redistribution on such platforms. It reduces energy consumption during data exchange among processors by controlling the voltage of each core. All techniques proposed in this thesis are novel, simple and efficient; details of each method are described in the following chapters. They substantially help improve the performance of GEN_BLOCK redistribution.

1.4 Organization of the Thesis

The rest of this thesis is organized as follows. Chapter 2 provides preliminaries such as notations and terminology. In Chapter 3, formulas are provided to decompose multi-dimensional arrays before scheduling; information on data size, source node and destination node can be obtained simply and independently by each node. Chapter 4 proposes a two-phase scheduling heuristic to improve the communication cost. Chapter 5 introduces various kinds of transmissions and an optimization technique to adapt to the heterogeneity of the network, and proposes an efficient scheduling algorithm to avoid synchronization delay among clusters. With the improvement of computers, a novel energy efficient scheduling technique is proposed for multi-core-CPU-based grid systems in Chapter 6. Finally, conclusions and future work are given in Chapter 7.


Chapter 2 Preliminaries

When performing GEN_BLOCK redistribution among processors in a grid system, much information is needed, such as the data to exchange, the source and destination nodes of each message, the generated data segments and the redistribution schedule. To simplify the presentation, the notations and terminology used in this thesis are defined as follows:

Definition 1: Given a GEN_BLOCK redistribution on a 1-D array A[1:N] over P processors, SPi denotes the source processors of array elements A[1:N]; DPi denotes the destination processors of array elements A[1:N], where 0 ≤ i ≤ P−1.

Definition 2: Given a bipartite graph G = (V, E) to represent the communication patterns of a GEN_BLOCK redistribution on A[1:N] over P processors, the vertices of G represent the source and destination processors. Figure 2.1 gives an example of a bipartite graph representing the communication patterns between four source and four destination processors with seven messages to be communicated.


Figure 2.1: A bipartite graph representing communication patterns.


Definition 3: Given a directed bipartite graph G = (V, E), Degreemax denotes the maximal in-degree (or out-degree) of the vertices in G. For example, the bipartite graph shown in Figure 2.2 has Degreemax = 3, which is equal to the out-degree of SP0 and the in-degree of DP3.


Figure 2.2: A bipartite graph with Degreemax=3.
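Degreemax can be computed directly from the message list; it also lower-bounds the number of communication steps of any schedule, since a processor sends or receives at most one message per step. A minimal sketch in Python (the message list here is hypothetical, chosen only so that SP0's out-degree and DP3's in-degree are 3, as in Figure 2.2):

```python
from collections import Counter

def degree_max(messages):
    """Largest in- or out-degree over all processors for a list of (src, dst) messages."""
    out_deg = Counter(src for src, dst in messages)
    in_deg = Counter(dst for src, dst in messages)
    return max(max(out_deg.values()), max(in_deg.values()))

# Hypothetical pattern: SP0 sends three messages, DP3 receives three messages.
pattern = [(0, 0), (0, 1), (0, 3), (1, 1), (1, 3), (2, 2), (2, 3)]
print(degree_max(pattern))  # 3, so no schedule for this pattern has fewer than 3 steps
```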

Definition 4: Given source and destination distribution schemes, SPLi and SPUi denote the lower bound and upper bound of SPi when mapped onto the global index, where 0 ≤ i ≤ P−1. DPLj and DPUj denote the lower bound and upper bound for DPj, where 0 ≤ j ≤ P−1.

Definition 5: For an SPi, the sending messages start at processor STSi and end at processor STEi, where STSi = { j | DPLj ≤ SPLi ≤ DPUj } and STEi = { j | DPLj ≤ SPUi ≤ DPUj }.

Definition 6: For a DPi, the receiving messages start at processor RTSi and end at processor RTEi, where RTSi = { j | SPLj ≤ DPLi ≤ SPUj } and RTEi = { j | SPLj ≤ DPUi ≤ SPUj }.

Definition 7: TNSTi and TNRTi are defined as total number of sending messages of SPi and receiving messages of DPi, respectively.

Examples of the above definitions are given below. With {20, 50, 30} as the source distribution scheme and {30, 20, 50} as the destination distribution scheme, the global index of the array for both schemes is illustrated in Figure 2.3. SPL1 and SPU1 are 21 and 70, while DPL2 and DPU2 are 51 and 100. For SP1, STS1 is 0 and STE1 is 2 because the index range of SP1 overlaps DP0, DP1 and DP2. By the same argument, RTS2 is 1 and RTE2 is 2 because the index range of DP2 overlaps SP1 and SP2. TNST1 is 3 for SP1, while TNRT2 is 2 for DP2.

[Global index of the array: destination boundaries 0, 30, 50, 100 for DP0, DP1, DP2; source boundaries 0, 20, 70, 100 for SP0, SP1, SP2.]

Figure 2.3: The global index of array for given schemes.
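Definitions 4 through 7 and the worked example above can be reproduced with a short sketch (Python used purely for illustration, with 1-based global indices as in the example):

```python
def bounds(scheme):
    """Map a distribution scheme onto 1-based index bounds, e.g. [20, 50, 30] -> [(1, 20), (21, 70), (71, 100)]."""
    out, start = [], 1
    for size in scheme:
        out.append((start, start + size - 1))
        start += size
    return out

src = bounds([20, 50, 30])   # (SPL_i, SPU_i) for each SP_i
dst = bounds([30, 20, 50])   # (DPL_j, DPU_j) for each DP_j

def sts_ste(i):
    """STS_i, STE_i and TNST_i for source processor SP_i (Definitions 5 and 7)."""
    spl, spu = src[i]
    sts = next(j for j, (dpl, dpu) in enumerate(dst) if dpl <= spl <= dpu)
    ste = next(j for j, (dpl, dpu) in enumerate(dst) if dpl <= spu <= dpu)
    return sts, ste, ste - sts + 1

def rts_rte(i):
    """RTS_i, RTE_i and TNRT_i for destination processor DP_i (Definitions 6 and 7)."""
    dpl, dpu = dst[i]
    rts = next(j for j, (spl, spu) in enumerate(src) if spl <= dpl <= spu)
    rte = next(j for j, (spl, spu) in enumerate(src) if spl <= dpu <= spu)
    return rts, rte, rte - rts + 1

print(sts_ste(1))  # (0, 2, 3): SP1 sends to DP0..DP2, so TNST_1 = 3
print(rts_rte(2))  # (1, 2, 2): DP2 receives from SP1..SP2, so TNRT_2 = 2
```

Because each processor only scans the bounds lists, every node can derive its own communication set independently, without a centralized computation.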

Definition 8: Performing GEN_BLOCK redistribution on P processors between clusters in a grid, the value NC represents the number of nodes in every Ci, where Ci represents the ID of a cluster and 0 ≤ i ≤ ⌊P/NC⌋. SpeedCi,j represents the transmitting rate from Ci to Cj.

Definition 9: Given a directed bipartite graph G = (V, E), SPi and DPj ∈ V, where 0 ≤ i, j ≤ P−1; SPi and DPj ∈ Cr, where 0 ≤ r ≤ ⌊P/NC⌋; message mk ∈ E, where 1 ≤ k ≤ 2P−1. While SPi sends a message mk to DPj between clusters Cr, mk represents a message of local data access if i = j; mk represents a message of distant data access if SPi and DPj belong to different Cr; otherwise, mk represents remote data access.
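Definition 9's classification can be sketched as follows, assuming for illustration that node IDs are assigned cluster by cluster, so node i belongs to cluster i // NC (NC = 4 is an arbitrary example value, not one fixed by the thesis):

```python
NC = 4  # nodes per cluster (illustrative value)

def classify(i, j, nc=NC):
    """Classify message m_k sent from SP_i to DP_j in the spirit of Definition 9."""
    if i == j:
        return "local"    # same node: local data access
    if i // nc == j // nc:
        return "remote"   # different nodes of the same cluster
    return "distant"      # nodes of two different clusters

print(classify(2, 2))  # local
print(classify(1, 3))  # remote
print(classify(1, 5))  # distant
```

Chapter 5's scheduling algorithm treats these three classes differently, since their transmitting rates differ.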


Chapter 3

Communication Sets Generation

3.1 Introduction

Parallel computing systems have been used to solve complex scientific problems with aggregates of data such as arrays to extend sequential programming languages. Under this paradigm, appropriate data distribution is critical to the performance of each phase in a multi-phase program. Because the phases of a program are different from one another, the optimal distribution changes with the characteristics of each phase, as well as with those of the following phase. In order to achieve good load balancing, improved data locality and reduced inter-processor communication at runtime, data redistribution is critical during operation.

Data distribution can generally be classified as either regular or irregular. Regular distribution employs BLOCK, CYCLIC, or BLOCK-CYCLIC(c) to specify array decomposition with equal data segments while irregular distribution uses user-defined functions to specify uneven array distribution. Thus, algorithms for regular distribution cannot be applied in irregular distribution.

Irregular distribution maps unequally sized and continuous segments of an array onto processors. This makes it possible to provide the appropriate data quantity to processors with different computation capabilities. High Performance Fortran version 2 (HPF2) provides the GEN_BLOCK distribution format to facilitate generalized block distributions.

The GEN_BLOCK directive distributes unequally sized segments to processors to fit the requirements of irregular distribution. The following is a code fragment of HPF-2 for GEN_BLOCK distribution and redistribution.

PARAMETER (S = /5, 24, 23, 7, 31, 3/)

!HPF$ PROCESSORS P(6)

REAL A(93), new (6)

!HPF$ DISTRIBUTE A (GEN_BLOCK(S)) onto P

!HPF$ DYNAMIC

new = /18, 20, 8, 21, 9, 17/

!HPF$ REDISTRIBUTE A (GEN_BLOCK(new))

In the above code segment, S and new define the distribution schemes of the source and destination processors, respectively. The DISTRIBUTE directive decomposes array A onto 6 processors according to the S distribution scheme in the source phase.

After the REDISTRIBUTE directive, elements in array A are realigned according to the new distribution scheme.

This code fragment shows the indexing scheme issues for GEN_BLOCK redistribution.

An indexing scheme provides the information of array segments (messages) for data exchange among processors, including the source node, destination node, data size and memory location. Little research has focused on indexing schemes, although they are important for algorithm performance. An indexing technique should provide the source, destination and size of each communication, and it should obtain this information for each node independently, since centralized methods have higher time complexity.

In this study, multi-dimensional indexing methods are proposed to generate communication sets for multi-dimensional processor sets. The indexing scheme uses mathematically closed forms to generate communication sets in an efficient manner.

These characteristics are listed below:


 The proposed multi-dimensional indexing method maps the range of messages onto an array index.

 The indexing scheme provides equations for each processor to determine the specific information of its own messages. The information includes the source processor, destination processor, message size, and the numbers of sending and receiving messages.

3.2 Related Works

Techniques for communication set identification and message packing/unpacking include the PITFALLS [38] and ScaLAPACK [37] methods for index set generation. Park et al. [35] devised algorithms for BLOCK-CYCLIC data redistribution between processor sets; Dongarra et al. [36] proposed algorithmic redistribution methods for BLOCK-CYCLIC decompositions; Zapata et al. [4] proposed parallel sparse redistribution code for BLOCK-CYCLIC data redistribution based on the CRS structure. Lin and Chung [31] proposed CFS and ED for sparse array distribution. The Generalized Basic-Cycle Calculation method was presented in [18]. Bai et al. [3] also constructed rules for determining the communication sets of general data redistribution instances. The following sections give mathematically closed forms to generate communication sets for single-dimensional and multi-dimensional arrays.

3.3 Communication Sets Generation for Single Dimensional Array

To facilitate the presentation of communication sets, we first introduce the message representation scheme to specify the range, quantity, source node and destination node of a data segment. Figure 3.1 shows two distribution schemes on A[1:93] over six processors.


Distribution I (Source Processor)
SP     SP0  SP1  SP2  SP3  SP4  SP5
Size     5   24   23    7   31    3

Distribution II (Destination Processor)
DP     DP0  DP1  DP2  DP3  DP4  DP5
Size    18   20    8   21    9   17

Figure 3.1: Data distribution schemes for source and destination processors.

The values in the Size rows represent the quantity of data elements for each processor. First, the array is distributed to processors according to the scheme "Distribution I (Source Processor)". As shown in Figure 3.2 (a), the first 5 elements of the array are distributed to the first processor, P0, and the next 24 elements are distributed to the second processor, P1, and so on. When the computing phase changes, a new distribution scheme such as "Distribution II (Destination Processor)" in Figure 3.1 is given to redistribute different ranges of data segments to the processors. Figure 3.2 (b) shows the overlap of the elements distributed to processors under both schemes; processors above the array are SPi, processors below the array are DPi, and the values above SPi and below DPi are the quantities of data elements for the processors. All 5 elements of SP0 and the first 13 elements of SP1 are sent to DP0 to satisfy its requirement of 18 elements under both schemes; therefore block a represents the elements from SP0 and block b represents the elements from SP1. Figure 3.2 (b) thus shows not only the overlap but also the data-exchange relation among SPi and DPi. Figure 3.2 (c) illustrates the result of each DPi gathering its blocks from the SPi.


Figure 3.2: 1-D array decomposition. (a) Mapping parameters of "Distribution I" to the array. (b) Overlapped decomposition of "Distribution I" and "Distribution II". (c) DPi receives blocks from SPi.

Since data segments like b may be redistributed to different processors, the information of the source node and destination should be known first. In other words, a source processor should know how many data elements and which processor it will be sent to. Similarly, a destination processor should know how many data elements it will receive and from which processor.

The proposed method assumes that arrays S and D store the parameters of the source and destination distribution schemes, respectively. The first element of both arrays is 0, and the parameters are stored starting from the second element. Accordingly, S[a] = {0, 5, 24, 23, 7, 31, 3} and D[b] = {0, 18, 20, 8, 21, 9, 17}.

To simplify the presentation, a message that contains consecutive data elements


starting from the mth element and ending at the nth element, sent from SPi to DPj, can be formulated as the following equation,

msg_i^send_j = [m, n]                                                  (3.1)

where m ≤ n; m and n are the left and right indices of the message msg_i^send_j in the global array:

m = Max(1 + Σ_{a=0..i} S[a], 1 + Σ_{b=0..j} D[b])                      (3.2)

n = Min(Σ_{a=0..i+1} S[a], Σ_{b=0..j+1} D[b])                          (3.3)

In order to determine the range of each message, equations (3.2) and (3.3) provide an efficient way to calculate the starting and ending elements mapped onto the global index.

Since m is the starting element, it cannot be larger than the ending element n. As a result, any data segment [m, n] with m larger than n is dropped. Analogously to (3.1), the receiving messages of DPi are determined with equation (3.4).
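As a concrete illustration, equations (3.2) and (3.3) can be evaluated with prefix sums over the scheme arrays. The following Python sketch reproduces the example of Figure 3.2; the function name send_range and the list-based representation are illustrative, not part of the thesis.

```python
# Sketch of equations (3.2)-(3.3): the range [m, n] of the message
# that source processor SPi sends to destination processor DPj.
# S and D hold the distribution parameters with a leading 0,
# exactly as in the text: S = {0, 5, 24, 23, 7, 31, 3}.

def send_range(S, D, i, j):
    # ps[k] = S[0] + ... + S[k]; likewise pd for D (prefix sums)
    ps = [sum(S[:a + 1]) for a in range(len(S))]
    pd = [sum(D[:b + 1]) for b in range(len(D))]
    m = max(1 + ps[i], 1 + pd[j])          # equation (3.2)
    n = min(ps[i + 1], pd[j + 1])          # equation (3.3)
    return (m, n) if m <= n else None      # drop invalid segments

S = [0, 5, 24, 23, 7, 31, 3]
D = [0, 18, 20, 8, 21, 9, 17]
print(send_range(S, D, 0, 0))   # (1, 5):  SP0 -> DP0 covers A[1:5]
print(send_range(S, D, 1, 0))   # (6, 18): SP1 -> DP0 covers A[6:18]
print(send_range(S, D, 0, 1))   # None: no overlap, segment dropped
```

Each processor can run this computation by itself, which is the decentralized property analyzed in Section 3.5.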

msg_j^receive_i = [m', n']                                             (3.4)

where m' ≤ n'; m' and n' are the left and right indices of the message msg_j^receive_i in the global array:

m' = Max(1 + Σ_{a=0..j} S[a], 1 + Σ_{b=0..i} D[b])                     (3.5)

n' = Min(Σ_{a=0..j+1} S[a], Σ_{b=0..i+1} D[b])                         (3.6)

Take P0 as an example; it has one sending message and one receiving message. The sending message goes to P0 with [m, n] = [1, 5], meaning that sender P0 (SP0) sends five data units to receiver P0 (DP0). Mapped onto the global array index, this is A[1:5]. The receiving messages of receiver P0 come from P0 and P1; their ranges are [1, 5]


and [6, 18], respectively. It means that P0 receives 5 and 13 data units from P0 (SP0) and P1 (SP1), respectively. The indices are A[1:5] and A[6:18] when mapping onto the global index.

By analyzing the distribution of data blocks in Figure 3.2 (b), the numbers of sending and receiving messages can be determined. Take destination processor P1 for example: DP1 receives blocks c and d from SP1 and SP2, respectively. DPL1 is an element of SP1 and DPU1 is an element of SP2. Since the elements between DPL1 and DPU1 are initially mapped onto SP1 and SP2, the number of receiving messages for DP1 is 2. If DPL1 and DPU1 were mapped onto SP1 and SP4, there would be 4 receiving messages.

The numbers of sending messages and receiving messages can be formulated as in equations (3.7) and (3.8), respectively.

TNSTi = STEi − STSi + 1                                                (3.7)

TNRTi = RTEi − RTSi + 1                                                (3.8)

The algorithm for generating messages for processor i is given as follows.

Input: Distribution scheme of source processors and distribution scheme of destination processors.

Output: Messages, sending messages and receiving messages of processor i.

1. i = MPI_Comm_rank();
2. p = total number of processors;
3. if (i == 0)
4. { distribute the distribution schemes to all processors; }
5. Store the parameters of the schemes in S[] and D[];
6. /* Sending phase */
7. Processor i calculates its sending messages:
8. { for (j = STSi; j ≤ STEi; j++)
9.      Calculate the range of the message sent to DPj by equation (3.1);
10. }
11. /* Receiving phase */
12. Processor i calculates its receiving messages:
13. { for (j = RTSi; j ≤ RTEi; j++)
14.      Calculate the range of the message received from SPj by equation (3.4);
15. }
16. end_of_generating_messages_for_processor_i
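Assuming the closed forms above, the whole per-processor algorithm can be sketched in Python. The scan over all candidates j below stands in for the STSi/STEi and RTSi/RTEi bounds (defined earlier in the thesis) and yields the same message sets; the names messages_for and prefix are illustrative.

```python
# Sketch of the per-processor generation algorithm: processor i derives
# its own sending and receiving messages independently (decentralized).

def prefix(V):
    # prefix sums of the scheme array, V[0] = 0 as in the text
    return [sum(V[:k + 1]) for k in range(len(V))]

def messages_for(S, D, i):
    ps, pd = prefix(S), prefix(D)
    p = len(S) - 1                      # number of processors
    send, recv = [], []
    for j in range(p):
        # sending message SPi -> DPj, equations (3.1)-(3.3)
        m = max(1 + ps[i], 1 + pd[j])
        n = min(ps[i + 1], pd[j + 1])
        if m <= n:
            send.append((j, m, n))
        # receiving message DPi <- SPj, equations (3.4)-(3.6)
        m2 = max(1 + ps[j], 1 + pd[i])
        n2 = min(ps[j + 1], pd[i + 1])
        if m2 <= n2:
            recv.append((j, m2, n2))
    return send, recv

S = [0, 5, 24, 23, 7, 31, 3]
D = [0, 18, 20, 8, 21, 9, 17]
send, recv = messages_for(S, D, 0)
print(send)   # [(0, 1, 5)]               one sending message, TNST0 = 1
print(recv)   # [(0, 1, 5), (1, 6, 18)]   two receiving messages, TNRT0 = 2
```

The lengths of the returned lists match TNSTi and TNRTi from equations (3.7) and (3.8).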

3.4 Communication Sets Generation for Multi-Dimensional Array

To simplify the presentation of communication set generation for multi-dimensional arrays, we first use a two-dimensional mapping model to explain the indexing scheme. Figure 3.3 shows an example of a two-dimensional array, decomposed as {2, 6} in the first dimension and {2, 4} in the second dimension, mapped onto a 2x2 processor grid.

Figure 3.3: 2-D array decomposition with source distribution scheme {2, 6}*{2, 4}.


Each of the 4 blocks is allocated to a processor according to the source distribution scheme. Similarly to the 1-D scenario, the 2-D range of each block is defined as the combination of lower and upper bounds in each dimension. For example, the data block allocated to SP00 is [1, 2]*[1, 2], the data block [1, 2]*[3, 6] is allocated to SP10, and so on. For a data redistribution instance, the destination distribution scheme is defined as {6, 2}*{3, 3}, and the decomposition is illustrated in Figure 3.4 (a). Each block is still allocated to one processor, but with a different area. In order to illustrate the redistributed messages, Figure 3.4 (b) overlaps the source distribution (scheme-I) and the destination distribution (scheme-II) with solid lines and dotted lines, respectively. New blocks are created by the two kinds of lines to represent messages. Processors transmit the generated messages to their destination processors in accordance with scheme-II.

Figure 3.4: 2-D array decomposition. (a) Destination distribution scheme {6, 2}*{3, 3}. (b) Overlapped decomposition.

Similarly to the 1-D indexing scheme, array sets are required for multi-dimensional indexing schemes. Thus, Sx and Dx store the parameters for source and destination distribution schemes for the first dimension; Sy and Dy are for the second dimension.

Accordingly, given an S = {Sx(), Sy()} to new = {Dx(), Dy()} 2-D GEN_BLOCK array redistribution, messages for source processor SPxy to destination processor DPx’y’ can be


formulated as in the following equation, which is the 2-D indexing equation used to calculate the 2-D range of messages:

msg_xy^send_x'y' = [r, s]*[t, u]                                       (3.9)

where r ≤ s and t ≤ u;

r = Max(1 + Σ_{a=0..y} Sy[a], 1 + Σ_{b=0..y'} Dy[b])                   (3.10)

s = Min(Σ_{a=0..y+1} Sy[a], Σ_{b=0..y'+1} Dy[b])                       (3.11)

t = Max(1 + Σ_{a=0..x} Sx[a], 1 + Σ_{b=0..x'} Dx[b])                   (3.12)

u = Min(Σ_{a=0..x+1} Sx[a], Σ_{b=0..x'+1} Dx[b])                       (3.13)

Figure 3.5 illustrates the area indicated by equation (3.9) mapped onto a two-dimensional array. According to the given distribution schemes, Sx[a] = {0, 2, 4}, Sy[a] = {0, 2, 6}, Dx[b] = {0, 3, 3} and Dy[b] = {0, 6, 2}. The data blocks that SP01 sends to DP00, DP01, DP10 and DP11 are [3, 6]*[1, 2], [7, 8]*[1, 2], [3, 6]*[4, 2] and [7, 8]*[4, 2], respectively. However, only [3, 6]*[1, 2] and [7, 8]*[1, 2] are valid due to the validity conditions r ≤ s and t ≤ u.

Figure 3.5: Area indicated by equation (3.9) on a two-dimensional array.
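The 2-D indexing equations can be sketched in Python and checked against the SP01 example above; the names msg_2d and prefix are illustrative, not part of the thesis.

```python
# Sketch of the 2-D indexing equations (3.9)-(3.13) for the example of
# Figure 3.5.  Sx, Sy, Dx, Dy carry the per-dimension distribution
# parameters with a leading 0, exactly as in the text.

def prefix(V):
    return [sum(V[:k + 1]) for k in range(len(V))]

def msg_2d(Sx, Sy, Dx, Dy, x, y, x2, y2):
    """Range [r, s]*[t, u] of the message from SPxy to DPx2y2."""
    psx, psy = prefix(Sx), prefix(Sy)
    pdx, pdy = prefix(Dx), prefix(Dy)
    r = max(1 + psy[y], 1 + pdy[y2])        # equation (3.10)
    s = min(psy[y + 1], pdy[y2 + 1])        # equation (3.11)
    t = max(1 + psx[x], 1 + pdx[x2])        # equation (3.12)
    u = min(psx[x + 1], pdx[x2 + 1])        # equation (3.13)
    # equation (3.9) is valid only when r <= s and t <= u
    return ((r, s), (t, u)) if r <= s and t <= u else None

Sx, Sy = [0, 2, 4], [0, 2, 6]
Dx, Dy = [0, 3, 3], [0, 6, 2]
print(msg_2d(Sx, Sy, Dx, Dy, 0, 1, 0, 0))  # SP01 -> DP00: [3,6]*[1,2]
print(msg_2d(Sx, Sy, Dx, Dy, 0, 1, 0, 1))  # SP01 -> DP01: [7,8]*[1,2]
print(msg_2d(Sx, Sy, Dx, Dy, 0, 1, 1, 0))  # SP01 -> DP10: invalid (None)
```

Extending to higher dimensions only adds one (Max, Min) pair per dimension, which is why the closed forms generalize directly.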


3.5 Performance Analysis

To evaluate the performance of the proposed mathematically closed forms, this section analyzes the time complexity of two approaches: a centralized process and a decentralized process. In the centralized process, all communication information is obtained by one node.

Assuming there are n nodes in the system as shown in Figure 3.6, the maximum number of communications is n*n. Each arrow represents a communication, and the time complexity to obtain the information of one arrow is O(n) for a 1-D array, since n values of the distribution schemes are used in the mathematically closed forms. Therefore, the time complexity for one node to obtain all n*n pairs of communications is O(n^3) for a 1-D array.

Figure 3.6: In centralized process, n*n pairs of communications are obtained by one node.

In the decentralized process, each node obtains only the information of its own communications. Figure 3.7 shows that the source processor SP0 has at most n pairs of communications to its destinations; similarly, Figure 3.8 shows that the destination processor DP0 has at most n pairs of communications from its sources. SP0 obtains the information of n arrows, so the time complexity is O(n^2) for a 1-D array, since the time complexity for each communication is O(n). The time complexity for DP0 to obtain its n pairs of communications is also O(n^2).

Figure 3.7: In decentralized process, SP0 obtains at most n pairs of communications.

Figure 3.8: In decentralized process, DP0 obtains at most n pairs of communications.

For a 2-D array, the time complexity for each arrow is O(n^2). The 2-D mathematically closed forms provide equations (3.9)-(3.13) to obtain the information of communications; equation (3.9) requires (3.10)-(3.13) to calculate r, s, t and u for each data block. Since there are two dimensions, the time complexity is O(n^2) per arrow. Therefore, the time complexity for the centralized process with a 2-D array and n*n pairs of communications is O(n^4), while the time complexity for the decentralized process with a 2-D array is O(n^3). Figure 3.9 summarizes the time complexities of the four conditions.

Figure 3.9: The time complexities of four conditions.

3.6 Summary

In this chapter, mathematically closed forms are proposed for decomposing multi-dimensional arrays before scheduling. The information of data size, source node and destination node can be obtained simply and independently by each node. The performance analysis shows that the proposed methods are efficient and have low time complexity.


Chapter 4

Communication Scheduling for Cluster

4.1 Introduction

The code fragment given in Section 3.1 shows not only the indexing scheme issue but also the communication scheduling issue for GEN_BLOCK redistribution on a cluster.

Communication scheduling methods arrange messages to be sent or received in proper communication steps in order to shorten communication cost and achieve good load balancing. Both issues are important to algorithm performance. Compared with the indexing scheme, more research has focused on scheduling communications as an optimization problem. However, obtaining the optimal schedule for this operation is an NP-complete problem [48].

In this chapter, a communication scheduling heuristic, the two-phase degree-reduction (TPDR) technique, is proposed to minimize the communication cost of GEN_BLOCK redistribution on clusters. TPDR employs a degree-reduction iteration phase and an adjustable coloring mechanism to derive each communication step of a schedule recursively. The number of communication steps is minimized and node contention is avoided.

Characteristics of TPDR are listed below:

 TPDR is a simple and efficient scheduling method with low algorithmic complexity for performing GEN_BLOCK array redistribution on a cluster.

 TPDR avoids node contention while performing GEN_BLOCK redistribution.

 TPDR is a single-pass scheduling technique. It does not need to reschedule or reallocate messages; therefore, it is applicable to different processor groups without increasing the scheduling complexity.


4.2 Related Works

Much research has focused on communication efficiency for GEN_BLOCK redistribution. Lee et al. [29] presented a logical processor reordering algorithm for GEN_BLOCK and discussed six algorithms for reducing communication cost. Guo et al. [47] proposed a divide-and-conquer algorithm for performing GEN_BLOCK redistribution. In [52], a relocation algorithm was proposed by Yook and Park; it consists of two scheduling phases, the list scheduling phase and the relocation phase. Wang et al. [48] proposed an improved algorithm based on the relocation algorithm, and especially noted that obtaining the optimal schedule is an NP-complete problem. The multiple-phase scheduling algorithm was proposed by Hsu et al. [20] to manage data segment assignment and communications to enhance performance. Yu [53] defined a 2-approximation algorithm for MECP, CMECP and SCMECP, which can be used to model data redistribution problems. Cohen et al. [12] studied the KPBS problem, which is proven to be NP-hard; in that work, at most k communications are performed at the same time between clusters interconnected by a backbone. Chang et al. [10] proposed HCS and HRS to improve data access on data grids. Rauber et al. [39] presented a data redistribution library for multiprocessor task programming to split processor groups and to map tasks to processor groups.

4.3 Two-Phase Degree-Reduction Scheduling Algorithm

This chapter introduces a two-phase degree-reduction (TPDR) scheduling method for GEN_BLOCK redistribution on a cluster. It guarantees freedom from node contention and derives schedules with the minimal number of communication steps [47], which is equal to the


Degreemax defined in Definition 3. TPDR consists of two phases: the degree-reduction phase and the adjustable coloring mechanism phase. The first phase schedules the communications of sending or receiving processors whose degree is greater than two. In the degree-reduction phase, TPDR reduces the degree of the nodes with the maximum degree by one while scheduling the corresponding messages into an appropriate communication step. The degree reduction is performed as follows:

 Stage 1: Sort the vertices with maximum degree d by total message size in decreasing order. Assume there are k nodes with degree d; the sorted vertices are <V1, V2, …, Vk>.

 Stage 2: For each of V1, V2, …, Vk, schedule its minimum message mj = min{m1, m2, …, md}, where 1 ≤ j ≤ d, into step d. If a message of Vk has already been scheduled because it is the minimum message of another vertex, no further message of Vk is scheduled in this step, to avoid conflicts.

 Stage 3: Determine which remaining messages are smaller than the length of step d and schedule them into step d if they cause no contention.

 Stage 4: Degreemax = Degreemax − 1. If Degreemax > 2, return to Stage 1.
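The degree bookkeeping behind the stages above can be sketched as follows. This is a highly simplified illustration, not the full TPDR algorithm (the sorting of Stage 1, the Stage 3 fill-in and the coloring phase are omitted); the message list is the one derived from the schemes of Figure 3.2.

```python
# Messages are edges (SPi, DPj, size) of a bipartite graph.  Degreemax
# over senders and receivers gives the minimal number of schedule steps,
# and one reduction iteration tries to move the smallest message of each
# maximum-degree vertex into the current step when it is contention-free.
from collections import defaultdict

def degrees(msgs):
    deg = defaultdict(int)
    for sp, dp, _ in msgs:
        deg[('S', sp)] += 1
        deg[('D', dp)] += 1
    return deg

def reduce_once(msgs):
    """One degree-reduction iteration: schedule the smallest message of
    each maximum-degree vertex into the current step if contention-free."""
    deg = degrees(msgs)
    d = max(deg.values())
    step, used = [], set()
    for v, dv in sorted(deg.items(), key=lambda kv: -kv[1]):
        if dv != d:
            continue
        side, idx = v
        mine = [m for m in msgs if (m[0] if side == 'S' else m[1]) == idx]
        mk = min(mine, key=lambda m: m[2])       # smallest message of v
        if ('S', mk[0]) not in used and ('D', mk[1]) not in used:
            step.append(mk)
            used.add(('S', mk[0]))
            used.add(('D', mk[1]))
    remaining = [m for m in msgs if m not in step]
    return step, remaining

# messages of Figure 3.2: (SPi, DPj, size)
msgs = [(0, 0, 5), (1, 0, 13), (1, 1, 11), (2, 1, 9), (2, 2, 8),
        (2, 3, 6), (3, 3, 7), (4, 3, 8), (4, 4, 9), (4, 5, 14), (5, 5, 3)]
print(max(degrees(msgs).values()))   # Degreemax = 3 minimal steps
```

Each scheduled step is contention-free by construction: no sender or receiver appears in it twice.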

Definition 10: Let G’ denotes the graph after edges are removed in first phase of TPDR. |E’| denotes the number of edges in G’.

The second phase of TPDR consists of a coloring mechanism and an adjustable mechanism, both used to schedule the messages of sending or receiving processors when Degreemax is less than or equal to 2. When Degreemax is 2, the remaining bipartite graph G' may consist of several connected bipartite graphs. Since a bipartite graph with Degreemax = 2 is a linked list, the messages in each connected component can be scheduled in two steps; that is, the messages can be colored with two colors. In order to minimize the total length of the two steps, the adjustable method is used to exchange scheduling steps
