## Chung-Hua University Doctoral Dissertation

### GEN_BLOCK Data Redistribution and Optimization Techniques on Parallel, Distributed and Multi-Core Systems

### Developing GEN_BLOCK Redistribution Algorithms and Optimization Techniques on Parallel, Distributed and Multi-Core Systems

### Program: Ph.D. Program in Engineering Science; Student: D09424005, Shih-Chang Chen

### Advisor: Prof. Ching-Hsien Hsu

### September 2010

### Developing GEN_BLOCK Redistribution Algorithms and Optimization Techniques on Parallel, Distributed and Multi-Core Systems

### By

### Shih-Chang Chen

### Advisor: Prof. Ching-Hsien Hsu

### Ph.D. Program in Engineering Science, Chung-Hua University

### 707, Sec.2, WuFu Rd., HsinChu, Taiwan 300, R.O.C.

### September 2010


### Chinese Abstract

Parallel systems have been used to process massive amounts of data and complex scientific computations, improving on the slow execution of single-machine programs. With advances in computer hardware architecture, a single cluster, multiple clusters, or a multi-cluster environment composed of multi-core machines can all serve as a parallel computing system. For a parallel program with multiple computation phases, appropriate data distribution is the most important factor affecting the program's performance on a parallel system. Because the computation phases differ from one another, the optimal data distribution also differs from phase to phase. During data redistribution, scheduling algorithms play an important role in achieving the best load balance, data locality, and reduced communication cost at runtime. In this study, we therefore propose a set of efficient formulas for resolving communication messages, three communication scheduling algorithms for single-cluster, multi-cluster, and multi-core multi-cluster systems, and an energy-saving technique, to address the problems of irregular data redistribution. The message-resolution formulas provide much information before scheduling begins, such as the source and destination of each message, and every node in the parallel system can use them simply, efficiently, and independently to obtain the information it needs. On a single cluster, an efficient scheduling algorithm provides schedules with the minimal number of steps and lower communication cost for heterogeneous environments. Multi-cluster systems have complex networks and computing devices; to suit this architecture and remedy the shortcomings of existing scheduling methods, we propose a new scheduling method that classifies communications into three types: those within a single node, those between two nodes of the same cluster, and those between nodes of two different clusters. This method also alleviates the communication synchronization problem of data redistribution. Furthermore, no previous work on irregular data redistribution has addressed multi-cluster systems composed of multi-core machines or energy saving. Therefore, communication types are redefined in this study, and while the new scheduling algorithm derives schedules, the energy-saving method supplies voltage-control decisions for every core, providing better energy savings on this complex system.

Keywords: cluster and multi-core parallel computing systems, irregular data redistribution, load balancing, scheduling algorithms, energy saving


**Abstract**

Parallel computing systems have been used to solve complex scientific problems involving aggregates of data, such as arrays, extending sequential programming languages. With the improvement of hardware architectures, a parallel system can be a cluster, multiple clusters, or a multi-cluster with multi-core machines. Under this paradigm, appropriate data distribution is critical to the performance of each phase in a multi-phase program. Because the phases of a program differ from one another, the optimal distribution changes with the characteristics of each phase, as well as those of the following phase. In order to achieve good load balancing, improved data locality and reduced inter-processor communication at runtime, data redistribution is critical during operation. In this study, formulas for message generation, three scheduling algorithms for single-cluster, multi-cluster and multi-core multi-cluster systems, and a power-saving technique are proposed to solve the problems of GEN_BLOCK redistribution. The message-generation formulas provide the information about sources, destinations and data that scheduling algorithms need before they can give effective results; each node can use the formulas to obtain this information simply, efficiently and independently. An effective scheduling algorithm for a cluster system is proposed for heterogeneous environments; it not only guarantees a minimal number of schedule steps but also shortens communication cost. Multi-cluster computing adds complex networks and heterogeneous processors to GEN_BLOCK redistribution. To suit this architecture, a new scheduling algorithm is proposed that yields better results in terms of communication cost: it classifies transmissions among clusters into three types and schedules the transmissions inside a node together to avoid synchronization delay. When multi-core machines become part of a parallel system, it is doubtful that existing scheduling algorithms deliver good performance; in addition, no existing scheduling algorithm considers efficient power saving. Therefore, four kinds of transmission time are designed for messages to increase scheduling efficiency, and while the proposed scheduling algorithm runs, an efficient power-saving technique evaluates the voltage value for each core of the complicated system to save energy.

**Keywords: Multiple-cluster and multi-core parallel computing systems, GEN_BLOCK redistribution, Load balancing, Scheduling algorithm, Power saving.**


**Acknowledgements**

First of all, I would like to thank my advisor, Prof. Ching-Hsien Hsu. Prof. Hsu is a conscientious and careful scholar; he has given me many helpful suggestions, not only for this thesis but also for my life. I am fortunate to be Prof. Hsu's Ph.D. student.

I wish to express my sincere appreciation to the members of the oral examination committee for devoting great patience and time to my oral presentation.

My cordial appreciation goes to my labmates, who always supported me while I worked on this thesis.

Finally, I would like to thank my girlfriend, Chien-Hui Lin, and my family for giving me great support, showing tolerance toward me, and giving me the strength to get through these five years.


**Table of Contents**

Chinese Abstract ... i

English Abstract ... ii

Acknowledgements ... iv

Table of Contents ... v

List of Tables ... vii

List of Figures ... viii

**1 Introduction ... 1**

1.1 Motivations ... 1

1.2 Related Works ... 3

1.3 Achievements and Contribution of the Thesis ... 8

1.4 Organization of the Thesis ... 9

**2 Preliminaries ... 10**

**3 Communication Sets Generation ... 13**

3.1 Introduction ... 13

3.2 Related Works ... 15

3.3 Communication Sets Generation for Single Dimensional Array ... 15

3.4 Communication Sets Generation for Multi-Dimensional Array ... 20

3.5 Performance Analysis ... 23

3.6 Summary ... 25

**4 Communication Scheduling for Cluster ... 26**

4.1 Introduction ... 26

4.2 Related Works ... 27

4.3 Two-Phase Degree-Reduction Scheduling Algorithm ... 27

4.4 Lemmas and Analysis ... 34

4.5 Performance Evaluation ... 38

4.5.1 Simulation Results ... 38

4.5.2 Experimental Results ... 41

4.6 Summary ... 44

**5 Communication Scheduling for Grid ... 45**

5.1 Introduction ... 45

5.2 Related Works ... 46

5.3 Local Message Reduction and Inter Message Amplification ... 47

5.4 Multiple Level Scheduling Algorithm ... 51

5.5 Theoretical Analysis ... 54

5.6 Performance Evaluation ... 57

5.7 Summary ... 64

**6 Energy Efficient Scheduling for Multi-Core System ... 65**

6.1 Introduction ... 65

6.2 Related Works ... 66

6.3 Quad-Core CPU System Structure ... 66

6.4 Dynamic Voltage Communication Scheduling Technique ... 68

6.5 Performance Evaluation ... 73

6.6 Summary ... 81

**7 Conclusions and Future Work ... 82**

**Current Publication Lists ... 84**

**References ... 88**


**List of Tables**

Table 5.1: Configurations of data sets ... 57

Table 6.1: Configurations of data sets ... 74


**List of Figures**

Figure 2.1: A bipartite graph representing communication patterns ... 10

Figure 2.2: A bipartite graph with Degree_max = 3 ... 11

Figure 2.3: The global index of array for given schemes ... 12

Figure 3.1: Data distribution schemes for source and destination processors ... 16

Figure 3.2: 1-D array decomposition ... 17

Figure 3.3: 2-D array decomposition with source distribution scheme {2, 6}*{2, 4} ... 20

Figure 3.4: 2-D array decomposition ... 21

Figure 3.5: Area indicated by equation (3.9) on a two-dimensional array ... 22

Figure 3.6: In centralized process, n*n pairs of communications are obtained by one node ... 23

Figure 3.7: In decentralized process, SP_0 obtains at most n pairs of communications ... 24

Figure 3.8: In decentralized process, DP_0 obtains at most n pairs of communications ... 24

Figure 3.9: The time complexity of four conditions ... 25

Figure 4.1: A directed bipartite graph represents the relation between SP and DP ... 29

Figure 4.2: First phase of TPDR ... 30

Figure 4.3: Second phase of TPDR ... 31

Figure 4.4: Consecutive bipartite graphs ... 35

Figure 4.5: Statistics of comparisons ... 39

Figure 4.6: The summation of total cases ... 40

Figure 4.7: Results with different data size range of each node ... 43

Figure 5.1: Transmissions with LAT, RAT and DAT in the grid system ... 48

Figure 5.2: The transmitting rate of LAT, RAT and DAT in environment I ... 49

Figure 5.3: The transmitting rate of LAT, RAT and DAT in environment II ... 49

Figure 5.4: Communication patterns with modified cost ... 51

Figure 5.5: A bipartite graph representing communications between source and destination processors ... 53

Figure 5.6: A schedule of TPDR ... 53

Figure 5.7: A schedule of TPDR with modified cost ... 54

Figure 5.8: A schedule which is given by MLS ... 54

Figure 5.9: A bipartite graph with Degree_max = 4 ... 56

Figure 5.10: Comparisons of MLS and TPDR with NC = 3 ... 59

Figure 5.11: Comparisons of MLS and TPDR with NC = 4 ... 61

Figure 5.12: Comparisons of MLS and TPDR with NC = 5 ... 63

Figure 6.1: Intel Quad-Core CPU system structure ... 67

Figure 6.2: AMD Quad-Core CPU system structure ... 68

Figure 6.3: The architecture of multi-core machines in two clusters in grid system ... 71

Figure 6.4: A schedule of DVC ... 72

Figure 6.5: Values of R_m in each step ... 73

Figure 6.6: Comparisons of DVC and TPDR with NC = 8 ... 76

Figure 6.7: Comparisons of DVC and TPDR with NC = 16 ... 77

Figure 6.8: Improvement of DVC on power consumption with NC = 8 ... 79

Figure 6.9: Improvement of DVC on power consumption with NC = 16 ... 80


**Chapter 1 Introduction**

Parallel computing systems such as PC clusters provide powerful computational ability to solve various kinds of scientific problems. A complex scientific problem may consist of several computation phases; each phase is responsible for executing a different algorithm and may require a different set of data. In order to achieve better load balance and communication locality, it is necessary to redistribute data according to the desired data distribution scheme. Generally, data distribution can be classified into regular distribution and irregular distribution. Regular distribution employs *BLOCK*, *CYCLIC* and *BLOCK-CYCLIC(c)* to specify array decomposition. Irregular distribution uses a user-defined function to specify uneven array decomposition, such as GEN_BLOCK.
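As an illustration (not taken from the thesis), the two families can be contrasted in a short Python sketch: a regular BLOCK distribution cuts the array into (near-)equal segments, while GEN_BLOCK accepts a user-supplied list of segment sizes.

```python
def block_distribution(n, p):
    """Regular BLOCK: n array elements over p processors in near-equal segments."""
    base, rem = divmod(n, p)
    # The first `rem` processors get one extra element.
    return [base + (1 if i < rem else 0) for i in range(p)]

def gen_block_distribution(sizes, n):
    """Irregular GEN_BLOCK: user-defined segment sizes; must cover the array."""
    assert sum(sizes) == n, "segment sizes must sum to the array length"
    return list(sizes)

print(block_distribution(93, 6))                          # near-equal segments
print(gen_block_distribution([5, 24, 23, 7, 31, 3], 93))  # uneven segments
```

The uneven scheme here reuses the GEN_BLOCK example sizes that appear later in Chapter 3.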

**1.1 Motivations **

Research on GEN_BLOCK redistribution is generally classified into index generation and scheduling algorithms. Information about sources, destinations and data sizes is needed before scheduling algorithms can give efficient results. This thesis provides multi-dimensional index-generation formulas for GEN_BLOCK redistribution, because little past research addresses this topic. With the proposed formulas, each node can obtain the information of the messages it exchanges with others, including source node, destination node and data size. The formulas are simple, and nodes can derive this information independently.

Efficient scheduling algorithms pay attention to load balancing and provide schedules with low communication cost. They help shorten the total communication time and avoid synchronization delay among processors. The estimated cost of schedules determines the performance of the algorithms. In order to reduce this cost, many scheduling algorithms have been proposed to give better schedules for GEN_BLOCK redistribution. In this thesis, an efficient and simple method is proposed for a single-cluster computing platform, because previous methods have their own shortcomings. The proposed method can be applied in heterogeneous environments, guarantees a minimal number of schedule steps, and shortens communication cost effectively.

The properties of the network and the heterogeneity of processors demand close attention as multi-cluster computing becomes popular. It is hard for existing scheduling algorithms to derive effective results in such environments because they were not designed for this architecture, so their performance on multi-cluster systems is in doubt. Therefore, a new scheduling technique is proposed to suit these complex parallel systems. This technique classifies transmissions into three types: transmissions that occur inside a node, across nodes, and across clusters. The size of the first type is relatively small in practice, but not at the algorithmic level. Such a transmission can cause synchronization delay when it has already been delivered while the largest message is still on the way, resulting in processor idle time. The proposed technique schedules the transmissions inside a node together to avoid synchronization delay.

With the improvement of hardware architectures, multi-core computing nodes have recently become popular in parallel systems. Plenty of research has designed scheduling algorithms for GEN_BLOCK redistribution on single-core parallel systems. However, it is doubtful whether algorithms designed for that scenario adapt to multi-core parallel systems. In addition, no existing scheduling algorithm takes efficient power-saving techniques into consideration. When applying them on a huge parallel system (e.g. a grid system), a great deal is wasted, not only in time but also in energy consumption.

Therefore, an energy-efficient scheduling technique is proposed for performing GEN_BLOCK redistribution on multi-core parallel systems. The proposed method designs four kinds of transmission time and a scheduling policy for messages to increase scheduling efficiency. With the derived schedules, evaluated voltage values are assigned to reduce the power consumption of each core on the complicated system.

**1.2 Related Works **

Many methods for performing array redistribution, transmission management and power saving have been presented in the literature. Array redistribution research has mostly addressed regular or irregular problems [15] in multi-computer compiler techniques or runtime support techniques. Power-saving research has mostly discussed working time, idle time and power-waste problems in various architectures. Below is a description of the related works according to these three aspects.

Techniques for regular array redistribution can, in general, be classified into three approaches: communication sets identification techniques, message packing and unpacking techniques, and communication optimizations. The first includes the PITFALLS [38] and ScaLAPACK [37] methods for index set generation. Park et al. [35] devised algorithms for BLOCK-CYCLIC data redistribution between processor sets; Dongarra et al. [36] proposed algorithmic redistribution methods for BLOCK-CYCLIC decompositions; Zapata et al. [4] proposed parallel sparse redistribution code for BLOCK-CYCLIC data redistribution based on the CRS structure. Lin and Chung [31] proposed CFS and ED for sparse array distribution. The Generalized Basic-Cycle Calculation method was presented in [18]. Bai et al. [3] constructed rules for determining communication sets of general data redistribution instances; in [2], they also proposed the ECC method for a processor to pack/unpack array elements efficiently. Jeannot et al. [24] presented a comparison between different existing scheduling methods. Reddy et al. [40] presented a tool which converts ScaLAPACK programs to HeteroMPI programs to run on heterogeneous networks of computers with good performance improvements.

Techniques for communication optimization, in general, use different approaches to reduce the communication overheads in a redistribution operation. Examples are the processor mapping techniques [16, 21, 22, 25] for minimizing data transmission overheads, the multiphase redistribution strategy [28] for reducing message startup cost, the communication scheduling approaches [13, 30] for avoiding node contention and the strip mining approach [46] for overlapping communication and computational overheads.

Huang and Chu [23] proposed an efficient communication scheduling method for block-cyclic redistribution. Souravlas et al. [43] presented a pipeline-based message passing strategy on torus networks. Karwande et al. [27] presented CC-MPI to support compiled communication techniques. Sudarsan et al. [44] derived a contention-free algorithm for distributing two-dimensional block-cyclic arrays over 2-D processor grids on a prototype framework, ReSHAPE. Data distributed in symmetrical matrix format requires efficient techniques to decompose; Hsu et al. [19] proposed a node replacement scheme to reduce the overall cost.

In irregular array redistribution, some studies have concentrated on indexing and message generation while others have addressed communication efficiency. Guo et al. [17] presented a symbolic analysis method for communication set generation and for reducing the communication cost of irregular array redistribution. On communication efficiency, Lee et al. [29] presented a logical processor reordering algorithm for GEN_BLOCK, discussing six algorithms for reducing communication cost. Guo et al. [47] proposed a divide-and-conquer algorithm for performing GEN_BLOCK redistribution. In this method, communication messages are first divided into groups using the Neighbor Message Set (NMS); messages with the same sender or receiver are grouped into an NMS, and the communication steps are scheduled after merging the NMSs with no contention. In [52], a relocation algorithm is proposed by Yook and Park. The relocation algorithm consists of two scheduling phases: the list scheduling phase and the relocation phase. The list scheduling phase sorts global messages and arranges them into communication steps in decreasing order; because this is a conventional sorting operation, list scheduling is not a complex algorithm. In the case of contention, the relocation phase performs rescheduling for the conflicting messages, allocating an appropriate location for the messages that could not be scheduled in the list scheduling phase. This leads to high scheduling overheads and degrades the performance of a redistribution algorithm. Wang et al. [48] proposed an improved algorithm based on the relocation algorithm. Similar to the relocation algorithm, their approach also has two phases, but the second phase is not activated upon each conflicting message; instead, it allocates conflicting messages after the first phase is finished.

In other words, the first phase allocates messages that will not result in contentions and
then the algorithm flows to the second phase. Because of combining the
divide-and-conquer and relocation algorithms, this algorithm provides more choices to
redirect messages and reduce the relocation cost. The authors specifically note that obtaining the optimal schedule is an NP-complete problem. The multiple-phase scheduling algorithm was proposed by Hsu et al. [20] to manage data segment assignment and communications and thereby enhance performance. Yu [53] defined a 2-approximation algorithm for MECP, CMECP and SCMECP, which can be used to model data redistribution problems. Cohen et al. [12] studied the KPBS problem, which is proven to be NP-hard; in that work, at most k communications may be performed at the same time between clusters interconnected by a backbone.
*at the same time between clusters interconnected by a backbone. Chang et al. [10] *

Chang et al. [10] proposed HCS and HRS to improve the solutions of data access on data grids. Rauber et al. [39] presented a data redistribution library for multiprocessor task programming to split processor groups and to map tasks to processor groups.

For larger-scale systems such as grids and clouds, Assuncao et al. [1] evaluated six scheduling strategies by investigating them with the use of Cloud resources. Grounds et al. [14] tried to satisfy the requests of numerous clients who submit workflows with deadlines; the authors proposed a cost-minimizing scheduling framework to reduce the cost of workflows in a Cloud. Liu et al. [33] proposed GridBatch to solve large-scale data-intensive batch problems in the Cloud; GridBatch helped achieve good performance in Amazon's EC2 computing Cloud, which represented a real client. Wee et al. [49] proposed a client-side load balancing architecture to improve existing techniques in the Cloud; the proposed architecture met QoS requirements and handled failures immediately. Yang et al. [51] proposed a six-step scheduling algorithm for transaction-intensive, cost-constrained Cloud workflows; the proposed algorithm provided lower execution cost under various workflow deadlines. Chang et al. [9] proposed a Balanced Ant Colony Optimization algorithm to minimize the makespan of jobs for computing grids; the proposed method simulates the behavior of ants to keep the whole system load balanced. The desktop grid computing environment has a natural limitation in that volunteers can leave at any time; this characteristic results in unexpected job suspension and unacceptable reliability. Byun et al. [7] proposed the Markov Job Scheduler Based on Availability (MJSA) to address these limitations. Brucker et al. [5, 6] proposed an algorithm for preemptive scheduling of jobs with given identical release times on identical parallel machines; the authors show that the problem can be solved in polynomial time, but that it is unary NP-hard when processing times are not identical in the general case.

To efficiently utilize system resources and overcome the lack of scalability and adaptability of existing mechanisms, Castillo et al. [8] considered two aspects of advance reservations: investigating their influence on the resource management system, and developing a heterogeneity-aware scheduling algorithm. Temporal fragmentation may occur because the availability of resources at a future time must be assured. To utilize the non-continuous time slots of processors and schedule jobs with chain dependency, Lin et al. [32] proposed two algorithms: a greedy algorithm and a dynamic programming algorithm that find near-optimal schedules for jobs with single-chain and multiple-chain dependency, respectively. Although much research has addressed replica placement in parallel and distributed systems, most of it gave little or no consideration to quality-of-service [26, 45]. Cheng et al. [11] proposed a more realistic model for this issue, in which various kinds of cost for data replication are considered and the capacity of each replica server is limited. Because QoS-aware replica placement is NP-complete, two heuristic algorithms were proposed to approximate the optimal solution and suit various parallel and distributed systems. Wu et al. [50] proposed two algorithms and a new model for three issues concerning data replica placement in hierarchical data grids that can be represented as tree structures. The first algorithm ensures load balance among replicas by finding the optimal locations to balance the workload; the second minimizes the number of replicas when the maximum workload capacity of each replica server is known. In the new model, a quality-of-service guarantee for each request must be given.

Genetic Algorithms have been applied to make dynamic and continuous improvements to power-saving methodology [41]: through a software-based methodology, routing paths are modified while link speed and working voltage are monitored and adjusted at the same time to reduce power consumption. Zhou et al. [54] proposed a temperature-aware register reallocation method to balance the processor workload by node remapping for thermal control. Peripheral devices of a computer, such as the disk subsystem, consume considerable energy; current operating systems provide hardware power-saving techniques [56, 60] to utilize and control CPU voltage and frequency. Son et al. [42] studied how to implement disk energy optimization at the compiler level. They developed a benchmark system and a real test environment to verify physical results, taking into consideration disk start time, idle time, spindle speed, the disk access frequency of the program, and the CPU/core number of each host. Some groups have studied power-saving strategies in data centers such as database or search engine servers. Meisner et al. [34] provided a hardware-based solution to detect power waste during idle time and designed a novel power supply operation method; this approach was applied in enterprise-scale commercial deployments and saved 74% of power consumption.

**1.3 Achievements and Contribution of the Thesis **

In this thesis, a series of research results is provided to improve the performance of GEN_BLOCK redistribution: a series of formulas for array decomposition and three scheduling methods for various computing platforms. First, a series of formulas is provided to decompose multi-dimensional arrays, so that nodes in the parallel system can obtain data-exchange information (size, source nodes and destination nodes) simply and independently. Second, a communication scheduling heuristic that minimizes communication cost is provided to improve scheduling heuristics for GEN_BLOCK redistribution. With the advancement of networks, joining clusters has become a trend for performing GEN_BLOCK redistribution; complex network attributes and node heterogeneity require new scheduling algorithms instead of present methods. Therefore, an optimization technique and a scheduling method for multiple-cluster systems are proposed that attend to load balancing and better communication cost. With the improvement of computer architectures, machines with multi-core CPUs have become new elements of parallel systems. This thesis provides an energy-efficient scheduling method that helps perform GEN_BLOCK redistribution on such platforms; it reduces energy consumption during data exchange among processors by controlling the voltage of each core. All techniques proposed in this thesis are novel, simple and efficient. Details of each method are described in the following chapters.

**1.4 Organization of the Thesis **

The rest of this thesis is organized as follows. Chapter 2 provides preliminaries such as notations and terminology. In Chapter 3, formulas are provided to decompose multi-dimensional arrays before scheduling; information about data size, source node and destination node can be obtained simply and independently by each node. Chapter 4 proposes a two-phase scheduling heuristic to improve the communication cost. Chapter 5 introduces various kinds of transmissions and an optimization technique to suit the heterogeneity of the network, and proposes an efficient scheduling algorithm to avoid synchronization delay among clusters. With the improvement of computers, a novel energy-efficient scheduling technique is proposed for multi-core-CPU-based grid systems in Chapter 6. Finally, conclusions and future work are given in Chapter 7.


**Chapter 2 Preliminaries**

To perform GEN_BLOCK redistribution among processors in a grid system, much information is needed, such as data-exchange details, the source and destination nodes of each message, data segment generation, and the redistribution schedule. To simplify the presentation, the notations and terminology used in this thesis are defined as follows:

*Definition 1:* Given a GEN_BLOCK redistribution on a 1-D array A[1:N] over P processors, SP_i denotes a source processor of array elements A[1:N] and DP_i denotes a destination processor of array elements A[1:N], where 0 ≤ i ≤ P−1.

*Definition 2:* Given a bipartite graph G = (V, E) representing the communication patterns of a GEN_BLOCK redistribution on A[1:N] over P processors, the vertices of G represent the source and destination processors. Figure 2.1 gives an example of a bipartite graph representing the communication patterns between four source and four destination processors, with seven messages to be communicated.


**Figure 2.1: A bipartite graph representing communication patterns. **

11

*Definition 3:* Given a directed bipartite graph G = (V, E), Degree_max denotes the maximal in-degree (or out-degree) of the vertices in G. For example, the bipartite graph shown in Figure 2.2 has Degree_max = 3, which is equal to the out-degree of SP_0 and the in-degree of DP_3.


**Figure 2.2: A bipartite graph with Degree_max = 3.**
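Under Definition 3, Degree_max can be computed directly from the message list. A minimal sketch (the edge list below is hypothetical, chosen to match the shape of Figure 2.2, where SP_0 sends three messages and DP_3 receives three):

```python
from collections import Counter

def degree_max(messages):
    """messages: (src, dst) pairs of a directed bipartite graph.
    Returns the maximum over all SP out-degrees and DP in-degrees."""
    out_deg = Counter(src for src, _ in messages)
    in_deg = Counter(dst for _, dst in messages)
    return max(max(out_deg.values()), max(in_deg.values()))

# Hypothetical pattern: SP0 -> DP0, DP1, DP2; SP1, SP2, SP3 each -> DP3.
edges = [(0, 0), (0, 1), (0, 2), (1, 3), (2, 3), (3, 3)]
print(degree_max(edges))  # 3: out-degree of SP0 and in-degree of DP3
```

Degree_max matters because it lower-bounds the number of communication steps in any contention-free schedule.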

*Definition 4:* Given the source and destination distribution schemes, SPL_i and SPU_i denote the lower bound and upper bound of the global index range of SP_i, where 0 ≤ i ≤ P−1. DPL_j and DPU_j denote the lower bound and upper bound for DP_j, where 0 ≤ j ≤ P−1.

*Definition 5:* For an SP_i, the sending messages start at processor STS_i and end at processor STE_i, where STS_i = { j | DPL_j ≤ SPL_i ≤ DPU_j } and STE_i = { j | DPL_j ≤ SPU_i ≤ DPU_j }.

*Definition 6:* For a DP_i, the receiving messages start at processor RTS_i and end at processor RTE_i, where RTS_i = { j | SPL_j ≤ DPL_i ≤ SPU_j } and RTE_i = { j | SPL_j ≤ DPU_i ≤ SPU_j }.

*Definition 7:* TNST_i and TNRT_i are defined as the total number of sending messages of SP_i and receiving messages of DP_i, respectively.

Examples of the above definitions are given below. With {20, 50, 30} as the source distribution scheme and {30, 20, 50} as the destination distribution scheme, the global index ranges for both schemes are illustrated in Figure 2.3. SPL_1 and SPU_1 are 21 and 70, while DPL_2 and DPU_2 are 51 and 100. For SP_1, STS_1 is 0 and STE_1 is 2 because the index range of SP_1 overlaps DP_0, DP_1 and DP_2. By the same argument, RTS_2 is 1 and RTE_2 is 2 because the index range of DP_2 overlaps SP_1 and SP_2. TNST_1 is 3 for SP_1, while TNRT_2 is 2 for DP_2.

(Figure content: on the global index of the array, the destination processors DP_0, DP_1, DP_2 cover the boundaries 0, 30, 50, 100, and the source processors SP_0, SP_1, SP_2 cover the boundaries 0, 20, 70, 100.)

**Figure 2.3: The global index of array for given schemes.**
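The quantities in Definitions 4–7 follow directly from prefix sums of the two schemes. A sketch (not the thesis's closed-form formulas) reproducing the worked example above, with 1-indexed global indices:

```python
from itertools import accumulate

def bounds(scheme):
    """Lower/upper global-index bounds (1-indexed) for each processor."""
    upper = list(accumulate(scheme))
    lower = [1] + [u + 1 for u in upper[:-1]]
    return lower, upper

SPL, SPU = bounds([20, 50, 30])   # source scheme
DPL, DPU = bounds([30, 20, 50])   # destination scheme

def locate(bl, bu, x):
    """Index j with bl[j] <= x <= bu[j] (Definitions 5 and 6)."""
    return next(j for j in range(len(bl)) if bl[j] <= x <= bu[j])

STS1 = locate(DPL, DPU, SPL[1])   # 0
STE1 = locate(DPL, DPU, SPU[1])   # 2
TNST1 = STE1 - STS1 + 1           # 3 messages sent by SP1
RTS2 = locate(SPL, SPU, DPL[2])   # 1
RTE2 = locate(SPL, SPU, DPU[2])   # 2
print(SPL[1], SPU[1], STS1, STE1, TNST1, RTS2, RTE2)  # 21 70 0 2 3 1 2
```

The printed values match the example: SPL_1 = 21, SPU_1 = 70, STS_1 = 0, STE_1 = 2, TNST_1 = 3, RTS_2 = 1 and RTE_2 = 2.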

*Definition 8:* Consider performing GEN_BLOCK redistribution on P processors across clusters in a grid. The value NC represents the number of nodes in every C_i, where C_i is the ID of a cluster and 0 ≤ i ≤ ⌊P/NC⌋. SpeedC_{i,j} represents the transmitting rate from C_i to C_j.

*Definition 9:* Given a directed bipartite graph G = (V, E), SP_i, DP_j ∈ V, where 0 ≤ i, j ≤ P−1; SP_i and DP_j belong to clusters C_r, where 0 ≤ r ≤ ⌊P/NC⌋; and message m_k ∈ E, where 1 ≤ k ≤ 2P−1. When SP_i sends a message m_k to DP_j, m_k represents a local data access if i = j; m_k represents a distant data access if SP_i and DP_j belong to different clusters C_r; otherwise m_k represents a remote data access.
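The three message types of Definition 9 can be checked with a small helper. One assumption is made here, consistent with Definition 8: nodes are numbered consecutively, so a node's cluster is node_id // NC.

```python
def classify(src, dst, nc):
    """Classify message SP_src -> DP_dst (Definition 9) for clusters of
    nc nodes each, assuming cluster ID = node_id // nc."""
    if src == dst:
        return "local"    # source and destination are the same node
    if src // nc != dst // nc:
        return "distant"  # endpoints lie in different clusters
    return "remote"       # different nodes inside one cluster

print(classify(2, 2, 4))  # local
print(classify(1, 3, 4))  # remote: both nodes in cluster 0 when NC = 4
print(classify(1, 5, 4))  # distant: clusters 0 and 1
```

This distinction matters for scheduling because the three types travel over links with very different transmitting rates.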


**Chapter 3 Communication Sets Generation**

**3.1 Introduction **

Parallel computing systems have been used to solve complex scientific problems involving aggregates of data, such as arrays, extending sequential programming languages. Under this paradigm, appropriate data distribution is critical to the performance of each phase in a multi-phase program. Because the phases of a program differ from one another, the optimal distribution changes with the characteristics of each phase, as well as those of the following phase. In order to achieve good load balancing, improved data locality and reduced inter-processor communication during runtime, data redistribution is critical during operation.

Data distribution can generally be classified as either regular or irregular. Regular distribution employs BLOCK, CYCLIC, or BLOCK-CYCLIC(c) to specify array decomposition with equal data segments, while irregular distribution uses user-defined functions to specify uneven array distribution. Thus, algorithms for regular distribution cannot be applied to irregular distribution.

Irregular distribution maps unequally sized, contiguous segments of an array onto processors. This makes it possible to provide an appropriate data quantity to processors with different computation capabilities. High Performance Fortran version 2 (HPF2) provides the GEN_BLOCK distribution format to facilitate generalized block distributions.

The GEN_BLOCK directive distributes unequally sized segments to processors to fit the requirements of irregular distribution. The following is an HPF-2 code fragment for GEN_BLOCK distribution and redistribution.

**PARAMETER (S = /5, 24, 23, 7, 31, 3/)**

!HPF$ **PROCESSORS P(6)**

**REAL A(93), new(6)**

!HPF$ **DISTRIBUTE A (GEN_BLOCK(S)) onto P**

!HPF$ **DYNAMIC**

**new = /18, 20, 8, 21, 9, 17/**

!HPF$ **REDISTRIBUTE A (GEN_BLOCK(new))**

In the above code segment, *S* and *new* are defined as the distribution schemes of the source and destination processors, respectively. The DISTRIBUTE directive decomposes array *A* onto 6 processors according to the *S* distribution scheme in the source phase. After the REDISTRIBUTE directive, the elements in array *A* are realigned according to the *new* distribution scheme.

This code fragment shows the indexing scheme issues for GEN_BLOCK redistribution.

An indexing scheme provides information about the array segments (messages) exchanged among processors, including the source node, destination node, data size and memory location. Few studies have focused on indexing schemes, although they are important to algorithm performance. An indexing technique should provide the source, destination and size of each communication, and should let each node obtain this information independently, since centralized methods have higher time complexity.

In this study, multi-dimensional indexing methods are proposed to generate communication sets for multi-dimensional processor sets. The indexing scheme uses mathematically closed forms to generate communication sets in an efficient manner.

These characteristics are listed below:


The proposed multi-dimensional indexing method maps the range of messages onto an array index.

The indexing scheme provides equations for each processor to determine specific information about its own messages, including the source processor, destination processor, message size and the number of sending and receiving messages.

**3.2 Related Works **

Techniques for communication set identification and message packing/unpacking include the PITFALLS [38] and ScaLAPACK [37] methods for index set generation. Park et al. [35] devised algorithms for BLOCK-CYCLIC data redistribution between processor sets; Dongarra et al. [36] proposed algorithmic redistribution methods for BLOCK-CYCLIC decompositions; Zapata et al. [4] proposed parallel sparse redistribution code for BLOCK-CYCLIC data redistribution based on the CRS structure. Lin and Chung [31] proposed CFS and ED for sparse array distribution. The Generalized Basic-Cycle Calculation method was presented in [18]. Bai et al. [3] constructed rules for determining the communication sets of general data redistribution instances. The following sections give mathematically closed forms to generate communication sets for single-dimensional and multi-dimensional arrays.

**3.3 Communication Sets Generation for Single-Dimensional Array**

To facilitate the presentation of communication sets, we first introduce the message
representation scheme to specify the range, quantity, source node and destination node of a
*data segment. Figure 3.1 shows two distribution schemes on A[1:93] over six processors. *


Distribution I (Source Processor)

| *SP* | *SP*_0 | *SP*_1 | *SP*_2 | *SP*_3 | *SP*_4 | *SP*_5 |
|------|--------|--------|--------|--------|--------|--------|
| Size | 5 | 24 | 23 | 7 | 31 | 3 |

Distribution II (Destination Processor)

| *DP* | *DP*_0 | *DP*_1 | *DP*_2 | *DP*_3 | *DP*_4 | *DP*_5 |
|------|--------|--------|--------|--------|--------|--------|
| Size | 18 | 20 | 8 | 21 | 9 | 17 |

**Figure 3.1: Data distribution schemes for source and destination processors.**

Values in the Size row represent the quantity of the data segment assigned to each processor. First, the array is distributed to processors according to the scheme "Distribution I (Source Processor)". As shown in Figure 3.2 (a), the first 5 elements of the array are distributed to the first processor, *P*_0, the next 24 elements are distributed to the second processor, *P*_1, and so on. When the computing phase changes, a new distribution scheme such as "Distribution II (Destination Processor)" in Figure 3.1 is given to redistribute different ranges of data segments to processors. Figure 3.2 (b) shows the overlap of the elements distributed to processors under both schemes. In Figure 3.2 (b), the processors above the array are *SP*_i while the processors below the array are *DP*_i, and the values above *SP*_i and below *DP*_i are the element quantities for those processors. All 5 elements of *SP*_0 and the first 13 elements of *SP*_1 are sent to *DP*_0 to satisfy its requirement of 18 elements under both schemes; therefore block *a* represents the elements from *SP*_0 and block *b* represents the elements from *SP*_1. Figure 3.2 (b) thus shows not only the overlap but also the data exchange relation between *SP*_i and *DP*_i. Figure 3.2 (c) illustrates the result of each *DP*_i gathering its blocks from the *SP*_i.


**Figure 3.2: 1-D array decomposition. (a) Mapping parameters of "Distribution I" to the array. (b) Overlapped decomposition of "Distribution I" and "Distribution II". (c) *DP*_i receives blocks from *SP*_i.**

Since data segments like *b* may be redistributed to different processors, the information of the source and destination nodes should be known first. In other words, a source processor should know how many data elements it will send and to which processor. Similarly, a destination processor should know how many data elements it will receive and from which processor.

The proposed method assumes that arrays *S* and *D* store the parameters of the source and destination distribution schemes, respectively. The first element of both arrays is 0; the scheme parameters are stored starting from the second element. Accordingly, *S*[a] = {0, 5, 24, 23, 7, 31, 3} and *D*[b] = {0, 18, 20, 8, 21, 9, 17}.

To simplify the presentation, a message containing consecutive data elements that starts at the *m*th element and ends at the *n*th element, sent from *SP*_i to *DP*_j, can be formulated as the following equation:

*msg*_{i→j}^{send} = [*m*, *n*] (3.1)

where *m* ≤ *n*; *m* is the left index of the message *msg*_{i→j}^{send} in the global array and *n* is its right index.

*m* = Max(1 + Σ_{a=0}^{i} *S*[a], 1 + Σ_{b=0}^{j} *D*[b]) (3.2)

*n* = Min(Σ_{a=0}^{i+1} *S*[a], Σ_{b=0}^{j+1} *D*[b]) (3.3)
In order to determine the range of each message, equations (3.2) and (3.3) provide an efficient method to calculate the starting and ending elements mapped onto the global index. Since *m* is the starting element, it cannot be larger than the ending element *n*. As a result, any data segment [*m*, *n*] with *m* larger than *n* is dropped. Similar to (3.1), the receiving messages of *DP*_i are determined with equation (3.4).

*msg*_{j→i}^{receive} = [*m'*, *n'*] (3.4)

where *m'* ≤ *n'*; *m'* is the left index of the message *msg*_{j→i}^{receive} in the global array and *n'* is its right index.

*m'* = Max(1 + Σ_{a=0}^{j} *S*[a], 1 + Σ_{b=0}^{i} *D*[b]) (3.5)

*n'* = Min(Σ_{a=0}^{j+1} *S*[a], Σ_{b=0}^{i+1} *D*[b]) (3.6)
Let us take *P*_0 as an example; it has one sending message and two receiving messages. The sending message targets *P*_0 with [*m*, *n*] = [1, 5]; that is, sender *P*_0 (*SP*_0) sends five data units to receiver *P*_0 (*DP*_0). Mapped onto the global array index, this is A[1:5]. The receiving messages of receiver *P*_0 come from *P*_0 and *P*_1, with [*m'*, *n'*] of [1, 5] and [6, 18], respectively. That is, *P*_0 receives 5 and 13 data units from *P*_0 (*SP*_0) and *P*_1 (*SP*_1), respectively; the indices are A[1:5] and A[6:18] when mapped onto the global index.
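The closed forms (3.1)-(3.3) can be checked directly in code. The following is a minimal illustrative sketch in Python, not thesis code; the helper names `prefix` and `send_msg` are mine, and the schemes are those of Figure 3.1.

```python
def prefix(dist, k):
    """Sum of dist[0..k]; dist[0] is the leading 0 of the scheme array."""
    return sum(dist[:k + 1])

def send_msg(S, D, i, j):
    """Range [m, n] of the message SP_i sends to DP_j (equations 3.1-3.3).
    Returns None when m > n, i.e. the two processors do not overlap."""
    m = max(1 + prefix(S, i), 1 + prefix(D, j))
    n = min(prefix(S, i + 1), prefix(D, j + 1))
    return (m, n) if m <= n else None

# Scheme arrays stored with a leading 0, as described in Section 3.3.
S = [0, 5, 24, 23, 7, 31, 3]
D = [0, 18, 20, 8, 21, 9, 17]

print(send_msg(S, D, 0, 0))  # SP_0 -> DP_0: (1, 5), i.e. A[1:5]
print(send_msg(S, D, 1, 0))  # SP_1 -> DP_0: (6, 18), i.e. A[6:18]
print(send_msg(S, D, 0, 1))  # no overlap, segment dropped: None
```

The printed ranges reproduce the worked example above: *P*_0 sends A[1:5] to itself and receives A[6:18] from *P*_1.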

By analyzing the distribution of data blocks in Figure 3.2 (b), the number of sending and receiving messages can be determined. Take destination processor *P*_1 for example: *DP*_1 receives blocks *c* and *d* from *SP*_1 and *SP*_2, respectively. *DPL*_1 is an element of *SP*_1 and *DPU*_1 is an element of *SP*_2. Since the elements between *DPL*_1 and *DPU*_1 are initially mapped onto *SP*_1 and *SP*_2, the number of receiving messages for *DP*_1 is 2. If *DPL*_1 and *DPU*_1 were mapped onto *SP*_1 and *SP*_4, there would be 4 receiving messages.

The number of sending messages and receiving messages can be formulated as in equations (3.7) and (3.8), respectively.

*TNST*_i = *STE*_i − *STS*_i + 1 (3.7)

*TNRT*_i = *RTE*_i − *RTS*_i + 1 (3.8)

*The algorithm for generating messages for processor i is given as follows.*

Input: Distribution scheme of source processors and distribution scheme of destination processors.

Output: Messages, sending messages and receiving messages of processor *i*.

**1.** *i* = MPI_Comm_rank();
**2.** *p* = total number of processors;
**3.** if (*i* == 0)
**4.**   { distribute distribution schemes to processors; }
**5.** Store parameters of schemes in S[] and D[];
**6.** /* Sending phase */
**7.** Processor *i* calculates its sending messages;
**8.** { for (*j* = STS_i; *j* ≤ STE_i; *j*++)
**9.**     Calculate the range of the message sent to *DP*_j by equation (3.1);
**10.** }
**11.** /* Receiving phase */
**12.** Processor *i* calculates its receiving messages;
**13.** { for (*j* = RTS_i; *j* ≤ RTE_i; *j*++)
**14.**     Calculate the range of the message received from *SP*_j by equation (3.4);
**15.** }
**16.** end_of_generating_messages_for_processor_i
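As a cross-check of equations (3.7)-(3.8), the sending messages of one processor can be enumerated by scanning all destinations; the valid ranges implicitly recover *STS*_i, *STE*_i and *TNST*_i. This is an illustrative Python sketch under my own naming (`generate_sends` is not thesis code):

```python
def generate_sends(S, D, i):
    """All messages SP_i sends, derived from the scheme arrays alone.
    A destination j contributes a message only when the ranges overlap;
    the count of valid messages equals TNST_i = STE_i - STS_i + 1."""
    msgs = []
    for j in range(len(D) - 1):
        m = max(1 + sum(S[:i + 1]), 1 + sum(D[:j + 1]))   # equation (3.2)
        n = min(sum(S[:i + 2]), sum(D[:j + 2]))           # equation (3.3)
        if m <= n:                                        # drop m > n segments
            msgs.append((j, m, n))
    return msgs

S = [0, 5, 24, 23, 7, 31, 3]
D = [0, 18, 20, 8, 21, 9, 17]
sends = generate_sends(S, D, 1)   # SP_1 holds A[6:29]
print(sends)                      # [(0, 6, 18), (1, 19, 29)]
print(len(sends))                 # TNST_1 = 2
```

SP_1 overlaps exactly *DP*_0 and *DP*_1, so *STS*_1 = 0, *STE*_1 = 1 and *TNST*_1 = 2, consistent with equation (3.7).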

**3.4 Communication Sets Generation for Multi-Dimensional Array**

To simplify the presentation of communication sets generation with multi-dimensional arrays, we first use a two-dimensional mapping model to explain the indexing scheme. Figure 3.3 shows an example of a two-dimensional array, decomposed as {2, 6} in the first dimension and {2, 4} in the second dimension, mapped onto a 2×2 processor grid.

**Figure 3.3: 2-D array decomposition with source distribution scheme {2, 6}*{2, 4}.**


Each of the 4 blocks is allocated to a processor according to the source distribution scheme. Similar to the 1-D scenario, the 2-D range of each block is defined as the combination of its upper and lower bounds. For example, the data block allocated to *SP*_00 is [1, 2]*[1, 2], data block [1, 2]*[3, 6] is allocated to *SP*_10, and so on. For a data redistribution instance, the destination distribution scheme is defined as {6, 2}*{3, 3}, and the decomposition is illustrated in Figure 3.4 (a). Each block is again allocated to a processor, but with a different area. In order to illustrate the redistributed messages, Figure 3.4 (b) overlaps the source distribution (scheme-I) and the destination distribution (scheme-II) with solid lines and dotted lines, respectively. New blocks, created by the intersections of the two kinds of lines, represent messages. Processors transmit the generated messages to their destination processors in accordance with scheme-II.

**Figure 3.4: 2-D array decomposition. (a) Destination distribution scheme {6, 2}*{3, 3}. (b) Overlapped decomposition.**

Similar to the 1-D indexing scheme, arrays of scheme parameters are required for multi-dimensional indexing. Thus, *Sx* and *Dx* store the parameters of the source and destination distribution schemes for the first dimension, while *Sy* and *Dy* store those for the second dimension. Accordingly, for an S = {Sx(), Sy()} to new = {Dx(), Dy()} 2-D GEN_BLOCK array redistribution, the message from source processor *SP*_xy to destination processor *DP*_x'y' can be formulated with the following 2-D indexing equation, which calculates the 2-D range of a message:

*msg*_{xy→x'y'}^{send} = [*r*, *s*]*[*t*, *u*] (3.9)

where *r* ≤ *s* and *t* ≤ *u*;

*r* = Max(1 + Σ_{a=0}^{y} *Sy*[a], 1 + Σ_{b=0}^{y'} *Dy*[b]) (3.10)

*s* = Min(Σ_{a=0}^{y+1} *Sy*[a], Σ_{b=0}^{y'+1} *Dy*[b]) (3.11)

*t* = Max(1 + Σ_{a=0}^{x} *Sx*[a], 1 + Σ_{b=0}^{x'} *Dx*[b]) (3.12)

*u* = Min(Σ_{a=0}^{x+1} *Sx*[a], Σ_{b=0}^{x'+1} *Dx*[b]) (3.13)

Figure 3.5 illustrates the area indicated by equation (3.9) mapped onto a two-dimensional array. According to the given distribution schemes, *Sx*[a] = {0, 2, 4}, *Sy*[a] = {0, 2, 6}, *Dx*[b] = {0, 3, 3} and *Dy*[b] = {0, 6, 2}. The data blocks that *SP*_01 sends to *DP*_00, *DP*_01, *DP*_10 and *DP*_11 are [3, 6]*[1, 2], [7, 8]*[1, 2], [3, 6]*[4, 2] and [7, 8]*[4, 2], respectively. However, only [3, 6]*[1, 2] and [7, 8]*[1, 2] are valid, due to the validation conditions *r* ≤ *s* and *t* ≤ *u*.

**Figure 3.5: Area indicated by equation (3.9) on a two-dimensional array.**
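The 2-D closed forms of equation (3.9) can be sketched the same way as the 1-D case. The following is an illustrative Python sketch using the scheme arrays of this example; `send_msg_2d` is my own name, not thesis code:

```python
def send_msg_2d(Sx, Sy, Dx, Dy, x, y, x2, y2):
    """[r, s]*[t, u] of the message SP_xy sends to DP_x'y'
    (equations 3.9-3.13, with x2, y2 standing for x', y').
    A block is valid only when r <= s and t <= u."""
    r = max(1 + sum(Sy[:y + 1]), 1 + sum(Dy[:y2 + 1]))   # equation (3.10)
    s = min(sum(Sy[:y + 2]), sum(Dy[:y2 + 2]))           # equation (3.11)
    t = max(1 + sum(Sx[:x + 1]), 1 + sum(Dx[:x2 + 1]))   # equation (3.12)
    u = min(sum(Sx[:x + 2]), sum(Dx[:x2 + 2]))           # equation (3.13)
    return (r, s, t, u) if r <= s and t <= u else None

# Scheme arrays from the Figure 3.5 example.
Sx, Sy = [0, 2, 4], [0, 2, 6]
Dx, Dy = [0, 3, 3], [0, 6, 2]

print(send_msg_2d(Sx, Sy, Dx, Dy, 0, 1, 0, 0))  # [3,6]*[1,2] -> (3, 6, 1, 2)
print(send_msg_2d(Sx, Sy, Dx, Dy, 0, 1, 0, 1))  # [7,8]*[1,2] -> (7, 8, 1, 2)
print(send_msg_2d(Sx, Sy, Dx, Dy, 0, 1, 1, 0))  # u < t, block dropped: None
```

The three calls reproduce the example: only the blocks sent from *SP*_01 to *DP*_00 and *DP*_01 are valid.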


**3.5 Performance Analysis **

To evaluate the performance of the proposed mathematically closed forms, this section provides a time complexity analysis of two approaches: the centralized process and the decentralized process. In the centralized process, all communication information is obtained by one node. Assuming there are *n* nodes in the system as shown in Figure 3.6, the maximum number of communications is *n*×*n*. Each arrow represents a communication, and the time complexity to obtain the information of an arrow is O(*n*) for a 1-D array, since *n* values in the distribution schemes are used by the closed forms. Therefore, the time complexity for one node to obtain *n*×*n* pairs of communications is O(*n*^3) for a 1-D array.

**Figure 3.6: In centralized process, n*n pairs of communications are obtained by one node. **

In the decentralized process, each node obtains its own communication information. Figure 3.7 shows that the source processor *SP*_0 has at most *n* pairs of communications to its destinations. Similarly, Figure 3.8 shows that the destination processor *DP*_0 has at most *n* pairs of communications from its sources. *SP*_0 obtains the information of *n* arrows, so the time complexity is O(*n*^2) for a 1-D array, since the time complexity for each communication is O(*n*). The time complexity is likewise O(*n*^2) for *DP*_0 to obtain its *n* pairs of communications.

**Figure 3.7: In the decentralized process, *SP*_0 obtains at most n pairs of communications.**

**Figure 3.8: In the decentralized process, *DP*_0 obtains at most n pairs of communications.**

For a 2-D array, the time complexity for each arrow is O(*n*^2). The 2-D mathematically closed forms provide equations (3.9)-(3.13) to obtain the communication information.

Equation (3.9) requires (3.10)-(3.13) to calculate *r*, *s*, *t* and *u* for each data block. Since there are two dimensions, the time complexity is O(*n*^2) for each arrow. Therefore, the time complexity of the centralized process for a 2-D array, with *n*×*n* pairs of communications, is O(*n*^4), while that of the decentralized process is O(*n*^3). Figure 3.9 summarizes the time complexities of the four conditions.

**Figure 3.9: The time complexities of the four conditions: O(*n*^3) for the 1-D centralized process, O(*n*^2) for the 1-D decentralized process, O(*n*^4) for the 2-D centralized process and O(*n*^3) for the 2-D decentralized process.**

**3.6 Summary **

In this chapter, mathematically closed forms are proposed for decomposing multi-dimensional arrays before scheduling. Information about data size, source node and destination node can be obtained simply and independently by each node. The performance analysis shows that the proposed methods are efficient and have low time complexity.


**Chapter 4 **

**Communication Scheduling for Cluster **

**4.1 Introduction **

The code fragment given in Section 3.1 illustrates not only the indexing scheme issue but also the communication scheduling issue of GEN_BLOCK redistribution on a cluster. Communication scheduling methods arrange messages to be sent or received in proper communication steps to shorten communication cost and achieve good load balancing. Both are important issues for algorithm performance. Compared with the indexing scheme, more research has focused on scheduling communications as an optimization problem. However, obtaining the optimal schedule for such an operation is an NP-complete problem [48].

In this chapter, a communication scheduling heuristic, the two-phase degree-reduction (TPDR) technique, is provided to minimize communication cost and improve on existing scheduling heuristics for clusters. TPDR employs a degree-reduction iteration phase and an adjustable coloring mechanism to derive each communication step of the schedule recursively. The number of communication steps is minimized and node contention is avoided.

Characteristics of TPDR are listed below:

- TPDR is a simple and efficient scheduling method with low algorithmic complexity for performing GEN_BLOCK array redistribution on clusters.

- TPDR avoids node contention while performing GEN_BLOCK redistribution.

- TPDR is a single-pass scheduling technique; it does not need to reschedule or reallocate messages. Therefore, it is applicable to different processor groups without increasing the scheduling complexity.


**4.2 Related Works **

Much research has focused on communication efficiency for GEN_BLOCK redistribution. Lee et al. [29] presented a logical processor reordering algorithm for GEN_BLOCK; six algorithms for reducing communication cost are discussed. Guo et al. [47] proposed a divide-and-conquer algorithm for performing GEN_BLOCK redistribution. In [52], a relocation algorithm is proposed by Yook and Park; it consists of two scheduling phases, the list scheduling phase and the relocation phase. Wang et al. [48] proposed an improved algorithm based on the relocation algorithm, and especially noted that obtaining the optimal schedule is an NP-complete problem. The multiple-phase scheduling algorithm was proposed by Hsu et al. [20] to manage data segment assignment and communications to enhance performance. Yu [53] defined a 2-approximation algorithm for MECP, CMECP and SCMECP, which can be used to model data redistribution problems. Cohen et al. [12] studied the KPBS problem, which is proven to be NP-hard; in that work, at most *k* communications can be performed at the same time between clusters interconnected by a backbone. Chang et al. [10] proposed HCS and HRS to improve data access on data grids. Rauber et al. [39] presented a data redistribution library for multiprocessor task programming to split processor groups and to map tasks to processor groups.

**4.3 Two-Phase Degree-Reduction Scheduling Algorithm **

This section introduces the two-phase degree-reduction (TPDR) scheduling method for GEN_BLOCK redistribution on a cluster. It guarantees node-contention-free schedules with the minimal number of communication steps [47], which equals the *Degree*_max defined in Definition 3. TPDR consists of two phases: the degree-reduction phase and the adjustable coloring mechanism phase. The first phase schedules communications of sending or receiving processors whose degree is greater than two. In the degree-reduction phase, TPDR reduces the degree of the nodes with the maximum degree by one, scheduling the corresponding message into an appropriate communication step. The degree reduction is performed as follows:

- Stage 1: Sort the vertices with maximum degree *d* by the total size of their messages in decreasing order. Assume there are *k* nodes with degree *d*; the sorted vertices are <*V*_1, *V*_2, ..., *V*_k>.

- Stage 2: From the vertices *V*_1, *V*_2, ..., *V*_k, schedule the minimum message *m*_j = min{*m*_1, *m*_2, ..., *m*_d} into step *d*, where 1 ≤ j ≤ d. If a message of *V*_k has already been scheduled because it is the minimum message of another vertex, no other message of *V*_k will be scheduled, to avoid conflicts.

- Stage 3: Determine which messages are smaller than the length of step *d*, and schedule them into step *d* if there is no contention.

- Stage 4: *Degree*_max = *Degree*_max − 1. If *Degree*_max > 2, go to Stage 1.
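One iteration of Stages 1-2 can be sketched as follows. This is a simplified illustration in Python under my own assumptions (messages represented as `(src, dst, size)` triples; the Stage 3 packing step and the full conflict bookkeeping are omitted), not the thesis implementation:

```python
from collections import defaultdict

def degree_reduction_step(msgs):
    """One illustrative degree-reduction iteration: vertices of maximum
    degree d, sorted by total message size in decreasing order, each try
    to contribute their smallest message to step d without contention."""
    deg, total = defaultdict(int), defaultdict(int)
    for src, dst, size in msgs:
        for v in (('S', src), ('D', dst)):
            deg[v] += 1
            total[v] += size
    d = max(deg.values())
    hot = sorted((v for v in deg if deg[v] == d),
                 key=lambda v: total[v], reverse=True)     # Stage 1
    step, used = [], set()   # vertices already touched in this step
    for kind, pid in hot:                                  # Stage 2
        own = [m for m in msgs
               if (kind == 'S' and m[0] == pid) or (kind == 'D' and m[1] == pid)]
        m = min(own, key=lambda m: m[2])   # smallest message of this vertex
        if ('S', m[0]) not in used and ('D', m[1]) not in used:
            step.append(m)
            used.update({('S', m[0]), ('D', m[1])})
    remaining = [m for m in msgs if m not in step]
    return step, remaining

# Hypothetical message set (src, dst, size): SP_1 has degree 3.
msgs = [(0, 0, 5), (1, 0, 13), (1, 1, 7), (1, 2, 4), (2, 2, 9)]
step, rest = degree_reduction_step(msgs)
print(step)   # SP_1's smallest message scheduled: [(1, 2, 4)]
```

After this iteration, SP_1's degree drops from 3 to 2 in the remaining graph, matching the intent of Stage 4's decrement of *Degree*_max.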

*Definition 10: Let G' denote the graph after edges are removed in the first phase of TPDR, and |E'| denote the number of edges in G'.*

The second phase of TPDR consists of a coloring mechanism and an adjustable mechanism, both used to schedule the messages of sending or receiving processors once *Degree*_max is less than or equal to 2. When *Degree*_max is 2, the remaining bipartite graph *G'* may consist of several connected bipartite graphs. Since a bipartite graph with *Degree*_max = 2 is a linked list, the messages in each connected component can be scheduled in two steps; namely, the messages can be colored with two colors. In order to minimize the total length of the two steps, the adjustable method is proposed and used to exchange scheduling steps