Chung Hua University Master's Thesis
Title: An Efficient Communication Scheduling for Contention-Free Irregular Data Redistribution
An Efficient Communication Scheduling for Contention-Free Irregular Data Redistribution
Department: Master's Program, Department of Computer Science and Information Engineering
Student ID / Name: M092020035, Chi-Hsiu Chen
Advisor: Dr. Kung-Ming Yu
An Efficient Communication Scheduling for Contention-Free Irregular Data
Redistribution
by
Chi-Hsiu Chen
B.S., Chung Hua University, 2004
Advisor: Prof. Kung-Ming Yu
Thesis submitted in partial fulfillment of the requirements for the Degree of Master in the
Department of Computer Science and Information Engineering at Chung Hua University, Hsinchu City, Taiwan 300, R.O.C.
Abstract
Data redistribution problems on multi-computers have been extensively studied. Irregular data redistribution has received attention recently because it can distribute data segments of different sizes to processors according to their own computation capabilities. High Performance Fortran Version 2 (HPF-2) provides the GEN_BLOCK data distribution method for generating irregular data distributions. In this thesis, we develop an efficient scheduling algorithm, the Smallest Conflict Points Algorithm (SCPA), to schedule HPF-2 irregular array redistribution. SCPA is a near-optimal scheduling algorithm that satisfies both the minimal number of steps and the minimal total message size of steps for irregular data redistribution. The algorithm reduces computation costs by first scheduling the conflicting messages of the maximum-degree processors in the first phase of the redistribution process, and then arranging the remaining messages into the schedule in non-increasing order of message size. To evaluate the performance of our proposed algorithm, we have implemented SCPA along with the divide-and-conquer algorithm. The simulation results show that SCPA achieves a significant improvement in communication cost compared with the divide-and-conquer algorithm.
Keywords: Irregular data redistribution, communication scheduling, GEN_BLOCK, conflict points
摘要 (Abstract in Chinese, translated)
As data volumes grow larger and computations become more complex, distributed computing has been widely applied in fields such as biology, industry, and science to obtain better performance.
Data redistribution problems have now been widely studied. Among them, irregular data redistribution has drawn particular attention in recent years, because it can assign different amounts of data to processors according to their differing capabilities. In this thesis, we propose an efficient scheduling algorithm, the Smallest Conflict Points Algorithm (SCPA). For irregular data redistribution, SCPA finds an optimal schedule in most cases while satisfying both the minimal number of steps and the minimal total message size of steps, which are the two most important goals of irregular data redistribution. In its first phase, SCPA finds all possible conflict points among only the processors of maximum degree and places them into the same communication step, thereby reducing the computational complexity. The remaining messages, which are not conflict points, are then scheduled into the communication steps in non-increasing order of message size, subject to the constraints imposed by the conflict points. The schedule obtained by SCPA achieves the minimal number of communication steps and the minimal total message size of steps.
To evaluate the effectiveness of SCPA, we implemented SCPA and compared it with the Divide-and-Conquer Algorithm. The simulation results show that SCPA produces better schedules.
Keywords: irregular data redistribution, communication scheduling, conflict points, minimal communication steps, minimal total message size of steps
Contents
Abstract
摘要
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Regular redistribution
2.2 Irregular redistribution
Chapter 3 Preliminaries and Redistribution communication models
3.1 Divide and Conquer Algorithm
3.2 Communication models
3.3 Explicit conflict tuple and implicit conflict tuple
Chapter 4 Scheduling Algorithm
Chapter 5 Performance Evaluation and Analysis
Chapter 6 Conclusion
Reference
List of Figures
Figure 3-1 The communications between source and destination processor sets
Figure 3-2 Maximum Degree Messages Set
Figure 3-3 Explicit conflict point
Figure 3-4 Implicit conflict point
Figure 3-5 All MDMSs for the example in Figure 3-1
Figure 4-1 Results of MDMSs for Figure 3-1
Figure 4-2 The schedule obtained from SCPA
Figure 5-1 The events percentage of computing time with different numbers of processors, on the uneven data set
Figure 5-2 The events percentage of computing time with different total message sizes on 8 processors, on the uneven data set
Figure 5-3 The events percentage of computing time with different numbers of processors, on the even data set
Figure 5-4 The events percentage of computing time with different total message sizes on 8 processors, on the even data set
Figure 5-5 The events percentage of communication time with different numbers of processors, on the uneven data set
Figure 5-6 The events percentage of communication time with different total message sizes on 8 processors, on the uneven data set
Figure 5-7 The events percentage of communication time with different numbers of processors, on the even data set
Figure 5-8 The events percentage of communication time with different total message sizes on 8 processors, on the even data set
Figure 5-9 Maximum degree of (3-15) redistribution
Figure 5-10 Maximum degree of (7-13) redistribution
List of Tables
Table 1-1 18 elements and 3 logical processors
Table 1-2 BLOCK-CYCLIC(1) to BLOCK-CYCLIC(2) on 3 processors
Table 1-3 The irregular data redistribution
Table 3-1 An example of source and destination distributions in irregular array redistribution
Table 3-2 A simple scheduling
Table 5-1 The detail of the maximum degree in Figure 5-9
Table 5-2 The detail of the maximum degree in Figure 5-10
Chapter 1 Introduction
More and more scientific and engineering applications process large data sets or perform complex computations at run time. Such tasks require parallel programming on distributed systems, and appropriate data distribution is critical for the efficient execution of a data-parallel program in a distributed multi-processor computing environment. Therefore, an efficient data redistribution communication algorithm is needed to relocate data among processors. Data redistribution can be classified into two categories: regular data redistribution [1,4,5,6,8,10,12,14,17] and irregular data redistribution [3,7,21,22,23]. Regular data redistribution uses the BLOCK, CYCLIC, and BLOCK-CYCLIC(n) formats to specify the data decomposition. Table 1-1 presents the regular data distributions BLOCK, CYCLIC, and BLOCK-CYCLIC(n) for 18 elements on three logical processors. Regular data redistribution has a cyclically repeating communication pattern, so the analysis can focus on a single cycle: when BLOCK-CYCLIC(n) is mapped to BLOCK-CYCLIC(m) on p processors, every n*m*p elements form one repeating cycle.
Table 1-2 presents the regular data redistribution of BLOCK-CYCLIC (1) to
BLOCK-CYCLIC (2) on three processors.
Table 1-1 18 elements and 3 logical processors
Global index:      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
BLOCK:            P0 P0 P0 P0 P0 P0 P1 P1 P1 P1 P1 P1 P2 P2 P2 P2 P2 P2
CYCLIC:           P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2
BLOCK-CYCLIC(2):  P0 P0 P1 P1 P2 P2 P0 P0 P1 P1 P2 P2 P0 P0 P1 P1 P2 P2
BLOCK-CYCLIC(3):  P0 P0 P0 P1 P1 P1 P2 P2 P2 P0 P0 P0 P1 P1 P1 P2 P2 P2
Table 1-2 BLOCK-CYCLIC(1) to BLOCK-CYCLIC(2) on 3 processors
Global index:      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
BLOCK-CYCLIC(1):  P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2
BLOCK-CYCLIC(2):  P0 P0 P1 P1 P2 P2 P0 P0 P1 P1 P2 P2 P0 P0 P1 P1 P2 P2
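The regular mappings in Tables 1-1 and 1-2 can be written as simple owner-computation formulas. The following Python sketch is an illustration (not code from the thesis) and assumes 0-based global indices:

```python
# Sketch (not thesis code): owner computation for the three regular
# HPF distributions over a 1-D array, with 0-based global indices.

def block_owner(i, n_elems, n_procs):
    """BLOCK: contiguous chunks of ceil(n_elems / n_procs) elements."""
    chunk = -(-n_elems // n_procs)  # ceiling division
    return i // chunk

def cyclic_owner(i, n_procs):
    """CYCLIC: element i goes to processor i mod n_procs."""
    return i % n_procs

def block_cyclic_owner(i, n, n_procs):
    """BLOCK-CYCLIC(n): blocks of n elements dealt out round-robin."""
    return (i // n) % n_procs

# Reproducing the rows of Table 1-1 (18 elements, 3 processors):
print([block_owner(i, 18, 3) for i in range(18)])
print([cyclic_owner(i, 3) for i in range(18)])
print([block_cyclic_owner(i, 2, 3) for i in range(18)])
```

The last line reproduces the BLOCK-CYCLIC(2) row, which is also the target distribution of Table 1-2.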
In many situations, data cannot be distributed evenly. Hence, irregular distribution has received increasing attention. Irregular distribution uses user-defined functions to specify an uneven data distribution. High Performance Fortran version 2 (HPF-2) provides the GEN_BLOCK data distribution directive, which maps generalized, unequal-size consecutive segments of an array onto consecutive processors. This makes it possible for each processor to handle an appropriate quantity of data according to its computation capability. The data redistribution is often irregular and changes at run time. Each processor must know which local elements of the array to send to which processors and, at the same time, which elements of the array it will receive from which processors.
For regular array redistribution, Hsu et al. [14] proposed an Optimal Processor Mapping (OPM) scheme to minimize the data transmission cost of general BLOCK-CYCLIC data realignment. OPM uses a maximum matching of the realigned logical processors to maximize data hits and thereby reduce the amount of data exchanged. For irregular array redistribution, Guo et al. [22] proposed a divide-and-conquer algorithm that obtains near-optimal schedules while minimizing both the total communication message size and the number of steps. Table 1-3 presents an irregular data redistribution with S = {6, 8, 4} and D = {4, 5, 9} on three processors.
Table 1-3 The irregular data redistribution
Global index:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
Source:       P0 P0 P0 P0 P0 P0 P1 P1 P1 P1 P1 P1 P1 P1 P2 P2 P2 P2
Destination:  P0 P0 P0 P0 P1 P1 P1 P1 P1 P2 P2 P2 P2 P2 P2 P2 P2 P2
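The messages of a GEN_BLOCK redistribution such as Table 1-3 can be derived by intersecting the index intervals of source and destination processors. The following Python sketch is an illustration of this derivation, not code from the thesis:

```python
# Sketch (not thesis code): derive the redistribution messages of two
# GEN_BLOCK distributions by intersecting processor index intervals.

def gen_block_messages(src_sizes, dst_sizes):
    """Return {(sp, dp): size} for every overlapping source/destination
    segment pair; sp/dp are source/destination processor ranks."""
    def bounds(sizes):
        out, start = [], 0
        for s in sizes:
            out.append((start, start + s))  # half-open interval [start, end)
            start += s
        return out
    msgs = {}
    for sp, (s0, s1) in enumerate(bounds(src_sizes)):
        for dp, (d0, d1) in enumerate(bounds(dst_sizes)):
            overlap = min(s1, d1) - max(s0, d0)
            if overlap > 0:
                msgs[(sp, dp)] = overlap
    return msgs

# S = {6, 8, 4} and D = {4, 5, 9} from Table 1-3:
print(gen_block_messages([6, 8, 4], [4, 5, 9]))
# {(0, 0): 4, (0, 1): 2, (1, 1): 3, (1, 2): 5, (2, 2): 4}
```

For example, SP0 owns global indices 1-6 and DP1 owns 5-9, so SP0 sends DP1 the two elements 5 and 6, matching Table 1-3.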
In this thesis, we present the smallest-conflict-points algorithm (SCPA) to efficiently perform GEN_BLOCK array redistribution. The main idea of SCPA is to first schedule the conflicting messages of the maximum-degree processors, so that the schedule satisfies the minimal number of steps and the minimal total message size of steps, and then to arrange the remaining messages into the schedule in non-increasing order of message size. SCPA can effectively reduce the communication time of the data redistribution process. It is not only an optimal algorithm in terms of the minimal number of steps, but also a near-optimal algorithm with respect to the minimal total message size of steps.
The rest of this thesis is organized as follows. Chapter 2 gives a brief survey of related work. Chapter 3 introduces the communication model of irregular data redistribution and gives an example of GEN_BLOCK array redistribution as a preliminary. Chapter 4 presents the smallest-conflict-points algorithm for the irregular redistribution problem. Performance analysis and simulation results are presented in Chapter 5. Finally, conclusions are given in Chapter 6.
Chapter 2 Related Work
Many data redistribution techniques have been proposed in the literature. They are usually developed for regular or irregular problems [3], either as multi-computer compiler techniques or as runtime support techniques.
2.1 Regular redistribution
In general, techniques for regular data redistribution fall into two categories: communication-set identification and communication optimization. The former includes PITFALLS [19] and ScaLAPACK [18] for index-set generation; Park et al. [17] proposed algorithms for BLOCK-CYCLIC data redistribution between processor sets; Dongarra et al. [16] and Bandera et al. [1] presented algorithmic redistribution methods for BLOCK-CYCLIC decompositions and parallel sparse redistribution code for BLOCK-CYCLIC data redistribution based on the CRS structure; and the Generalized Basic-Cycle Calculation method was presented in [13].
Techniques in the communication-optimization category provide different approaches to reduce the communication overhead [8, 11, 15] of a redistribution operation: the multiphase redistribution strategy [9] for reducing message startup cost, the communication scheduling approaches [2, 6, 12, 23] for avoiding node contention, and the strip-mining approach [20] for overlapping communication and computation.
2.2 Irregular redistribution
For the irregular array redistribution problem, some works have concentrated on indexing and message generation, while others have addressed communication efficiency. Guo et al. [7] presented a symbolic analysis method for communication-set generation that reduces the communication cost of irregular array redistribution. On communication efficiency, Lee et al. [11] presented a logical processor reordering algorithm for irregular array redistribution. Guo et al. [21, 22] proposed a divide-and-conquer algorithm for performing irregular array redistribution; in this method, communication messages are first divided into groups using Neighbor Message Sets (NMSs), i.e., sets of messages that have the same sender or receiver, and the communication steps are scheduled after those NMSs are merged according to their contention relationships. Yook and Park [23] presented a relocation algorithm; however, it may incur high scheduling overhead and degrade the performance of a redistribution algorithm.
In this thesis, we present an efficient algorithm that uses the smallest set of conflict points, improving on the divide-and-conquer algorithm. The divide-and-conquer algorithm uses the neighbor message set (NMS), which is a set of more than one message sent from the same processor or received by the same processor. A message included in two different NMSs is defined as a link message.
Chapter 3 Preliminaries and Redistribution communication models
3.1 Divide and Conquer Algorithm
The divide-and-conquer algorithm generates a sub-optimal communication scheduling table, attempting to satisfy both the minimal-size condition and the minimal-step condition. It consists of three parts: break the problem into a set of sub-problems that are similar to the original problem but smaller; solve the sub-problems recursively; and combine the solutions of the sub-problems into a solution of the original problem. Each NMS is a sub-problem unit, and two neighboring NMSs are grouped as a pair. After the pairs of NMSs are obtained, a merge process produces the schedule. First, neighboring NMS pairs are merged recursively. Then the scheduling tables for each group are generated. If a conflict is detected, the link message and the messages scheduled with it are picked up, the remaining messages are sorted, and finally the link message is put back into the schedule together with those scheduled messages. The algorithm is described below.
Divide-and-Conquer Algorithm:
{
    Separate NMS1, NMS2, ..., NMSk;        // k: total number of NMSs
    // First phase: separation and grouping
    for (i = 0; i < k; i += 2) {
        pick the left-down link message out and put it aside;
        tmpMax(i) = Join(NMSi+1, NMSi+2);
        add the left-down link message to tmpMax(i);
        update S(p, q);
    }
    // Second phase: recursive merging
    j = 2;
    while (j < k) {
        for (i = j; i < k; i += 2 * j) {
            if (Order(i in NMSi) != Order(i in NMSi+1))
                Merge(sub-matrix from M(i-j) to M(i+j));
        }
        j = 2 * j;
    }
}

Merging Algorithm (subroutine):
Range: sub-matrix from M(i-j) to M(i+j); Location: link message i
{
    tmpOrder = 0;
    while (1) {
        // tmpNum = number of elements in link message i's row or column
        // with tmpOrder; if any element is larger than link message i, tmpNum++;
        if (tmpOrder == Max(number of messages in i's row,
                            number of messages in i's column)) {
            S[p[i]][q[i]] = tmpOrder;
            break;
        }
        else if (tmpNum == 0) {
            S[p[i]][q[i]] = tmpOrder;
            break;
        }
        else if (tmpNum == 1) {
            relocate with minimum cost among {
                order(i) == Order(i in NMSi+1),
                order(i) == Order(i in NMSi),
                order(i) when relocating link message i's row and column
            };
            update S(x, y) in sub-matrix from M(i-j) to M(i+j);
            break;
        }
        tmpOrder++;
    }
}
3.2 Communication models
Data redistribution is a set of routines that transfer all the elements from a set of source processors S to a set of destination processors T. The message sizes are specified by user-defined random integer values for the array mapping from source processors to destination processors. Since node contention considerably influences performance, a processor can send a message to only one other processor in each communication step; by the same rule, a processor can receive a message from only one other processor per step.
To simplify the presentation, the notations and terminologies used in this thesis are defined as follows.
Definition 1: GEN_BLOCK redistribution on a one-dimensional array A[1:N] over P processors. The source processors are denoted SPi and the destination processors DPj, where 0 ≤ i, j ≤ P-1.
Definition 2: The redistribution time is separated into the startup time, denoted ts, and the communication (transmission) time, denoted tcomm.
Definition 3: To satisfy the minimum-steps condition while each processor sends/receives at most one message per step, some messages cannot be scheduled in the same communication step; such a set of messages is called a conflict tuple [22].
Data redistribution can be implemented by two kinds of methods: non-blocking and blocking scheduling algorithms. A non-blocking scheduling algorithm is faster than a blocking one, but it needs more buffers and careful synchronization control. In this thesis, we focus on blocking scheduling algorithms.
Unlike regular redistribution, irregular data redistribution has no cyclic message-passing pattern, and the message transmission links do not overlap. Hence the total number of message links N satisfies numprocs ≤ N ≤ 2 × numprocs − 1, where numprocs is the number of processors.
Table 3-1 shows an example of redistributing two GEN_BLOCK distributions on an array A[1:101]. The communications between the source and destination processor sets are depicted in Figure 3-1. There are fifteen communication messages in total, m1, m2, m3, ..., m15, among the processors involved in the redistribution. In this example, {m2, m3, m4} is a conflict tuple since these messages have the common source processor SP1; {m7, m8, m9} is also a conflict tuple because of the common destination processor DP4. The maximum degree in this example is 3. Table 3-2 shows a simple schedule for this example.
Table 3-1 An example of source and destination distributions in irregular array redistribution
Source distribution
Source Processor
SP SP0 SP1 SP2 SP3 SP4 SP5 SP6 SP7
Size 12 20 15 14 11 9 9 11
Destination distribution
Destination Processor
DP DP0 DP1 DP2 DP3 DP4 DP5 DP6 DP7
Size 17 10 13 6 17 12 11 15
Figure 3-1 The communications between source and destination processor sets
Table 3-2 A simple scheduling
Schedule Table
Step 1 m2 m5 m9 m12 m14
Step 2 m1 m3 m6 m8 m11 m15
Step 3 m4 m7 m10 m13
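The contention rule behind Definition 3 and the schedule of Table 3-2 can be checked mechanically: within each step, no processor may appear twice as a sender or twice as a receiver. The following Python sketch is an assumed helper, not thesis code, and its toy src/dst maps are illustrative:

```python
# Sketch (assumed helper, not thesis code): verify that a schedule is
# contention-free, i.e. in every step each processor sends at most one
# message and receives at most one message.

def is_contention_free(schedule, src, dst):
    """schedule: list of steps, each a list of message ids.
    src/dst: message id -> sending / receiving processor rank."""
    for step in schedule:
        senders = [src[m] for m in step]
        receivers = [dst[m] for m in step]
        if len(senders) != len(set(senders)) or len(receivers) != len(set(receivers)):
            return False
    return True

# Toy fragment: m2, m3, m4 share the source processor SP1, so they form a
# conflict tuple and can never share a communication step.
src = {"m2": 1, "m3": 1, "m4": 1, "m5": 2}
dst = {"m2": 1, "m3": 2, "m4": 2, "m5": 3}

print(is_contention_free([["m2", "m5"], ["m3"], ["m4"]], src, dst))  # True
print(is_contention_free([["m2", "m3"]], src, dst))                  # False
```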
3.3 Explicit conflict tuple and implicit conflict tuple
[Figure 3-1 content, recovered from residue: the message sizes of m1 to m15 are 12, 5, 10, 5, 8, 6, 1, 14, 2, 9, 3, 6, 5, 4, 11.]
The communication time of a message passing operation is modeled with two parameters: the startup time ts and the unit data transmission time tm. The startup time is paid once for each communication event and is independent of the message size to be communicated. The data transmission time is proportional to the message size, size(m). The communication time of a message m is therefore

    tcomm(m) = ts + size(m) × tm                          (1)

The communication time of one communication step, tcomm(stepi), is the maximum communication time over the messages in that step:

    tcomm(stepi) = max{ tcomm(m) : m ∈ stepi }            (2)

The total communication time of all steps, tcomm(total), is the sum of the communication times of the individual steps:

    tcomm(total) = Σ_{0 < i ≤ k} tcomm(stepi)             (3)

The length of these steps determines the data transmission overhead. The minimum number of steps equals the maximum degree k; a message that cannot be put into any of the k minimum steps must be related to a processor that has the maximum number of transmission links. Figure 3-2 shows the maximum degree of Figure 3-1: SP1, SP2 and DP4 have the maximum degree (k = 3) over messages m2 to m9. Since each processor can send/receive at most one message to/from another processor in each communication step, we first concentrate on the messages of these maximum-degree processors, grouped into Maximum Degree Message Sets (MDMSs), as shown in Figure 3-2. If the messages in the MDMSs can be put into k steps with no conflict, the other messages, which belong to processors of degree less than the maximum, are easier to put into the remaining slots without increasing the number of steps.
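Equations (1)-(3) can be evaluated directly for a given schedule. The Python sketch below is illustrative (the ts and tm values are assumptions, not measurements from the thesis):

```python
# Sketch of the cost model of Eqs. (1)-(3): per-message cost ts + size*tm,
# per-step cost is the maximum message cost in the step, total cost is the
# sum over all steps. ts and tm values are illustrative assumptions.

def schedule_cost(steps, ts=1.0, tm=0.01):
    """steps: list of communication steps, each a list of message sizes."""
    total = 0.0
    for step in steps:
        # Eq. (2): the slowest message dominates the step;
        # each term ts + size * tm is Eq. (1).
        total += max(ts + size * tm for size in step)
    return total  # Eq. (3): sum over the k steps

# Three steps; the largest message in each step (10, 8, 5) sets its length.
print(schedule_cost([[10, 4], [8, 3], [5]]))
```

This also shows why balancing large messages across steps matters: only the largest message per step contributes to the total.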
We say a message is an explicit conflict point if it belongs to two MDMSs. There exists at most one explicit conflict point between two MDMSs. In Figure 3-2, m7 is an explicit conflict point since it belongs to the two MDMSs {m5, m6, m7} and {m7, m8, m9}. On the other hand, two neighboring MDMSs may not contain the same message, yet each may contain a message sent by the same processor, or received by the same processor; we call such a message an implicit conflict point. As shown in Figure 3-3, m4 and m5 are contained in different MDMSs. DP2 receives only the two messages m4 and m5, so they cannot form an MDMS of their own; but since m4 and m5 are owned by different MDMSs while sharing a receiver, m4 is an implicit conflict point. Although m5 is also covered by two MDMSs, it is restricted by m4, and hence m5 will not cause a conflict. Figure 3-5 depicts all MDMSs for the example shown in Figure 3-1.
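The definitions above can be sketched in code. The following Python fragment (assumed helpers, not thesis code) builds only the maximum-degree message sets of a small message graph and reports the messages shared by two MDMSs, i.e. the explicit conflict points; the toy message set mimics the {m5, m6, m7} / {m7, m8, m9} example:

```python
# Sketch (assumed helpers, not thesis code): an MDMS here is the message set
# of a processor whose degree equals the maximum degree; an explicit conflict
# point is a message that belongs to two MDMSs.

from collections import defaultdict

def find_mdmss(messages):
    """messages: {id: (src, dst)}. Returns the MDMSs as sorted id lists."""
    by_proc = defaultdict(list)
    for mid, (s, d) in messages.items():
        by_proc[("S", s)].append(mid)   # messages sent by source s
        by_proc[("D", d)].append(mid)   # messages received by destination d
    k = max(len(v) for v in by_proc.values())  # maximum degree
    return [sorted(v) for v in by_proc.values() if len(v) == k]

def explicit_conflict_points(mdmss):
    """Messages shared by any two MDMSs."""
    points = set()
    for i in range(len(mdmss)):
        for j in range(i + 1, len(mdmss)):
            points |= set(mdmss[i]) & set(mdmss[j])
    return points

# Toy data: SP2 sends {m5, m6, m7}, DP4 receives {m7, m8, m9}; both have
# the maximum degree 3 and share m7, so m7 is an explicit conflict point.
msgs = {"m5": (2, 3), "m6": (2, 3), "m7": (2, 4), "m8": (3, 4), "m9": (5, 4)}
print(explicit_conflict_points(find_mdmss(msgs)))  # {'m7'}
```

Implicit conflict points need the extra neighbor-sharing check described in the text and are omitted from this sketch.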
Figure 3-2 Maximum Degree Messages Set
Figure 3-3 Explicit conflict point
Figure 3-4 Implicit conflict point
Chapter 4 Scheduling Algorithm
The main goal of irregular array redistribution is to minimize the number of communication steps as well as the total message size of steps. The conflict points chosen in the divide-and-conquer algorithm [22] are the messages covered by two NMSs, but some of those messages do not really cause conflicts. Therefore, we select only the smallest set of conflict points that really cause conflicts, to loosen the schedule constraints and minimize the total message size of the schedule.
The smallest conflict points algorithm consists of four parts:
(1) Pick out the MDMSs from the given data redistribution problem.
(2) Find the explicit conflict points and implicit conflict points, and schedule all conflict points into the same schedule step.
(3) Select messages in the MDMSs in non-increasing order of message size. Schedule each message into the step whose messages have similar size, keeping the property that each processor sends/receives at most one message to/from a processor per step. Repeat until no MDMS messages are left.
(4) Schedule the messages that do not belong to any MDMS in non-increasing order of message size. Repeat until no messages are left.
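Steps (2)-(4) above can be approximated by a short greedy sketch, assuming the conflict points have already been identified by steps (1)-(2). This is a simplified reading of SCPA, not the thesis implementation: for brevity it places each remaining message into the first contention-free step rather than matching step sizes, and the message data below are illustrative, not the Figure 3-1 values:

```python
# Simplified SCPA sketch (not thesis code): pin the given conflict points
# together into the first step, then place the remaining messages in
# non-increasing size order into the first step whose sender and receiver
# slots are still free, opening a new step only when necessary.

def scpa_schedule(messages, conflict_points):
    """messages: {id: (src, dst, size)}. Returns a list of steps (id lists)."""
    steps = [list(conflict_points)]  # all conflict points share one step
    rest = sorted((m for m in messages if m not in conflict_points),
                  key=lambda m: -messages[m][2])  # non-increasing size
    for m in rest:
        s, d, _ = messages[m]
        for step in steps:
            # contention check: sender s and receiver d must both be free
            if all(messages[x][0] != s and messages[x][1] != d for x in step):
                step.append(m)
                break
        else:
            steps.append([m])
    return steps

msgs = {
    "m2": (1, 1, 5), "m3": (1, 2, 10), "m4": (1, 2, 5),
    "m5": (2, 2, 8), "m6": (2, 3, 6), "m7": (2, 4, 1),
    "m8": (3, 4, 14), "m9": (4, 4, 2),
}
print(scpa_schedule(msgs, ["m4", "m7"]))
# [['m4', 'm7'], ['m8', 'm3', 'm6'], ['m5', 'm2', 'm9']]
```

With maximum degree 3 in this toy instance, the greedy placement still finishes in three steps.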
From Figure 4-1, we can pick out four MDMSs: MDMS1 = {m2, m3, m4}, MDMS2 = {m4, m5}, MDMS3 = {m5, m6, m7} and MDMS4 = {m7, m8, m9}. We first schedule m4 and m7 into the same step. Then we schedule the messages in the MDMSs in non-increasing order of message size: m8, m3, m5, m6, m2, m9. After that, we schedule the remaining messages, which do not belong to any MDMS, in non-increasing order of message size: m1, m15, m10, m12, m13, m14, m11. Figure 4-2 shows the final schedule obtained from the smallest conflict points algorithm.
The smallest conflict points algorithm is given as follows.
========================================================
Pick out MDMSs Algorithm:
{
    j = 0;
    // numprocs: total number of processors
    for (i = 0; i < numprocs; i++) {
        if (degree of i == maxdegree) {
            pick i's messages into MDMSj;
            j++;
        }
        else {
            if (i's messages are also owned by two MDMSs) {
                pick i's messages into MDMSj;
                j++;
            }
        }
    }
}

Find conflict point algorithm:
{
    // m: the count of the MDMSs
    for (i = 0; i < m-1; ) {
        if (MDMSi has the same message as MDMSi+1) {
            this same message is a conflict point;
            i += 2;
        }
        else
            i++;
    }
}
========================================================
Figure 4-1 Results of MDMSs for Figure 3-1 (MDMS messages m2-m9; partial schedule: S1: m8 m3 m5, S2: m6 m2 m9, S3: m4 m7)
Figure 4-2 The schedule obtained from SCPA:
S1: m8 m3 m5 m1 m15 m10 m12
S2: m6 m2 m9 m13 m11
S3: m4 m7 m14
Chapter 5 Performance Evaluation and Analysis
To evaluate the performance of the proposed method, we have implemented SCPA along with the divide-and-conquer algorithm [22]. The performance simulation is discussed for two classes, even GEN_BLOCK and uneven GEN_BLOCK distributions. In an even GEN_BLOCK distribution, each processor owns a similar amount of data. In contrast, in an uneven distribution a few processors might be allocated large volumes of data. Since array elements can be concentrated on some specific processors, those processors are also likely to have the maximum degree of communications.
The simulation program generates a set of random integers as the message sizes. We set the number of source processors equal to the number of target processors so that no processor is left without messages. Moreover, the total message size of the source processors equals that of the target processors, keeping the balance between source and target processors.
To evaluate the two algorithms fairly, both programs were written in the single-program multiple-data (SPMD) paradigm with MPI and executed on an SMP/Linux cluster consisting of 24 SMP nodes. In the following figures, "SCPA Better" represents the percentage of events in which SCPA has lower total computation (communication) time than the divide-and-conquer algorithm, while "DCA Better" gives the reverse situation; "The Same Results" counts the events in which both algorithms have the same total computation (communication) time. In the uneven distribution, the upper bound of the message size is set to (totalsize/numprocs)*1.5 and the lower bound to (totalsize/numprocs)*0.3, where totalsize is the total size of the messages and numprocs is the number of processors. In the even distribution, the upper bound is (totalsize/numprocs)*1.3 and the lower bound is (totalsize/numprocs)*0.7.
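The test-data generation described above can be sketched as follows. This is a Python illustration under the stated bounds; the rescaling details are assumptions, not the thesis implementation:

```python
# Sketch (assumed implementation): random GEN_BLOCK segment sizes bounded by
# (total/numprocs)*lower and (total/numprocs)*upper, then rescaled so the
# segments sum exactly to the requested total.

import random

def random_gen_block(total, numprocs, lower, upper, seed=None):
    rng = random.Random(seed)
    avg = total / numprocs
    sizes = [rng.randint(int(avg * lower), int(avg * upper))
             for _ in range(numprocs)]
    # Rescale, then push the rounding remainder onto the last segment so
    # the distribution sums exactly to `total`.
    scale = total / sum(sizes)
    sizes = [max(1, int(s * scale)) for s in sizes]
    sizes[-1] += total - sum(sizes)
    return sizes

src = random_gen_block(10000, 8, 0.3, 1.5, seed=1)  # uneven bounds
dst = random_gen_block(10000, 8, 0.7, 1.3, seed=2)  # even bounds
print(src, sum(src))
print(dst, sum(dst))
```

Matching totals on both sides keeps the source and target distributions balanced, as required by the simulation setup.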
The total message size is 1M. Figure 5-1 to Figure 5-4 show the simulation results of both the smallest-conflict-points algorithm (SCPA) and the divide-and-conquer algorithm with different numbers of processors and total message sizes. The number of processors ranges from 8 to 24. We can observe that SCPA has better performance than the divide-and-conquer algorithm on uneven data redistribution.
From the even-case data in Figure 5-3 and Figure 5-4, we can observe that SCPA performs even better than in the uneven case. Figure 5-3 and Figure 5-4 also illustrate that SCPA outperforms divide-and-conquer in at least 85% of the events for any total message size and any number of processors. Figure 5-5 to Figure 5-8 depict the communication time of SCPA and the divide-and-conquer algorithm against the number of processors and the total message size in the uneven (even) cases. The communication cost (time) is measured on real message exchanges, so it depends on the bandwidth, I/O, and other factors of the real execution environment. In both even and uneven cases, SCPA performs slightly better than the divide-and-conquer algorithm.
Figure 5-1 The events percentage of computing time is plotted with different numbers of processors (The Same Results / SCPA Better / DCA Better, uneven data set)
Figure 5-2 The events percentage of computing time is plotted with different total message sizes on 8 processors (uneven data set)
Figure 5-3 The events percentage of computing time is plotted with different numbers of processors (even data set)
Figure 5-4 The events percentage of computing time is plotted with different total message sizes on 8 processors (even data set)
Figure 5-5 The events percentage of communication time is plotted with different numbers of processors (SCPA Better / DCA Better, uneven data set)
Figure 5-6 The events percentage of communication time is plotted with different total message sizes on 8 processors (uneven data set)
Figure 5-7 The events percentage of communication time is plotted with different numbers of processors (even data set)
Figure 5-8 The events percentage of communication time is plotted with different total message sizes on 8 processors (even data set)
Figure 5-9 and Table 5-1 show the maximum degree for different numbers of nodes in the uneven cases; Figure 5-10 and Table 5-2 show the same for the even cases. They show that as the number of processors increases, the maximum degree grows; and as the maximum degree grows, the complexity of scheduling increases.
Figure 5-9 Maximum degree of (3-15) redistribution (event percentage vs. number of processors, maxdegree = 1 to 6)
Table 5-1 The detail of the maximum degree in Figure 5-9
Node=4 Node=8 Node=12 Node=16 Node=20 Node=24
Maxdegree=1 0 0 0 0 0 0
Maxdegree=2 0 0 0 0 0 0
Maxdegree=3 4479 825 126 15 3 0
Maxdegree=4 5474 8560 8632 8116 7546 6993
Maxdegree=5 47 614 1240 1862 2434 2988
Maxdegree=6 0 1 2 7 17 19
Figure 5-10 Maximum degree of (7-13) redistribution (event percentage vs. number of processors, maxdegree = 1 to 6)
Table 5-2 The detail of the maximum degree in Figure 5-10
Node=4 Node=8 Node=12 Node=16 Node=20 Node=24
Maxdegree=1 0 0 0 0 0 0
Maxdegree=2 0 0 0 0 0 0
Maxdegree=3 5009 2425 1337 739 408 213
Maxdegree=4 4991 7575 8663 9261 9592 9787
Maxdegree=5 0 0 0 0 0 0
Chapter 6 Conclusion
In this thesis, we have presented an efficient scheduling algorithm, the smallest conflict points algorithm (SCPA), for irregular data redistribution. The algorithm uses the smallest set of conflict points to give the schedule more flexibility. First, it obtains the maximum degree message sets (MDMSs). Second, it picks out the explicit conflict points and implicit conflict points and puts the conflict points into the same step. Third, it places the messages of the MDMSs in decreasing order of message length. Finally, it places the non-MDMS messages by the same method as the third step. The algorithm can effectively reduce communication time in the process of data redistribution. SCPA is not only an optimal algorithm in terms of the minimal number of steps, but also a near-optimal algorithm with respect to the minimal total message size of steps. The proposed method not only avoids node contention but also shortens the overall communication length.
In the experiments, SCPA was better than DCA in at least 80% of the uneven cases and in at least 89% of the even cases, and its advantage over DCA grows as the number of processors increases.
To verify the performance of our proposed algorithm, we have implemented SCPA as well as the divide-and-conquer redistribution algorithm. The experimental results show that SCPA is better than DCA in both communication cost and the simulated schedules. The results also indicate that both algorithms perform well on GEN_BLOCK redistribution, but in most cases SCPA has better performance than the divide-and-conquer redistribution algorithm.