Chung Hua University Master's Thesis
Title: An Efficient Communication Scheduling for Contention-Free Irregular Data Redistribution
An Efficient Communication Scheduling for Contention-Free Irregular Data Redistribution
Department: Master's Program, Department of Computer Science and Information Engineering
Student ID / Name: M092020035, Chi-Hsiu Chen
Advisor: Dr. Kung-Ming Yu
An Efficient Communication Scheduling for Contention-Free Irregular Data
Redistribution
by
Chi-Hsiu Chen
B.S., Chung Hua University, 2004
Advisor: Prof. Kung-Ming Yu
Thesis submitted in partial fulfillment of the requirements for the Degree of Master in the
Department of Computer Science and Information Engineering at Chung Hua University, Hsinchu City, Taiwan 300, R.O.C.
Abstract
Data redistribution problems on multi-computers have been extensively studied. Irregular data redistribution has received attention recently because it can distribute data segments of different sizes to processors according to their own computation capabilities. High Performance Fortran Version 2 (HPF-2) provides the GEN_BLOCK data distribution method for generating irregular data distributions. In this thesis, we develop an efficient scheduling algorithm, the Smallest Conflict Points Algorithm (SCPA), to schedule HPF-2 irregular array redistribution. SCPA is a near-optimal scheduling algorithm that satisfies both the minimal number of steps and the minimal total message size of steps for irregular data redistribution. The algorithm reduces computation costs by first scheduling the conflicting messages of the maximum-degree processors in the first phase of the redistribution process, and then arranging the remaining messages into the schedule in non-increasing order of message size. To evaluate the performance of our proposed algorithm, we have implemented SCPA along with the divide-and-conquer algorithm. The simulation results show that SCPA achieves a significant improvement in communication cost compared with the divide-and-conquer algorithm.
Keywords: Irregular data redistribution, communication scheduling, GEN_BLOCK, conflict points
摘要 (Abstract in Chinese, translated)
As data volumes grow larger and computations become more complex, distributed computing has been widely applied in fields such as biology, industry, and science to obtain better performance.
Data redistribution problems have now been widely studied. Among them, irregular data redistribution has drawn particular attention in recent years, because it can assign different amounts of data to processors according to their differing capabilities. In this thesis, we propose an efficient scheduling algorithm, the Smallest Conflict Points Algorithm (SCPA). For irregular data redistribution, SCPA finds an optimal schedule in most cases while satisfying both the minimal number of steps and the minimal total message size of steps, which are the two most important goals of irregular data redistribution. In its first phase, SCPA finds all possible conflict points among only the processors of maximum degree and places them into the same communication step, thereby reducing the computational complexity. The remaining messages, which are not conflict points, are then scheduled into the communication steps in non-increasing order of message size, subject to the constraints imposed by the conflict points. The schedule obtained by SCPA achieves the minimal number of communication steps and the minimal total message size of steps.
To evaluate the effectiveness of SCPA, we implemented SCPA and compared it with the Divide-and-Conquer Algorithm. The simulation results show that SCPA produces better schedules.
Keywords: irregular data redistribution, communication scheduling, conflict points, minimal communication steps, minimal total message size of steps
Contents
Abstract
摘要
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Regular redistribution
2.2 Irregular redistribution
Chapter 3 Preliminaries and Redistribution communication models
3.1 Divide and Conquer Algorithm
3.2 Communication models
3.3 Explicit conflict tuple and implicit conflict tuple
Chapter 4 Scheduling Algorithm
Chapter 5 Performance Evaluation and Analysis
Chapter 6 Conclusion
Reference
List of Figures
Figure 3-1 The communications between source and destination processor sets
Figure 3-2 Maximum Degree Messages Set
Figure 3-3 Explicit conflict point
Figure 3-4 Implicit conflict point
Figure 3-5 All MDMSs for the example in Figure 3-1
Figure 4-1 Results of MDMSs for Figure 3-1
Figure 4-2 The schedule obtained from SCPA
Figure 5-1 The events percentage of computing time with different numbers of processors, on the uneven data set
Figure 5-2 The events percentage of computing time with different total message sizes on 8 processors, on the uneven data set
Figure 5-3 The events percentage of computing time with different numbers of processors, on the even data set
Figure 5-4 The events percentage of computing time with different total message sizes on 8 processors, on the even data set
Figure 5-5 The events percentage of communication time with different numbers of processors, on the uneven data set
Figure 5-6 The events percentage of communication time with different total message sizes on 8 processors, on the uneven data set
Figure 5-7 The events percentage of communication time with different numbers of processors, on the even data set
Figure 5-8 The events percentage of communication time with different total message sizes on 8 processors, on the even data set
Figure 5-9 Maximum degree of (3-15) redistribution
Figure 5-10 Maximum degree of (7-13) redistribution
List of Tables
Table 1-1 18 elements and 3 logical processors
Table 1-2 BLOCK-CYCLIC(1) to BLOCK-CYCLIC(2) on 3 processors
Table 1-3 The irregular data redistribution
Table 3-1 An example of source and destination distributions in irregular array redistribution
Table 3-2 A simple scheduling
Table 5-1 The detail of the maximum degree in Figure 5-9
Table 5-2 The detail of the maximum degree in Figure 5-10
Chapter 1 Introduction
More and more scientific and engineering applications process large data sets or perform complex computations at run time. Such tasks require parallel programming on distributed systems, and appropriate data distribution is critical for the efficient execution of a data-parallel program in a distributed multi-processor computing environment. Therefore, an efficient data redistribution communication algorithm is needed to relocate data among processors. Data redistribution can be classified into two categories: regular data redistribution [1,4,5,6,8,10,12,14,17] and irregular data redistribution [3,7,21,22,23]. Regular data redistribution uses the BLOCK, CYCLIC, and BLOCK-CYCLIC(n) formats to specify the data decomposition. Table 1-1 presents the regular data distributions BLOCK, CYCLIC, and BLOCK-CYCLIC(n) for 18 elements on three logical processors. Regular data redistribution has a cyclically repeating communication pattern, so the analysis can focus on a single cycle: when BLOCK-CYCLIC(n) is mapped to BLOCK-CYCLIC(m) on p processors, every n*m*p elements form one repeating cycle.
Table 1-2 presents the regular data redistribution of BLOCK-CYCLIC (1) to
BLOCK-CYCLIC (2) on three processors.
Table 1-1 18 elements and 3 logical processors
Global index:      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
BLOCK:            P0 P0 P0 P0 P0 P0 P1 P1 P1 P1 P1 P1 P2 P2 P2 P2 P2 P2
CYCLIC:           P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2
BLOCK-CYCLIC(2):  P0 P0 P1 P1 P2 P2 P0 P0 P1 P1 P2 P2 P0 P0 P1 P1 P2 P2
BLOCK-CYCLIC(3):  P0 P0 P0 P1 P1 P1 P2 P2 P2 P0 P0 P0 P1 P1 P1 P2 P2 P2
Table 1-2 BLOCK-CYCLIC(1) to BLOCK-CYCLIC(2) on 3 processors
Global index:      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
BLOCK-CYCLIC(1):  P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2
BLOCK-CYCLIC(2):  P0 P0 P1 P1 P2 P2 P0 P0 P1 P1 P2 P2 P0 P0 P1 P1 P2 P2
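The regular mappings in Tables 1-1 and 1-2 can be written as simple owner-computation formulas. The following Python sketch is an illustration (not code from the thesis) and assumes 0-based global indices:

```python
# Sketch (not thesis code): owner computation for the three regular
# HPF distributions over a 1-D array, with 0-based global indices.

def block_owner(i, n_elems, n_procs):
    """BLOCK: contiguous chunks of ceil(n_elems / n_procs) elements."""
    chunk = -(-n_elems // n_procs)  # ceiling division
    return i // chunk

def cyclic_owner(i, n_procs):
    """CYCLIC: element i goes to processor i mod n_procs."""
    return i % n_procs

def block_cyclic_owner(i, n, n_procs):
    """BLOCK-CYCLIC(n): blocks of n elements dealt out round-robin."""
    return (i // n) % n_procs

# Reproducing the rows of Table 1-1 (18 elements, 3 processors):
print([block_owner(i, 18, 3) for i in range(18)])
print([cyclic_owner(i, 3) for i in range(18)])
print([block_cyclic_owner(i, 2, 3) for i in range(18)])
```

The last line reproduces the BLOCK-CYCLIC(2) row, which is also the target distribution of Table 1-2.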
In many situations, data cannot be distributed evenly. Hence, irregular distribution has received increasing attention. Irregular distribution uses user-defined functions to specify an uneven data distribution. High Performance Fortran version 2 (HPF-2) provides the GEN_BLOCK data distribution directive, which maps generalized, unequal-size consecutive segments of an array onto consecutive processors. This makes it possible for each processor to handle an appropriate quantity of data according to its computation capability. The data redistribution is often irregular and changes at run time. Each processor must know which local elements of the array to send to which processors and, at the same time, which elements of the array it will receive from which processors.
For regular array redistribution, Hsu et al. [14] proposed an Optimal Processor Mapping (OPM) scheme to minimize the data transmission cost of general BLOCK-CYCLIC data realignment. OPM uses a maximum matching of the realigned logical processors to maximize data hits and thereby reduce the amount of data exchanged. For irregular array redistribution, Guo et al. [22] proposed a divide-and-conquer algorithm that obtains near-optimal schedules while minimizing both the total communication message size and the number of steps. Table 1-3 presents an irregular data redistribution with S = {6, 8, 4} and D = {4, 5, 9} on three processors.
Table 1-3 The irregular data redistribution
Global index:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
Source:       P0 P0 P0 P0 P0 P0 P1 P1 P1 P1 P1 P1 P1 P1 P2 P2 P2 P2
Destination:  P0 P0 P0 P0 P1 P1 P1 P1 P1 P2 P2 P2 P2 P2 P2 P2 P2 P2
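The messages of a GEN_BLOCK redistribution such as Table 1-3 can be derived by intersecting the index intervals of source and destination processors. The following Python sketch is an illustration of this derivation, not code from the thesis:

```python
# Sketch (not thesis code): derive the redistribution messages of two
# GEN_BLOCK distributions by intersecting processor index intervals.

def gen_block_messages(src_sizes, dst_sizes):
    """Return {(sp, dp): size} for every overlapping source/destination
    segment pair; sp/dp are source/destination processor ranks."""
    def bounds(sizes):
        out, start = [], 0
        for s in sizes:
            out.append((start, start + s))  # half-open interval [start, end)
            start += s
        return out
    msgs = {}
    for sp, (s0, s1) in enumerate(bounds(src_sizes)):
        for dp, (d0, d1) in enumerate(bounds(dst_sizes)):
            overlap = min(s1, d1) - max(s0, d0)
            if overlap > 0:
                msgs[(sp, dp)] = overlap
    return msgs

# S = {6, 8, 4} and D = {4, 5, 9} from Table 1-3:
print(gen_block_messages([6, 8, 4], [4, 5, 9]))
# {(0, 0): 4, (0, 1): 2, (1, 1): 3, (1, 2): 5, (2, 2): 4}
```

For example, SP0 owns global indices 1-6 and DP1 owns 5-9, so SP0 sends DP1 the two elements 5 and 6, matching Table 1-3.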
In this thesis, we present the smallest-conflict-points algorithm (SCPA) to efficiently perform GEN_BLOCK array redistribution. The main idea of SCPA is to first schedule the conflicting messages of the maximum-degree processors, so that the schedule satisfies the minimal number of steps and the minimal total message size of steps, and then to arrange the remaining messages into the schedule in non-increasing order of message size. SCPA can effectively reduce the communication time of the data redistribution process. It is not only an optimal algorithm in terms of the minimal number of steps, but also a near-optimal algorithm with respect to the minimal total message size of steps.
The rest of this thesis is organized as follows. Chapter 2 gives a brief survey of related work. Chapter 3 introduces the communication model of irregular data redistribution and gives an example of GEN_BLOCK array redistribution as a preliminary. Chapter 4 presents the smallest-conflict-points algorithm for the irregular redistribution problem. Performance analysis and simulation results are presented in Chapter 5. Finally, conclusions are given in Chapter 6.
Chapter 2 Related Work
Many data redistribution techniques have been proposed in the literature. They are usually developed for regular or irregular problems [3], either as multi-computer compiler techniques or as runtime support techniques.
2.1 Regular redistribution
In general, techniques for regular data redistribution fall into two categories: communication-set identification and communication optimization. The former includes PITFALLS [19] and ScaLAPACK [18] for index-set generation; Park et al. [17] proposed algorithms for BLOCK-CYCLIC data redistribution between processor sets; Dongarra et al. [16] and Bandera et al. [1] presented algorithmic redistribution methods for BLOCK-CYCLIC decompositions and parallel sparse redistribution code for BLOCK-CYCLIC data redistribution based on the CRS structure; and the Generalized Basic-Cycle Calculation method was presented in [13].
Techniques in the communication-optimization category provide different approaches to reduce the communication overhead [8, 11, 15] of a redistribution operation: the multiphase redistribution strategy [9] for reducing message startup cost, the communication scheduling approaches [2, 6, 12, 23] for avoiding node contention, and the strip-mining approach [20] for overlapping communication and computation.
2.2 Irregular redistribution
For the irregular array redistribution problem, some works have concentrated on indexing and message generation, while others have addressed communication efficiency. Guo et al. [7] presented a symbolic analysis method for communication-set generation that reduces the communication cost of irregular array redistribution. On communication efficiency, Lee et al. [11] presented a logical processor reordering algorithm for irregular array redistribution. Guo et al. [21, 22] proposed a divide-and-conquer algorithm for performing irregular array redistribution; in this method, communication messages are first divided into groups using Neighbor Message Sets (NMSs), i.e., sets of messages that have the same sender or receiver, and the communication steps are scheduled after those NMSs are merged according to their contention relationships. Yook and Park [23] presented a relocation algorithm; however, it may incur high scheduling overhead and degrade the performance of a redistribution algorithm.
In this thesis, we present an efficient algorithm that uses the smallest set of conflict points, improving on the divide-and-conquer algorithm. The divide-and-conquer algorithm uses the neighbor message set (NMS), which is a set of more than one message sent from the same processor or received by the same processor. A message included in two different NMSs is defined as a link message.
Chapter 3 Preliminaries and Redistribution communication models
3.1 Divide and Conquer Algorithm
The divide-and-conquer algorithm generates a sub-optimal communication scheduling table, attempting to satisfy both the minimal-size condition and the minimal-step condition. It consists of three parts: break the problem into a set of sub-problems that are similar to the original problem but smaller; solve the sub-problems recursively; and combine the solutions of the sub-problems into a solution of the original problem. Each NMS is a sub-problem unit, and two neighboring NMSs are grouped as a pair. After the pairs of NMSs are obtained, a merge process produces the schedule. First, neighboring NMS pairs are merged recursively. Then the scheduling tables for each group are generated. If a conflict is detected, the link message and the messages scheduled with it are picked up, the remaining messages are sorted, and finally the link message is put back into the schedule together with those scheduled messages. The algorithm is described below.
Divide-and-Conquer Algorithm:
{
    Separate NMS1, NMS2, ..., NMSk;        // k: total number of NMSs
    // First phase: separation and grouping
    for (i = 0; i < k; i += 2) {
        pick the left-down link message out and put it aside;
        tmpMax(i) = Join(NMSi+1, NMSi+2);
        add the left-down link message to tmpMax(i);
        update S(p, q);
    }
    // Second phase: recursive merging
    j = 2;
    while (j < k) {
        for (i = j; i < k; i += 2 * j) {
            if (Order(i in NMSi) != Order(i in NMSi+1))
                Merge(sub-matrix from M(i-j) to M(i+j));
        }
        j = 2 * j;
    }
}

Merging Algorithm (subroutine):
Range: sub-matrix from M(i-j) to M(i+j); Location: link message i
{
    tmpOrder = 0;
    while (1) {
        // tmpNum = number of elements in link message i's row or column
        // with tmpOrder; if any element is larger than link message i, tmpNum++;
        if (tmpOrder == Max(number of messages in i's row,
                            number of messages in i's column)) {
            S[p[i]][q[i]] = tmpOrder;
            break;
        }
        else if (tmpNum == 0) {
            S[p[i]][q[i]] = tmpOrder;
            break;
        }
        else if (tmpNum == 1) {
            relocate with minimum cost among {
                order(i) == Order(i in NMSi+1),
                order(i) == Order(i in NMSi),
                order(i) when relocating link message i's row and column
            };
            update S(x, y) in sub-matrix from M(i-j) to M(i+j);
            break;
        }
        tmpOrder++;
    }
}
3.2 Communication models
Data redistribution is a set of routines that transfer all the elements from a set of source processors S to a set of destination processors T. The message sizes are specified by user-defined random integer values for the array mapping from source processors to destination processors. Since node contention considerably influences performance, a processor can send a message to only one other processor in each communication step; by the same rule, a processor can receive a message from only one other processor per step.
To simplify the presentation, the notations and terminologies used in this thesis are defined as follows.
Definition 1: GEN_BLOCK redistribution on a one-dimensional array A[1:N] over P processors. The source processors are denoted SPi and the destination processors DPj, where 0 ≤ i, j ≤ P-1.
Definition 2: The redistribution time is separated into the startup time, denoted ts, and the communication (transmission) time, denoted tcomm.
Definition 3: To satisfy the minimum-steps condition while each processor sends/receives at most one message per step, some messages cannot be scheduled in the same communication step; such a set of messages is called a conflict tuple [22].
Data redistribution can be implemented by two kinds of methods: non-blocking and blocking scheduling algorithms. A non-blocking scheduling algorithm is faster than a blocking one, but it needs more buffers and careful synchronization control. In this thesis, we focus on blocking scheduling algorithms.
Unlike regular redistribution, irregular data redistribution has no cyclic message-passing pattern, and the message transmission links do not overlap. Hence the total number of message links N satisfies numprocs ≤ N ≤ 2 × numprocs − 1, where numprocs is the number of processors.
Table 3-1 shows an example of redistributing two GEN_BLOCK distributions on an array A[1:101]. The communications between the source and destination processor sets are depicted in Figure 3-1. There are fifteen communication messages in total, m1, m2, m3, ..., m15, among the processors involved in the redistribution. In this example, {m2, m3, m4} is a conflict tuple since these messages have the common source processor SP1; {m7, m8, m9} is also a conflict tuple because of the common destination processor DP4. The maximum degree in this example is 3. Table 3-2 shows a simple schedule for this example.
Table 3-1 An example of source and destination distributions in irregular array redistribution
Source distribution
Source Processor
SP SP0 SP1 SP2 SP3 SP4 SP5 SP6 SP7
Size 12 20 15 14 11 9 9 11
Destination distribution
Destination Processor
DP DP0 DP1 DP2 DP3 DP4 DP5 DP6 DP7
Size 17 10 13 6 17 12 11 15
Figure 3-1 The communications between source and destination processor sets
Table 3-2 A simple scheduling
Schedule Table
Step 1 m2 m5 m9 m12 m14
Step 2 m1 m3 m6 m8 m11 m15
Step 3 m4 m7 m10 m13
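The contention rule behind Definition 3 and the schedule of Table 3-2 can be checked mechanically: within each step, no processor may appear twice as a sender or twice as a receiver. The following Python sketch is an assumed helper, not thesis code, and its toy src/dst maps are illustrative:

```python
# Sketch (assumed helper, not thesis code): verify that a schedule is
# contention-free, i.e. in every step each processor sends at most one
# message and receives at most one message.

def is_contention_free(schedule, src, dst):
    """schedule: list of steps, each a list of message ids.
    src/dst: message id -> sending / receiving processor rank."""
    for step in schedule:
        senders = [src[m] for m in step]
        receivers = [dst[m] for m in step]
        if len(senders) != len(set(senders)) or len(receivers) != len(set(receivers)):
            return False
    return True

# Toy fragment: m2, m3, m4 share the source processor SP1, so they form a
# conflict tuple and can never share a communication step.
src = {"m2": 1, "m3": 1, "m4": 1, "m5": 2}
dst = {"m2": 1, "m3": 2, "m4": 2, "m5": 3}

print(is_contention_free([["m2", "m5"], ["m3"], ["m4"]], src, dst))  # True
print(is_contention_free([["m2", "m3"]], src, dst))                  # False
```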
3.3 Explicit conflict tuple and implicit conflict tuple
[Figure 3-1 content, recovered from residue: the message sizes of m1 to m15 are 12, 5, 10, 5, 8, 6, 1, 14, 2, 9, 3, 6, 5, 4, 11.]
The communication time of a message passing operation is modeled with two parameters: the startup time ts and the unit data transmission time tm. The startup time is paid once for each communication event and is independent of the message size to be communicated. The data transmission time is proportional to the message size, size(m). The communication time of a message m is therefore

    tcomm(m) = ts + size(m) × tm                          (1)

The communication time of one communication step, tcomm(stepi), is the maximum communication time over the messages in that step:

    tcomm(stepi) = max{ tcomm(m) : m ∈ stepi }            (2)

The total communication time of all steps, tcomm(total), is the sum of the communication times of the individual steps:

    tcomm(total) = Σ_{0 < i ≤ k} tcomm(stepi)             (3)

The length of these steps determines the data transmission overhead. The minimum number of steps equals the maximum degree k; a message that cannot be put into any of the k minimum steps must be related to a processor that has the maximum number of transmission links. Figure 3-2 shows the maximum degree of Figure 3-1: SP1, SP2 and DP4 have the maximum degree (k = 3) over messages m2 to m9. Since each processor can send/receive at most one message to/from another processor in each communication step, we first concentrate on the messages of these maximum-degree processors, grouped into Maximum Degree Message Sets (MDMSs), as shown in Figure 3-2. If the messages in the MDMSs can be put into k steps with no conflict, the other messages, which belong to processors of degree less than the maximum, are easier to put into the remaining slots without increasing the number of steps.
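Equations (1)-(3) can be evaluated directly for a given schedule. The Python sketch below is illustrative (the ts and tm values are assumptions, not measurements from the thesis):

```python
# Sketch of the cost model of Eqs. (1)-(3): per-message cost ts + size*tm,
# per-step cost is the maximum message cost in the step, total cost is the
# sum over all steps. ts and tm values are illustrative assumptions.

def schedule_cost(steps, ts=1.0, tm=0.01):
    """steps: list of communication steps, each a list of message sizes."""
    total = 0.0
    for step in steps:
        # Eq. (2): the slowest message dominates the step;
        # each term ts + size * tm is Eq. (1).
        total += max(ts + size * tm for size in step)
    return total  # Eq. (3): sum over the k steps

# Three steps; the largest message in each step (10, 8, 5) sets its length.
print(schedule_cost([[10, 4], [8, 3], [5]]))
```

This also shows why balancing large messages across steps matters: only the largest message per step contributes to the total.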
We say a message is an explicit conflict point if it belongs to two MDMSs. There exists at most one explicit conflict point between two MDMSs. In Figure 3-2, m7 is an explicit conflict point since it belongs to the two MDMSs {m5, m6, m7} and {m7, m8, m9}. On the other hand, two neighboring MDMSs may not contain the same message, yet each may contain a message sent by the same processor, or received by the same processor; we call such a message an implicit conflict point. As shown in Figure 3-3, m4 and m5 are contained in different MDMSs. DP2 receives only the two messages m4 and m5, so they cannot form an MDMS of their own; but since m4 and m5 are owned by different MDMSs while sharing a receiver, m4 is an implicit conflict point. Although m5 is also covered by two MDMSs, it is restricted by m4, and hence m5 will not cause a conflict. Figure 3-5 depicts all MDMSs for the example shown in Figure 3-1.
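The definitions above can be sketched in code. The following Python fragment (assumed helpers, not thesis code) builds only the maximum-degree message sets of a small message graph and reports the messages shared by two MDMSs, i.e. the explicit conflict points; the toy message set mimics the {m5, m6, m7} / {m7, m8, m9} example:

```python
# Sketch (assumed helpers, not thesis code): an MDMS here is the message set
# of a processor whose degree equals the maximum degree; an explicit conflict
# point is a message that belongs to two MDMSs.

from collections import defaultdict

def find_mdmss(messages):
    """messages: {id: (src, dst)}. Returns the MDMSs as sorted id lists."""
    by_proc = defaultdict(list)
    for mid, (s, d) in messages.items():
        by_proc[("S", s)].append(mid)   # messages sent by source s
        by_proc[("D", d)].append(mid)   # messages received by destination d
    k = max(len(v) for v in by_proc.values())  # maximum degree
    return [sorted(v) for v in by_proc.values() if len(v) == k]

def explicit_conflict_points(mdmss):
    """Messages shared by any two MDMSs."""
    points = set()
    for i in range(len(mdmss)):
        for j in range(i + 1, len(mdmss)):
            points |= set(mdmss[i]) & set(mdmss[j])
    return points

# Toy data: SP2 sends {m5, m6, m7}, DP4 receives {m7, m8, m9}; both have
# the maximum degree 3 and share m7, so m7 is an explicit conflict point.
msgs = {"m5": (2, 3), "m6": (2, 3), "m7": (2, 4), "m8": (3, 4), "m9": (5, 4)}
print(explicit_conflict_points(find_mdmss(msgs)))  # {'m7'}
```

Implicit conflict points need the extra neighbor-sharing check described in the text and are omitted from this sketch.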
Figure 3-2 Maximum Degree Messages Set
Figure 3-3 Explicit conflict point
Figure 3-4 Implicit conflict point
Chapter 4 Scheduling Algorithm
The main goal of irregular array redistribution is to minimize the number of communication steps as well as the total message size of steps. The conflict points chosen in the divide-and-conquer algorithm [22] are the messages covered by two NMSs, but some of those messages do not really cause conflicts. Therefore, we select only the smallest set of conflict points that really cause conflicts, to loosen the schedule constraints and minimize the total message size of the schedule.
The smallest conflict points algorithm consists of four parts:
(1) Pick out the MDMSs from the given data redistribution problem.
(2) Find the explicit conflict points and implicit conflict points, and schedule all conflict points into the same schedule step.
(3) Select messages in the MDMSs in non-increasing order of message size. Schedule each message into the step whose messages have similar size, keeping the property that each processor sends/receives at most one message to/from a processor per step. Repeat until no MDMS messages are left.
(4) Schedule the messages that do not belong to any MDMS in non-increasing order of message size. Repeat until no messages are left.
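Steps (2)-(4) above can be approximated by a short greedy sketch, assuming the conflict points have already been identified by steps (1)-(2). This is a simplified reading of SCPA, not the thesis implementation: for brevity it places each remaining message into the first contention-free step rather than matching step sizes, and the message data below are illustrative, not the Figure 3-1 values:

```python
# Simplified SCPA sketch (not thesis code): pin the given conflict points
# together into the first step, then place the remaining messages in
# non-increasing size order into the first step whose sender and receiver
# slots are still free, opening a new step only when necessary.

def scpa_schedule(messages, conflict_points):
    """messages: {id: (src, dst, size)}. Returns a list of steps (id lists)."""
    steps = [list(conflict_points)]  # all conflict points share one step
    rest = sorted((m for m in messages if m not in conflict_points),
                  key=lambda m: -messages[m][2])  # non-increasing size
    for m in rest:
        s, d, _ = messages[m]
        for step in steps:
            # contention check: sender s and receiver d must both be free
            if all(messages[x][0] != s and messages[x][1] != d for x in step):
                step.append(m)
                break
        else:
            steps.append([m])
    return steps

msgs = {
    "m2": (1, 1, 5), "m3": (1, 2, 10), "m4": (1, 2, 5),
    "m5": (2, 2, 8), "m6": (2, 3, 6), "m7": (2, 4, 1),
    "m8": (3, 4, 14), "m9": (4, 4, 2),
}
print(scpa_schedule(msgs, ["m4", "m7"]))
# [['m4', 'm7'], ['m8', 'm3', 'm6'], ['m5', 'm2', 'm9']]
```

With maximum degree 3 in this toy instance, the greedy placement still finishes in three steps.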
From Figure 4-1, we can pick out four MDMSs: MDMS1 = {m2, m3, m4}, MDMS2 = {m4, m5}, MDMS3 = {m5, m6, m7} and MDMS4 = {m7, m8, m9}. We first schedule m4 and m7 into the same step. Then we schedule the messages in the MDMSs in non-increasing order of message size: m8, m3, m5, m6, m2, m9. After that, we schedule the remaining messages, which do not belong to any MDMS, in non-increasing order of message size: m1, m15, m10, m12, m13, m14, m11. Figure 4-2 shows the final schedule obtained from the smallest conflict points algorithm.
The smallest conflict points algorithm is given as follows.
========================================================
Pick out MDMSs Algorithm:
{
    j = 0;
    // numprocs: total number of processors
    for (i = 0; i < numprocs; i++) {
        if (degree of i == maxdegree) {
            pick i's messages into MDMSj;
            j++;
        }
        else {
            if (i's messages are also owned by two MDMSs) {
                pick i's messages into MDMSj;
                j++;
            }
        }
    }
}

Find conflict point algorithm:
{
    // m: the count of the MDMSs
    for (i = 0; i < m-1; ) {
        if (MDMSi has the same message as MDMSi+1) {
            this same message is a conflict point;
            i += 2;
        }
        else
            i++;
    }
}
========================================================
Figure 4-1 Results of MDMSs for Figure 3-1 (MDMS messages m2-m9; partial schedule: S1: m8 m3 m5, S2: m6 m2 m9, S3: m4 m7)
Figure 4-2 The schedule obtained from SCPA:
S1: m8 m3 m5 m1 m15 m10 m12
S2: m6 m2 m9 m13 m11
S3: m4 m7 m14
Chapter 5 Performance Evaluation and Analysis
To evaluate the performance of the proposed method, we have implemented SCPA along with the divide-and-conquer algorithm [22]. The performance simulation is discussed for two classes, even GEN_BLOCK and uneven GEN_BLOCK distributions. In an even GEN_BLOCK distribution, each processor owns a similar amount of data. In contrast, in an uneven distribution a few processors might be allocated large volumes of data. Since array elements can be concentrated on some specific processors, those processors are also likely to have the maximum degree of communications.
The simulation program generates a set of random integers as the message sizes. We set the number of source processors equal to the number of target processors so that no processor is left without messages. Moreover, the total message size of the source processors equals that of the target processors, keeping the balance between source and target processors.
To evaluate the two algorithms fairly, both programs were written in the single-program multiple-data (SPMD) paradigm with MPI and executed on an SMP/Linux cluster consisting of 24 SMP nodes. In the following figures, "SCPA Better" represents the percentage of events in which SCPA has lower total computation (communication) time than the divide-and-conquer algorithm, while "DCA Better" gives the reverse situation; "The Same Results" counts the events in which both algorithms have the same total computation (communication) time. In the uneven distribution, the upper bound of the message size is set to (totalsize/numprocs)*1.5 and the lower bound to (totalsize/numprocs)*0.3, where totalsize is the total size of the messages and numprocs is the number of processors. In the even distribution, the upper bound is (totalsize/numprocs)*1.3 and the lower bound is (totalsize/numprocs)*0.7.
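The test-data generation described above can be sketched as follows. This is a Python illustration under the stated bounds; the rescaling details are assumptions, not the thesis implementation:

```python
# Sketch (assumed implementation): random GEN_BLOCK segment sizes bounded by
# (total/numprocs)*lower and (total/numprocs)*upper, then rescaled so the
# segments sum exactly to the requested total.

import random

def random_gen_block(total, numprocs, lower, upper, seed=None):
    rng = random.Random(seed)
    avg = total / numprocs
    sizes = [rng.randint(int(avg * lower), int(avg * upper))
             for _ in range(numprocs)]
    # Rescale, then push the rounding remainder onto the last segment so
    # the distribution sums exactly to `total`.
    scale = total / sum(sizes)
    sizes = [max(1, int(s * scale)) for s in sizes]
    sizes[-1] += total - sum(sizes)
    return sizes

src = random_gen_block(10000, 8, 0.3, 1.5, seed=1)  # uneven bounds
dst = random_gen_block(10000, 8, 0.7, 1.3, seed=2)  # even bounds
print(src, sum(src))
print(dst, sum(dst))
```

Matching totals on both sides keeps the source and target distributions balanced, as required by the simulation setup.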
The total message size is 1M. Figure 5-1 to Figure 5-4 show the simulation results of both the smallest-conflict-points algorithm (SCPA) and the divide-and-conquer algorithm with different numbers of processors and total message sizes. The number of processors ranges from 8 to 24. We can observe that SCPA has better performance than the divide-and-conquer algorithm on uneven data redistribution.
From the even-case data in Figure 5-3 and Figure 5-4, we can observe that SCPA performs even better than in the uneven case. Figure 5-3 and Figure 5-4 also illustrate that SCPA outperforms divide-and-conquer in at least 85% of the events for any total message size and any number of processors. Figure 5-5 to Figure 5-8 depict the communication time of SCPA and the divide-and-conquer algorithm against the number of processors and the total message size in the uneven (even) cases. The communication cost (time) is measured on real message exchanges, so it depends on the bandwidth, I/O, and other factors of the real execution environment. In both even and uneven cases, SCPA performs slightly better than the divide-and-conquer algorithm.
Figure 5-1 The events percentage of computing time is plotted with different numbers of processors (The Same Results / SCPA Better / DCA Better, uneven data set)
Figure 5-2 The events percentage of computing time is plotted with different total message sizes on 8 processors (uneven data set)
Figure 5-3 The events percentage of computing time is plotted with different numbers of processors (even data set)
Figure 5-4 The events percentage of computing time is plotted with different total message sizes on 8 processors (even data set)
Figure 5-5 The events percentage of communication time is plotted with different numbers of processors (SCPA Better / DCA Better, uneven data set)
Figure 5-6 The events percentage of communication time is plotted with different total message sizes on 8 processors (uneven data set)
Figure 5-7 The events percentage of communication time is plotted with different numbers of processors (even data set)
Figure 5-8 The events percentage of communication time is plotted with different total message sizes on 8 processors (even data set)
Figure 5-9 and Table 5-1 show the maximum degree for different numbers of nodes in the uneven cases; Figure 5-10 and Table 5-2 show the same for the even cases. They show that as the number of processors increases, the maximum degree grows; and as the maximum degree grows, the complexity of scheduling increases.
Figure 5-9 Maximum degree of (3-15) redistribution (event percentage vs. number of processors, maxdegree = 1 to 6)
Table 5-1 The detail of the maximum degree in Figure 5-9
Node=4 Node=8 Node=12 Node=16 Node=20 Node=24
Maxdegree=1 0 0 0 0 0 0
Maxdegree=2 0 0 0 0 0 0
Maxdegree=3 4479 825 126 15 3 0
Maxdegree=4 5474 8560 8632 8116 7546 6993
Maxdegree=5 47 614 1240 1862 2434 2988
Maxdegree=6 0 1 2 7 17 19
Figure 5-10 Maximum degree of (7-13) redistribution (event percentage vs. number of processors, maxdegree = 1 to 6)
Table 5-2 The detail of the maximum degree in Figure 5-10
Node=4 Node=8 Node=12 Node=16 Node=20 Node=24
Maxdegree=1 0 0 0 0 0 0
Maxdegree=2 0 0 0 0 0 0
Maxdegree=3 5009 2425 1337 739 408 213
Maxdegree=4 4991 7575 8663 9261 9592 9787
Maxdegree=5 0 0 0 0 0 0
Chapter 6 Conclusion
In this thesis, we have presented an efficient scheduling algorithm, the smallest conflict points algorithm (SCPA), for irregular data redistribution. The algorithm uses the smallest set of conflict points to give the schedule more flexibility. First, it obtains the maximum degree message sets (MDMSs). Second, it picks out the explicit conflict points and implicit conflict points and puts the conflict points into the same step. Third, it places the messages of the MDMSs in decreasing order of message length. Finally, it places the non-MDMS messages by the same method as the third step. The algorithm can effectively reduce communication time in the process of data redistribution. SCPA is not only an optimal algorithm in terms of the minimal number of steps, but also a near-optimal algorithm with respect to the minimal total message size of steps. The proposed method not only avoids node contention but also shortens the overall communication length.
In the experiments, SCPA was better than DCA in at least 80% of the uneven cases and in at least 89% of the even cases, and its advantage over DCA grows as the number of processors increases.
To verify the performance of our proposed algorithm, we have implemented SCPA as well as the divide-and-conquer redistribution algorithm. The experimental results show that SCPA is better than DCA in both communication cost and the simulated schedules. The results also indicate that both algorithms perform well on GEN_BLOCK redistribution, but in most cases SCPA has better performance than the divide-and-conquer redistribution algorithm.