The main goal of irregular array redistribution scheduling is to minimize the number of communication steps as well as the total message size of the steps. The conflict points chosen in the Divide and Conquer Algorithm [19] are the messages covered by two NMSs, but some of those messages do not actually cause conflicts. Therefore, we select only the smallest conflict points that really cause conflicts, which relaxes the scheduling constraints and minimizes the total message size of the schedule.
The smallest conflict points algorithm consists of four parts:
(1) Pick out the MDMSs from the given data redistribution problem.
(2) Find the explicit conflict points and implicit conflict points, and schedule all conflict points into the same schedule step.
(3) Select the messages in the MDMSs in non-increasing order of message size. Schedule each message into the step whose message size is most similar, keeping the constraint that each processor sends/receives at most one message to/from each processor per step. Repeat this process until no MDMS messages are left.
(4) Schedule the messages that do not belong to any MDMS in non-increasing order of message size. Repeat this process until no messages are left.
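The greedy placement in steps (3) and (4) can be sketched as follows. This is a minimal illustration, not the thesis implementation: messages are assumed to be (source, target, size) tuples, and "most similar message size" is interpreted as the feasible step whose accumulated size is closest to the message being placed.

```python
# Greedy placement for steps (3) and (4): take messages in non-increasing
# order of size and put each into the step whose total size is closest,
# subject to each processor sending/receiving at most one message per step.
# Hypothetical sketch; names and data structures are assumptions.

def schedule_greedy(messages, steps):
    """messages: list of (src, dst, size) tuples.
    steps: list of dicts with 'msgs', 'senders', 'receivers', 'total'."""
    for src, dst, size in sorted(messages, key=lambda m: -m[2]):
        # keep only steps where this message causes no send/receive conflict
        feasible = [s for s in steps
                    if src not in s['senders'] and dst not in s['receivers']]
        if not feasible:  # open a new step when no existing one fits
            feasible = [{'msgs': [], 'senders': set(),
                         'receivers': set(), 'total': 0}]
            steps.append(feasible[0])
        # prefer the step whose accumulated size is closest to this message
        best = min(feasible, key=lambda s: abs(s['total'] - size))
        best['msgs'].append((src, dst, size))
        best['senders'].add(src)
        best['receivers'].add(dst)
        best['total'] += size

steps = []
schedule_greedy([(0, 0, 8), (0, 1, 5), (1, 1, 6), (1, 2, 4)], steps)
print([s['msgs'] for s in steps])
# → [[(0, 0, 8), (1, 1, 6)], [(0, 1, 5), (1, 2, 4)]]
```

In step (2) of the algorithm, the conflict points would simply be pre-placed into `steps` before calling the greedy routine.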
From Figure 4-1, we can pick out four MDMSs: MDMS1 = {m2, m3, m4}, MDMS2 = {m4, m5}, MDMS3 = {m5, m6, m7} and MDMS4 = {m7, m8, m9}. We first schedule the conflict points m4 and m7 into the same step. Then we schedule the remaining messages in the MDMSs in non-increasing order of message size: m8, m3, m5, m6, m2, m9. After that, we schedule the messages that do not belong to any MDMS in non-increasing order of message size: m1, m15, m10, m12, m13, m14, m11. Figure 4-2 shows the final schedule obtained from the smallest conflict points algorithm.
The smallest conflict points algorithm is given as follows.
========================================================
Pick out MDMSs algorithm:
{
  j = 0;
  // numprocs: total number of processors
  for (i = 0; i < numprocs; i++) {
    if (degree of i == maxdegree) {
      pick the messages owned by i into MDMSj;
      j++;
    }
    else {
      if (the messages owned by i are also owned by two MDMSs) {
        pick the messages owned by i into MDMSj;
        j++;
      }
    }
  }
}

Find conflict point algorithm:
{
  // m: the count of the MDMSs
  for (i = 0; i < m-1; ) {
    if (MDMSi has the same message as MDMSi+1) {
      this same message is a conflict point;
      i += 2;
    }
    else
      i++;
  }
}
========================================================
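The "find conflict point" routine above can be made runnable as the following sketch, assuming each MDMS is represented as a set of message identifiers (an illustrative encoding, not the thesis code):

```python
# Runnable sketch of the "find conflict point" routine: scan adjacent
# MDMSs; a message shared by two neighboring MDMSs is a conflict point,
# and once one is taken the pair is skipped (i += 2).

def find_conflict_points(mdms_list):
    conflict_points = []
    i = 0
    while i < len(mdms_list) - 1:
        shared = mdms_list[i] & mdms_list[i + 1]   # common messages
        if shared:
            conflict_points.extend(sorted(shared))
            i += 2          # skip this pair once its conflict point is taken
        else:
            i += 1
    return conflict_points

# The four MDMSs of Figure 4-1:
mdms = [{'m2', 'm3', 'm4'}, {'m4', 'm5'}, {'m5', 'm6', 'm7'},
        {'m7', 'm8', 'm9'}]
print(find_conflict_points(mdms))   # → ['m4', 'm7']
```

On the Figure 4-1 example this yields exactly m4 and m7, the two conflict points the text schedules into the same step.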
Figure 4-1 Results of MDMSs for Figure 3-1
(messages m2 to m9; the MDMS messages are scheduled as S1: m8 m3 m5, S2: m6 m2 m9, S3: m4 m7)
Figure 4-2 The schedule obtained from SCPA
S1: m8 m3 m5 m1 m15 m10 m12
S2: m6 m2 m9 m13 m11
S3: m4 m7 m14
Chapter 5 Performance Evaluation and Analysis
To evaluate the performance of the proposed methods, we have implemented SCPA along with the divide-and-conquer algorithm [19]. The performance simulation is discussed in two classes, even GEN_BLOCK and uneven GEN_BLOCK distributions. In an even GEN_BLOCK distribution, each processor owns a similar amount of data. In contrast, in an uneven distribution a few processors might be allocated a large volume of data. Since array elements can be concentrated on some specific processors, those processors are also likely to have the maximum degree of communication.
The simulation program generates a set of random integers as the message sizes. We set the number of source processors equal to the number of target processors, so that no processor is left without a message. Moreover, the total message size of the source processors equals that of the target processors, to keep the balance between the source and target sides.
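The generation constraints above can be sketched as follows. This is an assumption about the simulator's structure, not its actual code: block sizes are drawn uniformly within the stated bounds and then rescaled so the source and target totals match exactly (after rescaling, sizes stay only approximately within the bounds).

```python
import random

# Sketch of GEN_BLOCK generation under the stated constraints: equal
# numbers of source and target processors, every processor owning data,
# and equal totals on both sides. Names are hypothetical.

def gen_block(numprocs, totalsize, low=0.3, up=1.5, seed=0):
    """Return per-processor block sizes summing to totalsize, each drawn
    from [low, up] * (totalsize / numprocs) before rescaling."""
    rng = random.Random(seed)
    avg = totalsize / numprocs
    sizes = [rng.uniform(low * avg, up * avg) for _ in range(numprocs)]
    scale = totalsize / sum(sizes)        # rescale so the totals match
    return [s * scale for s in sizes]

src = gen_block(8, 1 << 20, seed=1)      # source-side GEN_BLOCK
dst = gen_block(8, 1 << 20, seed=2)      # target-side GEN_BLOCK
assert abs(sum(src) - sum(dst)) < 1e-6   # balanced source/target totals
```

The uneven case would use the bounds 0.3 and 1.5 as above, and the even case 0.7 and 1.3, matching the bounds given below.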
To fairly evaluate the performance of these two algorithms, both programs were written in the single program multiple data (SPMD) programming paradigm with MPI and executed on an SMP/Linux cluster consisting of 24 SMP nodes. In the following figures, "SCPA Better" represents the percentage of events in which SCPA has a lower total computation (communication) time than the divide-and-conquer algorithm, while "DCA Better" gives the reverse situation. If both algorithms have the same total computation (communication) time, the event is counted under "The Same Results". In the uneven distribution, the upper bound of the message size is set to (totalsize/numprocs)*1.5 and the lower bound to (totalsize/numprocs)*0.3, where totalsize is the total size of the messages and numprocs is the number of processors. In the even distribution, the upper bound is set to (totalsize/numprocs)*1.3 and the lower bound to (totalsize/numprocs)*0.7. The total message size is 1M. Figure 5-1 to Figure 5-4 show the simulation results of both the smallest conflict points algorithm (SCPA) and the divide-and-conquer algorithm with different numbers of processors and total message sizes. The number of processors ranges from 8 to 24. We can observe that SCPA performs better than the divide-and-conquer algorithm on uneven data redistribution.
From the Figure 5-3 and Figure 5-4 data for the even case, we can observe that SCPA performs even better than in the uneven case. Figure 5-3 and Figure 5-4 also illustrate that SCPA is superior to divide-and-conquer in at least 85% of the events, for any total message size and any number of processors. Figure 5-5 to Figure 5-8 depict the communication time of SCPA and the divide-and-conquer algorithm against the number of processors and the total message size in the uneven (even) case. The communication cost (time) is measured from the real message exchange, so it depends on the bandwidth, I/O, and other factors of the real execution environment. In both the even and uneven cases, SCPA performs slightly better than the divide-and-conquer algorithm.
[Figures 5-1 to 5-8: bar charts of event percentage (%) versus total message size (10000 to 50000) or number of processors (8 to 24); legend: "The Same Results", "SCPA Better", "DCA Better".]
Figure 5-1 The event percentage of computing plotted for different numbers of processors
Figure 5-2 The event percentage of computing plotted for different total message sizes on 8 processors, on the uneven data set
Figure 5-3 The event percentage of computing plotted for different numbers of processors
Figure 5-4 The event percentage of computing plotted for different total message sizes on 8 processors, on the even data set
Figure 5-5 The event percentage of communication time plotted for different numbers of processors, on the uneven data set
Figure 5-6 The event percentage of communication time plotted for different total message sizes on 8 processors, on the uneven data set
Figure 5-7 The event percentage of communication time plotted for different numbers of processors
Figure 5-8 The event percentage of communication time plotted for different total message sizes on 8 processors, on the even data set
Figure 5-9 and Table 5-1 give the maximum degree for different numbers of nodes in the uneven case. Figure 5-10 and Table 5-2 give the maximum degree for different numbers of nodes in the even case. They show that as the number of processors rises, the maximum degree grows; and as the maximum degree grows, the complexity of scheduling increases.
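The degree of a processor is the number of messages it sends or receives during redistribution, and the maximum degree lower-bounds the number of communication steps. A small sketch of how it can be computed, with messages assumed to be (source, target) pairs (an illustrative representation):

```python
from collections import Counter

# Degree of a processor = number of messages it sends (as a source) or
# receives (as a target); the maximum degree over all processors bounds
# the minimum number of communication steps from below.

def max_degree(messages):
    deg = Counter()
    for src, dst in messages:
        deg[('send', src)] += 1    # messages sent by src
        deg[('recv', dst)] += 1    # messages received by dst
    return max(deg.values())

# More processors mean finer blocks, so each block tends to overlap more
# blocks on the other side, raising the maximum degree.
print(max_degree([(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]))  # → 2
```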
[Figure 5-9: stacked bar chart of the event percentage (%) of each maximum degree (maxdegree = 1 to 6) against the number of processors (4 to 24).]
Figure 5-9 Maximum degree of (3-15) redistribution
Table 5-1 The detail of the maximum degree in Figure 5-9
Node=4 Node=8 Node=12 Node=16 Node=20 Node=24
Maxdegree=1 0 0 0 0 0 0
Maxdegree=2 0 0 0 0 0 0
Maxdegree=3 4479 825 126 15 3 0
Maxdegree=4 5474 8560 8632 8116 7546 6993
Maxdegree=5 47 614 1240 1862 2434 2988
Maxdegree=6 0 1 2 7 17 19
[Figure 5-10: stacked bar chart of the event percentage (%) of each maximum degree (maxdegree = 1 to 6) against the number of processors (4 to 24).]
Figure 5-10 Maximum degree of (7-13) redistribution
Table 5-2 The detail of the maximum degree in Figure 5-10
Node=4 Node=8 Node=12 Node=16 Node=20 Node=24
Maxdegree=1 0 0 0 0 0 0
Maxdegree=2 0 0 0 0 0 0
Maxdegree=3 5009 2425 1337 739 408 213
Maxdegree=4 4991 7575 8663 9261 9592 9787
Maxdegree=5 0 0 0 0 0 0
Chapter 6 Conclusion
In this thesis, we have presented an efficient scheduling algorithm, the smallest conflict points algorithm (SCPA), for irregular data redistribution. The algorithm uses the smallest conflict points to give the schedule more flexibility. First, it finds the maximum degree message sets (MDMSs). Second, it picks out the explicit and implicit conflict points and puts the conflict points in the same step. Third, it places the messages in the MDMSs in non-increasing order of message length. Finally, it places the messages not in any MDMS by the same method as the third step. The algorithm can effectively reduce the communication time of the data redistribution process. The smallest conflict points algorithm is not only optimal in terms of the minimal number of steps, but also near optimal with respect to the minimal total message size over all steps. The proposed method not only avoids node contention but also shortens the overall communication length.
From the experiments, SCPA was better than DCA in at least 80% of the uneven cases and in at least 89% of the even cases. As the number of processors grows, the simulation results favor SCPA over DCA even more.
To verify the performance of the proposed algorithm, we implemented SCPA as well as the divide-and-conquer redistribution algorithm. The experimental results show that SCPA is better than DCA in both communication cost and scheduling results. The results also indicate that both algorithms perform well on GEN_BLOCK redistribution; in most cases, SCPA performs better than the divide-and-conquer redistribution algorithm.
References
[1] G. Bandera and E.L. Zapata, “Sparse Matrix Block-Cyclic Redistribution,” Proceeding of IEEE Int'l. Parallel Processing Symposium (IPPS'99), San Juan, Puerto Rico, April 1999 Page(s):355 - 359.
[2] Frederic Desprez, Jack Dongarra and Antoine Petitet, “Scheduling Block-Cyclic Data redistribution,” IEEE Trans. on PDS, vol. 9, no. 2, pp. 192-205, Feb. 1998.
[3] Minyi Guo, “Communication Generation for Irregular Codes,” The Journal of Supercomputing, vol. 25, no. 3, pp. 199-214, 2003.
[4] Minyi Guo and I. Nakata, “A Framework for Efficient Array Redistribution on Distributed Memory Multicomputers,” The Journal of Supercomputing, vol. 20, no. 3, pp. 243-265, 2001.
[5] Minyi Guo, I. Nakata and Y. Yamashita, “Contention-Free Communication Scheduling for Array Redistribution,” Parallel Computing, vol. 26, no.8, pp. 1325-1343, 2000.
[6] Minyi Guo, I. Nakata and Y. Yamashita, “An Efficient Data Distribution Technique for Distributed Memory Parallel Computers,” JSPP'97, pp.189-196, 1997.
[7] Minyi Guo, Yi Pan and Zhen Liu, “Symbolic Communication Set Generation for Irregular Parallel Applications,” The Journal of Supercomputing, vol. 25, pp. 199-214, 2003.
[8] Edgar T. Kalns, and Lionel M. Ni, “Processor Mapping Technique Toward Efficient Data Redistribution,” IEEE Trans. on PDS, vol. 6, no. 12, pp. 1234-1247, December
[9] S. D. Kaushik, C. H. Huang, J. Ramanujam and P. Sadayappan, “Multiphase data redistribution: Modeling and evaluation,” Proceeding of IPPS’95, pp. 441-445, 1995.
[10] Peizong Lee, Academia Sinica, and Zvi Meir Kedem, “Automatic Data and Computation Decomposition on Distributed Memory Parallel Computers,” ACM Transactions on Programming Languages and systems, Vol 24, No. 1, pp. 1-50, January 2002.
[11] S. Lee, H. Yook, M. Koo and M. Park, “Processor reordering algorithms toward efficient GEN_BLOCK redistribution,” Proceedings of the ACM symposium on Applied computing, pp . 539-543, 2001.
[12] Y. W. Lim, Prashanth B. Bhat and Viktor and K. Prasanna, “Efficient Algorithms for Block-Cyclic Redistribution of Arrays,” Algorithmica, vol. 24, no. 3-4, pp. 298-330, 1999.
[13] C.-H Hsu, S.-W Bai, Y.-C Chung and C.-S Yang, “A Generalized Basic-Cycle Calculation Method for Efficient Array Redistribution,” IEEE TPDS, vol. 11, no. 12, pp. 1201-1216, Dec. 2000.
[14] Ching-Hsien Hsu and Kun-Ming Yu, “An Optimal Processor Replacement Scheme for Efficient Communication of Runtime Data Realignment,” pp. 268-273, 2004.
[15] C.-H Hsu, Dong-Lin Yang, Yeh-Ching Chung and Chyi-Ren Dow, “A Generalized Processor Mapping Technique for Array Redistribution,” IEEE Transactions on Parallel and Distributed Systems, vol. 12, vol. 7, pp. 743-757, July 2001.
[16] Antoine P. Petitet and Jack J. Dongarra, “Algorithmic Redistribution Methods for Block-Cyclic Decompositions,” IEEE Trans. on PDS, vol. 10, no. 12, pp. 1201-1216, Dec. 1999
[17] Neungsoo Park, Viktor K. Prasanna and Cauligi S. Raghavendra, “Efficient Algorithms for Block-Cyclic Data redistribution Between Processor Sets,” IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 12, pp. 1217-1240, Dec. 1999.
[18] L. Prylli and B. Tourancheau, “Fast runtime block cyclic data redistribution on multiprocessors,” Journal of Parallel and Distributed Computing, vol. 45, pp. 63-72, Aug. 1997.
[19] S. Ramaswamy, B. Simons, and P. Banerjee, “Optimization for Efficient Data redistribution on Distributed Memory Multicomputers,” Journal of Parallel and Distributed Computing, vol. 38, pp. 217-228, 1996.
[20] Akiyoshi Wakatani and Michael Wolfe, “Optimization of Data redistribution for Distributed Memory Multicomputers,” short communication, Parallel Computing, vol. 21, no. 9, pp. 1485-1490, September 1995.
[21] Hui Wang, Minyi Guo and Wenxi Chen, “An Efficient Algorithm for Irregular Redistribution in Parallelizing Compilers,” Proceedings of 2003 International Symposium on Parallel and Distributed Processing with Applications, LNCS 2745, 2003.