The main goal of irregular array redistribution scheduling is to minimize the number of communication steps as well as the total message size of the steps. The conflict points chosen in the Divide and Conquer Algorithm [19] are the messages covered by two NMSs, but some of those messages do not actually cause conflicts. Therefore, we select only the smallest conflict points that really cause conflicts, which relaxes the scheduling constraints and minimizes the total message size of the schedule.
The smallest conflict points algorithm consists of four parts:
(1) Pick out the MDMSs from the given data redistribution problem.
(2) Find the explicit conflict points and implicit conflict points, and schedule all conflict points into the same schedule step.
(3) Select the messages in the MDMSs in non-increasing order of message size. Schedule each message into the step whose message size is most similar, keeping the constraint that each processor sends/receives at most one message to/from each processor per step. Repeat this process until no MDMS messages are left.
(4) Schedule the messages that do not belong to any MDMS in non-increasing order of message size. Repeat this process until no messages are left.
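The greedy placement in steps (3) and (4) can be sketched as follows. This is a minimal illustration, not the thesis implementation: messages are assumed to be (source, target, size) tuples, and "most similar message size" is interpreted as the feasible step whose accumulated size is closest to the message being placed.

```python
# Greedy placement for steps (3) and (4): take messages in non-increasing
# order of size and put each into the step whose total size is closest,
# subject to each processor sending/receiving at most one message per step.
# Hypothetical sketch; names and data structures are assumptions.

def schedule_greedy(messages, steps):
    """messages: list of (src, dst, size) tuples.
    steps: list of dicts with 'msgs', 'senders', 'receivers', 'total'."""
    for src, dst, size in sorted(messages, key=lambda m: -m[2]):
        # keep only steps where this message causes no send/receive conflict
        feasible = [s for s in steps
                    if src not in s['senders'] and dst not in s['receivers']]
        if not feasible:  # open a new step when no existing one fits
            feasible = [{'msgs': [], 'senders': set(),
                         'receivers': set(), 'total': 0}]
            steps.append(feasible[0])
        # prefer the step whose accumulated size is closest to this message
        best = min(feasible, key=lambda s: abs(s['total'] - size))
        best['msgs'].append((src, dst, size))
        best['senders'].add(src)
        best['receivers'].add(dst)
        best['total'] += size

steps = []
schedule_greedy([(0, 0, 8), (0, 1, 5), (1, 1, 6), (1, 2, 4)], steps)
print([s['msgs'] for s in steps])
# → [[(0, 0, 8), (1, 1, 6)], [(0, 1, 5), (1, 2, 4)]]
```

In step (2) of the algorithm, the conflict points would simply be pre-placed into `steps` before calling the greedy routine.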
From Figure 4-1, we can pick out four MDMSs: MDMS1 = {m2, m3, m4}, MDMS2 = {m4, m5}, MDMS3 = {m5, m6, m7} and MDMS4 = {m7, m8, m9}. We first schedule the conflict points m4 and m7 into the same step. Then we schedule the remaining messages in the MDMSs in non-increasing order of message size: m8, m3, m5, m6, m2, m9. After that, we schedule the messages that do not belong to any MDMS in non-increasing order of message size: m1, m15, m10, m12, m13, m14, m11. Figure 4-2 shows the final schedule obtained from the smallest conflict points algorithm.
The smallest conflict points algorithm is given as follows.
========================================================
Pick out MDMSs algorithm:
{
  j = 0;
  // numprocs: total number of processors
  for (i = 0; i < numprocs; i++) {
    if (degree of i == maxdegree) {
      pick the messages owned by i into MDMSj;
      j++;
    }
    else {
      if (the messages owned by i are also owned by two MDMSs) {
        pick the messages owned by i into MDMSj;
        j++;
      }
    }
  }
}

Find conflict point algorithm:
{
  // m: the count of the MDMSs
  for (i = 0; i < m-1; ) {
    if (MDMSi has the same message as MDMSi+1) {
      this same message is a conflict point;
      i += 2;
    }
    else
      i++;
  }
}
========================================================
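The "find conflict point" routine above can be made runnable as the following sketch, assuming each MDMS is represented as a set of message identifiers (an illustrative encoding, not the thesis code):

```python
# Runnable sketch of the "find conflict point" routine: scan adjacent
# MDMSs; a message shared by two neighboring MDMSs is a conflict point,
# and once one is taken the pair is skipped (i += 2).

def find_conflict_points(mdms_list):
    conflict_points = []
    i = 0
    while i < len(mdms_list) - 1:
        shared = mdms_list[i] & mdms_list[i + 1]   # common messages
        if shared:
            conflict_points.extend(sorted(shared))
            i += 2          # skip this pair once its conflict point is taken
        else:
            i += 1
    return conflict_points

# The four MDMSs of Figure 4-1:
mdms = [{'m2', 'm3', 'm4'}, {'m4', 'm5'}, {'m5', 'm6', 'm7'},
        {'m7', 'm8', 'm9'}]
print(find_conflict_points(mdms))   # → ['m4', 'm7']
```

On the Figure 4-1 example this yields exactly m4 and m7, the two conflict points the text schedules into the same step.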
Figure 4-1 Results of MDMSs for Figure 3-1
(messages m2 to m9; the MDMS messages are scheduled as S1: m8 m3 m5, S2: m6 m2 m9, S3: m4 m7)
Figure 4-2 The schedule obtained from SCPA
S1: m8 m3 m5 m1 m15 m10 m12
S2: m6 m2 m9 m13 m11
S3: m4 m7 m14
Chapter 5 Performance Evaluation and Analysis
To evaluate the performance of the proposed methods, we have implemented SCPA along with the divide-and-conquer algorithm [19]. The performance simulation is discussed in two classes, even GEN_BLOCK and uneven GEN_BLOCK distributions. In an even GEN_BLOCK distribution, each processor owns a similar amount of data. In contrast, in an uneven distribution a few processors might be allocated a large volume of data. Since array elements can be concentrated on some specific processors, those processors are also likely to have the maximum degree of communication.
The simulation program generates a set of random integers as the message sizes. We set the number of source processors equal to the number of target processors, so that no processor is left without a message. Moreover, the total message size of the source processors equals that of the target processors, to keep the balance between the source and target sides.
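The generation constraints above can be sketched as follows. This is an assumption about the simulator's structure, not its actual code: block sizes are drawn uniformly within the stated bounds and then rescaled so the source and target totals match exactly (after rescaling, sizes stay only approximately within the bounds).

```python
import random

# Sketch of GEN_BLOCK generation under the stated constraints: equal
# numbers of source and target processors, every processor owning data,
# and equal totals on both sides. Names are hypothetical.

def gen_block(numprocs, totalsize, low=0.3, up=1.5, seed=0):
    """Return per-processor block sizes summing to totalsize, each drawn
    from [low, up] * (totalsize / numprocs) before rescaling."""
    rng = random.Random(seed)
    avg = totalsize / numprocs
    sizes = [rng.uniform(low * avg, up * avg) for _ in range(numprocs)]
    scale = totalsize / sum(sizes)        # rescale so the totals match
    return [s * scale for s in sizes]

src = gen_block(8, 1 << 20, seed=1)      # source-side GEN_BLOCK
dst = gen_block(8, 1 << 20, seed=2)      # target-side GEN_BLOCK
assert abs(sum(src) - sum(dst)) < 1e-6   # balanced source/target totals
```

The uneven case would use the bounds 0.3 and 1.5 as above, and the even case 0.7 and 1.3, matching the bounds given below.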
To fairly evaluate the performance of these two algorithms, both programs were written in the single program multiple data (SPMD) programming paradigm with MPI and executed on an SMP/Linux cluster consisting of 24 SMP nodes. In the following figures, "SCPA Better" represents the percentage of events in which SCPA has a lower total computation (communication) time than the divide-and-conquer algorithm, while "DCA Better" gives the reverse situation. If both algorithms have the same total computation (communication) time, the event is counted under "The Same Results". In the uneven distribution, the upper bound of the message size is set to (totalsize/numprocs)*1.5 and the lower bound to (totalsize/numprocs)*0.3, where totalsize is the total size of the messages and numprocs is the number of processors. In the even distribution, the upper bound is set to (totalsize/numprocs)*1.3 and the lower bound to (totalsize/numprocs)*0.7. The total message size is 1M. Figure 5-1 to Figure 5-4 show the simulation results of both the smallest conflict points algorithm (SCPA) and the divide-and-conquer algorithm with different numbers of processors and total message sizes. The number of processors ranges from 8 to 24. We can observe that SCPA performs better than the divide-and-conquer algorithm on uneven data redistribution.
From the Figure 5-3 and Figure 5-4 data for the even case, we can observe that SCPA performs even better than in the uneven case. Figure 5-3 and Figure 5-4 also illustrate that SCPA is superior to divide-and-conquer in at least 85% of the events, for any total message size and any number of processors. Figure 5-5 to Figure 5-8 depict the communication time of SCPA and the divide-and-conquer algorithm against the number of processors and the total message size in the uneven (even) case. The communication cost (time) is measured from the real message exchange, so it depends on the bandwidth, I/O, and other factors of the real execution environment. In both the even and uneven cases, SCPA performs slightly better than the divide-and-conquer algorithm.
[Figures 5-1 to 5-8: bar charts of event percentage (%) versus total message size (10000 to 50000) or number of processors (8 to 24); legend: "The Same Results", "SCPA Better", "DCA Better".]
Figure 5-1 The event percentage of computing plotted for different numbers of processors
Figure 5-2 The event percentage of computing plotted for different total message sizes on 8 processors, on the uneven data set
Figure 5-3 The event percentage of computing plotted for different numbers of processors
Figure 5-4 The event percentage of computing plotted for different total message sizes on 8 processors, on the even data set
Figure 5-5 The event percentage of communication time plotted for different numbers of processors, on the uneven data set
Figure 5-6 The event percentage of communication time plotted for different total message sizes on 8 processors, on the uneven data set
Figure 5-7 The event percentage of communication time plotted for different numbers of processors
Figure 5-8 The event percentage of communication time plotted for different total message sizes on 8 processors, on the even data set
Figure 5-9 and Table 5-1 give the maximum degree for different numbers of nodes in the uneven case. Figure 5-10 and Table 5-2 give the maximum degree for different numbers of nodes in the even case. They show that as the number of processors rises, the maximum degree grows; and as the maximum degree grows, the complexity of scheduling increases.
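The degree of a processor is the number of messages it sends or receives during redistribution, and the maximum degree lower-bounds the number of communication steps. A small sketch of how it can be computed, with messages assumed to be (source, target) pairs (an illustrative representation):

```python
from collections import Counter

# Degree of a processor = number of messages it sends (as a source) or
# receives (as a target); the maximum degree over all processors bounds
# the minimum number of communication steps from below.

def max_degree(messages):
    deg = Counter()
    for src, dst in messages:
        deg[('send', src)] += 1    # messages sent by src
        deg[('recv', dst)] += 1    # messages received by dst
    return max(deg.values())

# More processors mean finer blocks, so each block tends to overlap more
# blocks on the other side, raising the maximum degree.
print(max_degree([(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]))  # → 2
```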
[Figure 5-9: stacked bar chart of the event percentage (%) of each maximum degree (maxdegree = 1 to 6) against the number of processors (4 to 24).]
Figure 5-9 Maximum degree of (3-15) redistribution
Table 5-1 The detail of the maximum degree in Figure 5-9
Node=4 Node=8 Node=12 Node=16 Node=20 Node=24
Maxdegree=1 0 0 0 0 0 0
Maxdegree=2 0 0 0 0 0 0
Maxdegree=3 4479 825 126 15 3 0
Maxdegree=4 5474 8560 8632 8116 7546 6993
Maxdegree=5 47 614 1240 1862 2434 2988
Maxdegree=6 0 1 2 7 17 19
[Figure 5-10: stacked bar chart of the event percentage (%) of each maximum degree (maxdegree = 1 to 6) against the number of processors (4 to 24).]
Figure 5-10 Maximum degree of (7-13) redistribution
Table 5-2 The detail of the maximum degree in Figure 5-10
Node=4 Node=8 Node=12 Node=16 Node=20 Node=24
Maxdegree=1 0 0 0 0 0 0
Maxdegree=2 0 0 0 0 0 0
Maxdegree=3 5009 2425 1337 739 408 213
Maxdegree=4 4991 7575 8663 9261 9592 9787
Maxdegree=5 0 0 0 0 0 0
Chapter 6 Conclusion
In this thesis, we have presented an efficient scheduling algorithm, the smallest conflict points algorithm (SCPA), for irregular data redistribution. The algorithm uses the smallest conflict points to give the schedule more flexibility. First, it finds the maximum degree message sets (MDMSs). Second, it picks out the explicit and implicit conflict points and puts the conflict points in the same step. Third, it places the messages in the MDMSs in non-increasing order of message length. Finally, it places the messages not in any MDMS by the same method as the third step. The algorithm can effectively reduce the communication time of the data redistribution process. The smallest conflict points algorithm is not only optimal in terms of the minimal number of steps, but also near optimal with respect to the minimal total message size over all steps. The proposed method not only avoids node contention but also shortens the overall communication length.
From the experiments, SCPA was better than DCA in at least 80% of the uneven cases and in at least 89% of the even cases. As the number of processors grows, the simulation results favor SCPA over DCA even more.
To verify the performance of the proposed algorithm, we implemented SCPA as well as the divide-and-conquer redistribution algorithm. The experimental results show that SCPA is better than DCA in both communication cost and scheduling results. The results also indicate that both algorithms perform well on GEN_BLOCK redistribution; in most cases, SCPA performs better than the divide-and-conquer redistribution algorithm.
References
[1] G. Bandera and E.L. Zapata, “Sparse Matrix Block-Cyclic Redistribution,” Proceeding of IEEE Int'l. Parallel Processing Symposium (IPPS'99), San Juan, Puerto Rico, April 1999 Page(s):355 - 359.
[2] Frederic Desprez, Jack Dongarra and Antoine Petitet, “Scheduling Block-Cyclic Data redistribution,” IEEE Trans. on PDS, vol. 9, no. 2, pp. 192-205, Feb. 1998.
[3] Minyi Guo, “Communication Generation for Irregular Codes,” The Journal of Supercomputing, vol. 25, no. 3, pp. 199-214, 2003.
[4] Minyi Guo and I. Nakata, “A Framework for Efficient Array Redistribution on Distributed Memory Multicomputers,” The Journal of Supercomputing, vol. 20, no. 3, pp. 243-265, 2001.
[5] Minyi Guo, I. Nakata and Y. Yamashita, “Contention-Free Communication Scheduling for Array Redistribution,” Parallel Computing, vol. 26, no.8, pp. 1325-1343, 2000.
[6] Minyi Guo, I. Nakata and Y. Yamashita, “An Efficient Data Distribution Technique for Distributed Memory Parallel Computers,” JSPP'97, pp.189-196, 1997.
[7] Minyi Guo, Yi Pan and Zhen Liu, “Symbolic Communication Set Generation for Irregular Parallel Applications,” The Journal of Supercomputing, vol. 25, pp. 199-214, 2003.
[8] Edgar T. Kalns, and Lionel M. Ni, “Processor Mapping Technique Toward Efficient Data Redistribution,” IEEE Trans. on PDS, vol. 6, no. 12, pp. 1234-1247, December
[9] S. D. Kaushik, C. H. Huang, J. Ramanujam and P. Sadayappan, “Multiphase data redistribution: Modeling and evaluation,” Proceeding of IPPS’95, pp. 441-445, 1995.
[10] Peizong Lee, Academia Sinica, and Zvi Meir Kedem, “Automatic Data and Computation Decomposition on Distributed Memory Parallel Computers,” ACM Transactions on Programming Languages and systems, Vol 24, No. 1, pp. 1-50, January 2002.
[11] S. Lee, H. Yook, M. Koo and M. Park, “Processor reordering algorithms toward efficient GEN_BLOCK redistribution,” Proceedings of the ACM symposium on Applied computing, pp . 539-543, 2001.
[12] Y. W. Lim, Prashanth B. Bhat and Viktor and K. Prasanna, “Efficient Algorithms for Block-Cyclic Redistribution of Arrays,” Algorithmica, vol. 24, no. 3-4, pp. 298-330, 1999.
[13] C.-H Hsu, S.-W Bai, Y.-C Chung and C.-S Yang, “A Generalized Basic-Cycle Calculation Method for Efficient Array Redistribution,” IEEE TPDS, vol. 11, no. 12, pp. 1201-1216, Dec. 2000.
[14] Ching-Hsien Hsu and Kun-Ming Yu, “An Optimal Processor Replacement Scheme for Efficient Communication of Runtime Data Realignment,” pp. 268-273, 2004.
[15] C.-H Hsu, Dong-Lin Yang, Yeh-Ching Chung and Chyi-Ren Dow, “A Generalized Processor Mapping Technique for Array Redistribution,” IEEE Transactions on Parallel and Distributed Systems, vol. 12, vol. 7, pp. 743-757, July 2001.
[16] Antoine P. Petitet and Jack J. Dongarra, “Algorithmic Redistribution Methods for Block-Cyclic Decompositions,” IEEE Trans. on PDS, vol. 10, no. 12, pp. 1201-1216, Dec. 1999
[17] Neungsoo Park, Viktor K. Prasanna and Cauligi S. Raghavendra, “Efficient Algorithms for Block-Cyclic Data redistribution Between Processor Sets,” IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 12, pp. 1217-1240, Dec. 1999.
[18] L. Prylli and B. Tourancheau, “Fast runtime block cyclic data redistribution on multiprocessors,” Journal of Parallel and Distributed Computing, vol. 45, pp. 63-72, Aug. 1997.
[19] S. Ramaswamy, B. Simons, and P. Banerjee, “Optimization for Efficient Data redistribution on Distributed Memory Multicomputers,” Journal of Parallel and Distributed Computing, vol. 38, pp. 217-228, 1996.
[20] Akiyoshi Wakatani and Michael Wolfe, “Optimization of Data redistribution for Distributed Memory Multicomputers,” short communication, Parallel Computing, vol. 21, no. 9, pp. 1485-1490, September 1995.
[21] Hui Wang, Minyi Guo and Wenxi Chen, “An Efficient Algorithm for Irregular Redistribution in Parallelizing Compilers,” Proceedings of 2003 International Symposium on Parallel and Distributed Processing with Applications, LNCS 2745, 2003.