Families of Parallel Prefix Algorithms for Multicomputers


國立台灣科技大學

資訊工程系

National Taiwan University of Science and Technology

Department of Computer Science and Information Engineering


林彥君 Yen-Chun Lin 洪麗玲 Li-Ling Hung

Technical Report: NTUST-CSIE-07-03

June 2007


Abstract

Four families of parallel prefix algorithms for message-passing multicomputers are presented.

The first family generalizes previous algorithms that use only half-duplex communications. The second family, which is difficult to understand without knowing the first, reduces the communication time of the first family. The third and fourth families adopt collective communication operations to further reduce the communication time of the first and second families, respectively. The proposed algorithms have the shortest computation time of all prefix algorithms for the multicomputer models. The precondition of the proposed algorithms is also derived. Each of the four families provides the flexibility of choosing either less computation time (and more communication time) or less communication time (and more computation time) to achieve the minimal running time. This choice and, more importantly, whether the running time is less than that of other prefix algorithms both hinge on the ratio of the time required by a communication step to the time required by a computation step.

Keywords: Collective communication, cost optimality, half-duplex, message-passing multicomputers, parallel prefix algorithms, computation


1. Introduction

The prefix computation is defined as follows: given n inputs x_1, x_2, …, x_n and an associative binary operator ⊕, compute y_i = x_1 ⊕ x_2 ⊕ … ⊕ x_i for 1 ≤ i ≤ n. For ease of presentation, unless otherwise stated, this study assumes that the x_i and y_i represent inputs and outputs, respectively, and that the number of inputs is n. Prefix computation has been extensively studied because of its wide application in fields such as biological sequence comparison, cryptography, design of silicon compilers, job scheduling, image processing, loop parallelization, polynomial evaluation, processor allocation, and sorting [1-3, 8, 10, 12, 14, 21-24, 26, 47, 49, 51, 52]. The binary operation ⊕ can be as simple as a Boolean operation or as time-consuming as a multiplication of matrices [11].
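As a point of reference, the definition above can be sketched sequentially in Python. The function name `prefix` is chosen here for illustration; the sketch uses n − 1 applications of ⊕ and relies only on associativity, so a non-commutative operator such as string concatenation also works.

```python
from itertools import accumulate
import operator

def prefix(xs, op):
    """Return [x_1, x_1 op x_2, ..., x_1 op ... op x_n] with n - 1 applications of op."""
    return list(accumulate(xs, op))

# Addition on integers.
print(prefix([3, 1, 4, 1, 5], operator.add))   # [3, 4, 8, 9, 14]
# String concatenation: associative but not commutative.
print(prefix(["a", "b", "c"], operator.add))   # ['a', 'ab', 'abc']
```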

Because of its importance and usefulness, prefix computation has been proposed as a primitive operation [4]. In fact, prefix computation is a built-in operation for Message-Passing Interface (MPI) parallel programming [15], and is implemented in hardware in the Thinking Machines CM-5 [46].

Additionally, many parallel prefix algorithms for various parallel computing models have been proposed [1, 7, 9, 17, 22, 24, 27, 29, 33, 38, 39, 41, 42], and many parallel prefix circuits have also been designed [3, 5, 6, 13, 16, 19, 23-25, 29-32, 34-37, 40, 43, 44, 50-52].

This paper presents four families of computation-efficient parallel prefix algorithms for message-passing multicomputers with p processing elements (PEs), where p < n. Half-duplex communication is the weakest communication mode of message-passing multicomputers: each PE can either send or receive a message in a communication step, but not both. This model of communication is basic and important [20]. Upper bounds of this model apply to other communication models, and lower bounds of this model are upper bounds for lower bounds of other models. Although a PE of a modern multicomputer can send and receive in the same step, doing both usually takes longer than only sending or only receiving, because of the inherent hardware capability and software overhead [18, 45]. On a p-PE system, half-duplex communication ensures that no more than p/2 messages are transferred in a communication step, and thus a communication step will not take too much time.

Egecioglu and Koc present a computation-efficient parallel prefix algorithm, henceforth named EK, for the half-duplex multicomputer model with p processing elements (PEs), where p < n [11].

Lin proposes an algorithm, henceforth named L, to reduce the communication time on the same model [28]. Our first family of algorithms generalizes Algorithm L so that the algorithms of the family allow many combinations of computation time and communication time. Algorithm L is at one extreme of the family; the others take less computation time and more communication time than Algorithm L. Users can thus take the exact time of performing ⊕ and that of communicating a message into account to choose the algorithm that requires the minimal running time. This study then shows how the communication time of the first family can be further decreased to obtain the second family, which is difficult to understand if the first family is not known.

For ease of programming and efficiency, collective communication operations, such as broadcast, can achieve the same effect as a sequence of point-to-point operations but in less time [48]. We thus use collective communications to further improve the communication time, and obtain two more families of algorithms from the first two families, respectively.

The rest of this paper is organized as follows. Section 2 presents the first family of parallel prefix algorithms for half-duplex message-passing multicomputers. Section 3 then derives the computation time and communication time, as well as other properties, of the first family. Section 4 shows how the communication time of the first family can be reduced to obtain the second family; the new communication time is also derived. Section 5 proposes reducing the communication time by using collective communications to obtain two more families of algorithms. Section 6 shows that the proposed algorithms are not always effective when only p < n holds, and derives a much stronger precondition for the algorithms. Section 7 compares the new algorithms with previous ones for multicomputers. Conclusions are finally drawn in Section 8.

2. A family of parallel prefix algorithms

This section describes a family of parallel algorithms for solving the prefix problem of n inputs using p PEs, where p < n, on the half-duplex multicomputer model. A later section elaborates the exact relation between p and n for the algorithms to be useful. The p PEs are represented by P_1, P_2, …, P_p. For ease of presentation, i:j is used to represent the result of computing x_i ⊕ x_{i+1} ⊕ … ⊕ x_j, where i ≤ j. Like Algorithms EK and L, the proposed algorithms are most practical when the amount of time required to perform a binary operation ⊕ is greater than that required to transfer a message between two PEs. This situation may arise, for example, when the binary operation is a time-consuming matrix multiplication.

Algorithm A(n, p, k)

{Solving the prefix problem of n inputs, x_1, x_2, …, x_n, using p PEs to generate y_1, y_2, …, y_n, where n > p = kq + 1, k ≥ 1, q ≥ 1. For ease of presentation, assume that all numerical values are integers.}

Phase 1: Partition the inputs into two parts N_1 = (x_1, x_2, …, x_v) and N_2 = (x_{v+1}, x_{v+2}, …, x_n), where 0 < v < n. How the value of v is determined will be explained shortly. If p = k + 1, then P_1 uses N_1 to compute outputs y_1, y_2, …, y_v sequentially; otherwise, P_1, P_2, …, P_{p–k} use N_1 to compute y_1, y_2, …, y_v by invoking A(v, p – k, k) recursively. In the meantime, N_2 is first distributed evenly among the other k PEs, P_{p–k+1}, P_{p–k+2}, …, P_p; each of these PEs holds c = (n – v)/k input values. These k PEs then concurrently compute z_1 = (z_{1,1}, z_{1,2}, …, z_{1,c}), z_2 = (z_{2,1}, z_{2,2}, …, z_{2,c}), …, z_k = (z_{k,1}, z_{k,2}, …, z_{k,c}), respectively, where z_{i,j} = (v + (i – 1)c + 1):(v + (i – 1)c + j). The value of v is chosen to make the number of computation steps required in this phase by the first p – k PEs equal to that required by the other k PEs; it is derived later. Note that y_v is obtained by P_{p–k}.

Phase 2: Initially, P_{p–k} sends y_v to all the other PEs. Next, P_{p–k+1} scatters, i.e., partitions and distributes, z_1 among all the PEs evenly, each PE having c/p of the c values. All the PEs then concurrently compute y_{v+i} = y_v ⊕ z_{1,i}, i = 1, 2, …, c, in c/p computation steps. Note that y_{v+c} is computed by P_p.

Phase m (m = 3, 4, …, k + 1): Initially, P_p sends y_{v+(m–2)c} to all the other PEs. Next, P_{p–k+m–1} scatters z_{m–1} among all the PEs evenly, each PE having c/p values. All the PEs then concurrently compute y_{v+(m–2)c+i} = y_{v+(m–2)c} ⊕ z_{m–1,i}, i = 1, 2, …, c, in c/p computation steps. Note that y_{v+(m–1)c} is computed by P_p.
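The data flow of the phases above can be illustrated with a sequential simulation. This is only a sketch under stated assumptions: the names `alpha`, `split_evenly`, and `algorithm_a` are hypothetical, v is taken as the rounded value of the fraction stated in Theorem 1 of Section 3, blocks that do not divide evenly are allowed to differ in length by one, and message passing is not modeled (a loop over z_i stands in for the send, the scatter, and the concurrent combination).

```python
from fractions import Fraction
from itertools import accumulate
import operator

def alpha(p, k):
    """Share of the inputs given to the first p - k PEs (Theorem 1)."""
    return Fraction(p * p - k * p + k + 1, p * p + k * p + k + 1)

def split_evenly(seq, parts):
    """Cut seq into `parts` contiguous chunks whose lengths differ by at most one."""
    q, r = divmod(len(seq), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = q + (1 if i < r else 0)
        chunks.append(seq[start:start + size])
        start += size
    return chunks

def algorithm_a(xs, p, k, op):
    """Sequential simulation of A(n, p, k); returns [y_1, ..., y_n]."""
    assert xs and p >= k + 1 and (p - 1) % k == 0
    n = len(xs)
    # Phase 1: split at v, about alpha * n, rounded and clamped to 1..n.
    v = max(1, min(n, round(alpha(p, k) * n)))
    if p == k + 1:
        y = list(accumulate(xs[:v], op))         # P_1 works alone
    else:
        y = algorithm_a(xs[:v], p - k, k, op)    # first p - k PEs recurse
    # Meanwhile each of the last k PEs computes the local prefixes z_i of its block.
    z = [list(accumulate(block, op)) for block in split_evenly(xs[v:], k)]
    # Phases 2..k+1: the newest prefix is combined with every element of z_i.
    for z_i in z:
        carry = y[-1]
        y.extend(op(carry, t) for t in z_i)
    return y

letters = [chr(ord("a") + i % 26) for i in range(74)]
print(algorithm_a(letters, 7, 3, operator.add)[:4])   # ['a', 'ab', 'abc', 'abcd']
```

String concatenation is used in the usage line because it is associative but not commutative, so an ordering mistake in the phases would show up in the output.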

Let C(n, p, k) denote the number of computation steps required by Algorithm A(n, p, k), and R(n, p, k) denote the number of communication steps. Like the two previous papers [11, 28], this paper does not take into account the initial input data loading time. To help understand the algorithm, we give two examples in the following. First, consider the case when p = 4, k = 3.

Phase 1: Assign N_1 = (x_1, x_2, …, x_v) to P_1 and N_2 = (x_{v+1}, x_{v+2}, …, x_n) to P_2, P_3, P_4. The prefixes of N_1 are computed by P_1, and those of N_2 by the other PEs. By the rule for deciding v, the number of computation steps required by P_1, v – 1, equals the number of parallel computation steps required by the other three PEs, (n – v)/3 – 1. Hence, v = n/4. After phase 1 has completed, P_1 obtains y_1, y_2, …, y_{n/4}; P_2 obtains z_{1,1}, z_{1,2}, …, z_{1,n/4}; P_3 obtains z_{2,1}, z_{2,2}, …, z_{2,n/4}; and P_4 obtains z_{3,1}, z_{3,2}, …, z_{3,n/4}. Phase 1 takes n/4 – 1 computation steps.

Phase 2: P_1 initially sends y_{n/4} to P_2, P_3, P_4 in 3 communication steps. Then, P_2 sends 1/4 of the n/4 prefixes computed in phase 1 to each of the other three PEs in 3 communication steps. That is, z_{1,1} through z_{1,n/16}, z_{1,n/8+1} through z_{1,3n/16}, and z_{1,3n/16+1} through z_{1,n/4} are sent to P_1, P_3, P_4, respectively. Subsequently, the four PEs compute n/4 outputs y_{n/4+1}, y_{n/4+2}, …, y_{n/2} in n/16 parallel computation steps. At the end, P_4 has y_{n/2}.

Phase 3: P_4 initially sends y_{n/2} to the other PEs in 3 communication steps. Then, P_3 sends 1/4 of the n/4 prefixes computed in phase 1 to each of the other three PEs in 3 communication steps. That is, z_{2,1} through z_{2,n/16}, z_{2,n/16+1} through z_{2,n/8}, and z_{2,3n/16+1} through z_{2,n/4} are sent to P_1, P_2, P_4, respectively. Subsequently, the four PEs compute n/4 outputs y_{n/2+1}, y_{n/2+2}, …, y_{3n/4} concurrently in n/16 computation steps. At the end, P_4 has y_{3n/4}.

Phase 4: P_4 sends y_{3n/4} and 1/4 of the n/4 prefixes computed in phase 1 to each of the other three PEs in 3 communication steps. That is, z_{3,1} through z_{3,n/16}, z_{3,n/16+1} through z_{3,n/8}, and z_{3,n/8+1} through z_{3,3n/16} are sent to P_1, P_2, P_3, respectively. Subsequently, the four PEs concurrently compute n/4 outputs y_{3n/4+1}, y_{3n/4+2}, …, y_n in n/16 computation steps.

Therefore, the total number of computation steps is

C(n, 4, 3) = (n/4 – 1) + n/16 + n/16 + n/16 = 7n/16 – 1.    (1)

The total number of communication steps is

R(n, 4, 3) = 3 × 2 + 3 × 2 + 3 = 15.

Next, consider the case when p = 7, k = 3.

Phase 1: Assign N_1 = (x_1, x_2, …, x_v) to the first four PEs, and assign N_2 = (x_{v+1}, x_{v+2}, …, x_n) to the last three PEs. From Eq. (1), we know that P_1, P_2, P_3, P_4 can compute the prefixes of N_1 in C(v, 4, 3) = 7v/16 – 1 computation steps. In the meantime, P_5, P_6, P_7 share N_2 evenly and compute their respective prefixes concurrently, taking (n – v)/3 – 1 computation steps. By the rule for deciding v,

7v/16 – 1 = (n – v)/3 – 1,

v = 16n/37.

Consequently, c = (n – v)/3 = 7n/37. Thus, the prefixes z_{1,1} through z_{1,7n/37} are computed in P_5, z_{2,1} through z_{2,7n/37} in P_6, and z_{3,1} through z_{3,7n/37} in P_7. Note that P_4 obtains y_v = y_{16n/37}, and C(16n/37, 4, 3) = 7n/37 – 1.

Phase 2: P_4 initially sends y_{16n/37} to the other six PEs in 6 communication steps. Then, P_5 sends 1/7 of the 7n/37 prefixes computed in phase 1 to each of the other six PEs in 6 communication steps. Subsequently, the seven PEs compute 7n/37 outputs y_{16n/37+1}, y_{16n/37+2}, …, y_{23n/37} in n/37 parallel computation steps. At the end, P_7 has y_{23n/37}.

Phase 3: P_7 initially sends y_{23n/37} to the other six PEs in 6 communication steps. Then, P_6 sends 1/7 of the 7n/37 prefixes computed in phase 1 to each of the other six PEs in 6 communication steps. Subsequently, the seven PEs concurrently compute 7n/37 outputs y_{23n/37+1}, y_{23n/37+2}, …, y_{30n/37} in n/37 computation steps. At the end, P_7 has y_{30n/37}.

Phase 4: P_7 sends y_{30n/37} and 1/7 of the 7n/37 prefixes computed in phase 1 to each of the other six PEs in 6 communication steps. The seven PEs then concurrently compute 7n/37 outputs y_{30n/37+1}, y_{30n/37+2}, …, y_n in n/37 computation steps.

Therefore, the total number of computation steps is

C(n, 7, 3) = 7n/37 – 1 + (n/37) × 3 = 10n/37 – 1.

The total number of communication steps is

R(n, 7, 3) = R(16n/37, 4, 3) + 6 × 5 = 45.
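The two worked examples can be re-derived mechanically from the balancing rule for v and the phase descriptions alone, without any closed form. The sketch below is illustrative; the function name `analyse` is an assumption, exact fractions are used so that the coefficients of n can be compared directly, and the communication count assumes k ≥ 2 as in both examples.

```python
from fractions import Fraction

def analyse(p, k):
    """Return (alpha, a, r): v = alpha*n, C(n, p, k) = a*n - 1, R(n, p, k) = r.

    Derived only from the rule that balances phase 1 and from the phase
    descriptions; the communication count assumes k >= 2.
    """
    if p == k + 1:
        a_first, r_first = Fraction(1), 0    # P_1 alone: v - 1 steps, no messages
    else:
        _, a_first, r_first = analyse(p - k, k)
    # Balance phase 1: a_first * v - 1 = (n - v)/k - 1, so v = n / (1 + k * a_first).
    alpha = 1 / (1 + k * a_first)
    # Phase 1 costs (n - v)/k - 1 steps; phases 2..k+1 cost (n - v)/p steps in total.
    a = (1 - alpha) / k + (1 - alpha) / p
    # Per level: k sends of a prefix and k scatters, each to p - 1 PEs,
    # with the last send and scatter merged.
    r = r_first + (2 * k - 1) * (p - 1)
    return alpha, a, r

print(analyse(4, 3))   # alpha = 1/4,   C = 7n/16 - 1,  R = 15
print(analyse(7, 3))   # alpha = 16/37, C = 10n/37 - 1, R = 45
```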

In the next section, the numbers of computation steps and communication steps in the general case, as well as other properties of Algorithm A, are derived.

3. Properties of Algorithm A

This section gives the value of v, C(n, p, k), and R(n, p, k). The values of p and k that lead to the smallest C(n, p, k) and R(n, p, k) are also derived.

Let v = α_{p,k} n, where 0 < α_{p,k} < 1. The value of α_{p,k} is given in the following theorem.

Theorem 1: When p = k + 1, α_{p,k} = 1/p; otherwise,

$$\alpha_{p,k} = \frac{p^2 - kp + k + 1}{p^2 + kp + k + 1}.$$

Proof: Since the proof is very lengthy, it is given in the appendix.

Theorem 2:

$$C(n, p, k) = \frac{2n(p + k)}{p^2 + kp + k + 1} - 1.$$

Proof: We consider two cases separately.

Case 1: p = k + 1. In phase 1 of Algorithm A(n, p, k), P_1 takes α_{p,k} n – 1 computation steps to sequentially compute the prefixes of the α_{p,k} n inputs assigned, and α_{p,k} n – 1 equals the number of computation steps required by the other k PEs. In phases 2 through k + 1, a total of (n – α_{p,k} n)/p computation steps are required to compute n – α_{p,k} n values, precisely y_{v+1}, y_{v+2}, …, y_n, by all the p PEs concurrently. Hence,

$$C(n, p, k) = \alpha_{p,k}\, n - 1 + \frac{n - \alpha_{p,k}\, n}{p}.$$

From Theorem 1, α_{p,k} = 1/p; thus,

$$C(n, p, k) = \frac{n}{p} - 1 + \frac{n - n/p}{p} = \frac{n(2p - 1)}{p^2} - 1.$$

Since p = k + 1, we have

$$C(n, p, k) = \frac{n(2p - 1)}{p^2} - 1 = \frac{2n(p + k)}{p^2 + p(k + 1)} - 1 = \frac{2n(p + k)}{p^2 + kp + k + 1} - 1.$$

Case 2: p = kq + 1, where q ≥ 2. The number of computation steps in phase 1 required by P_1 through P_{p–k} equals that required by P_{p–k+1} through P_p. That is,

$$C(\alpha_{p,k}\, n,\; p - k,\; k) = \frac{n - \alpha_{p,k}\, n}{k} - 1. \tag{2}$$

As mentioned in case 1, phases 2 through k + 1 require a total of (n – α_{p,k} n)/p computation steps. Hence,

$$C(n, p, k) = C(\alpha_{p,k}\, n,\; p - k,\; k) + \frac{n - \alpha_{p,k}\, n}{p}.$$

By Eq. (2), this becomes

$$C(n, p, k) = \frac{n - \alpha_{p,k}\, n}{k} - 1 + \frac{n - \alpha_{p,k}\, n}{p} = \frac{n(p + k)(1 - \alpha_{p,k})}{kp} - 1. \tag{3}$$

By Theorem 1,

$$\alpha_{p,k} = \frac{p^2 - kp + k + 1}{p^2 + kp + k + 1}.$$

Eq. (3) then becomes

$$C(n, p, k) = \frac{n(p + k)\left(1 - \dfrac{p^2 - kp + k + 1}{p^2 + kp + k + 1}\right)}{kp} - 1 = \frac{2n(p + k)}{p^2 + kp + k + 1} - 1.$$

Q.E.D.

We can then consider the impact of the values of p and k on the computation time. Let a = bq_1 + 1 and d = eq_2 + 1, where b, e, q_1, q_2 ≥ 1. Then

$$C(n, a, b) - C(n, d, e) = \frac{2n(a + b)}{a^2 + ab + b + 1} - \frac{2n(d + e)}{d^2 + de + e + 1} = 2n\,\frac{(a + b)(d^2 + de + e + 1) - (a^2 + ab + b + 1)(d + e)}{(a^2 + ab + b + 1)(d^2 + de + e + 1)}.$$

The numerator above can be rewritten

ad² + ade + ae + a + bd² + bde + be + b − (a²d + abd + bd + d + a²e + abe + be + e)

= ad(d − a) + ad(e − b) + (ae − bd) + (a − d) + (bd² − a²e) + be(d − a) + (b − e).

If a = d, the numerator becomes

a²(e − b) + a(e − b) + a²(b − e) + (b − e) = (e − b)(a − 1).

Clearly, if e > b, then the numerator is positive, and thus C(n, a, b) > C(n, a, e); that is, for the same number of PEs p, A(n, p, e) takes less computation time than A(n, p, b).

On the other hand, if b = e, the numerator becomes

ad(d − a) + b(a − d) + (a − d) + b(d² − a²) + b²(d − a)

= (d − a)(ad − b − 1 + ab + bd + b²).

Clearly, if d > a, then the numerator is positive, and thus C(n, a, b) > C(n, d, b).

To summarize, if d > a and e > b, then C(n, a, b) > C(n, d, b) > C(n, d, e). This means that using as many PEs as possible and the maximal k, which equals p – 1, achieves the minimal computation time.

Theorem 3: R(n, p, 1) = p(p − 1);

R(n, p, k) = (2k − 1)(p – 1)(p + k – 1)/(2k) when k ≥ 2.

Proof: From Algorithm A, R(n, p, k) is the sum of the following components: (i) the number of communication steps, R(v, p – k, k), when the first p – k PEs perform A(v, p – k, k) in phase 1; (ii) the number of communication steps required by P_{p–k} to send y_v to the other p – 1 PEs in phase 2; (iii) the number of communication steps required by P_p to send y_{v+ic} to the other p – 1 PEs in phase i + 2, for i = 1, 2, …, k – 1; and (iv) the number of communication steps taken to distribute a total of kc, or n – v, z_{i,j} values evenly among all the p PEs in phases 2 through k + 1. We consider the following three cases.

Case 1: k = 1. The algorithm has only two phases, and it degenerates into Algorithm L. It has been shown [28] that

R(n, p, 1) = p(p – 1).

Case 2: k ≥ 2 and p = k + 1. Component (i) is 0; component (ii) is p – 1; component (iii) is (k – 1)(p – 1); and component (iv) is k(p – 1). Note that in phase k + 1, component (iii) has the same communication source and destinations as component (iv). These two components can be merged into one in phase k + 1; that is, p − 1 communication steps can be saved. Hence,

R(n, p, k) = 0 + (p – 1) + (k – 1)(p – 1) + k(p – 1) − (p − 1)

= (p – 1)(2k – 1),    (4)

which is equal to (2k – 1)(p – 1)(p + k – 1)/(2k) since p = k + 1.

Case 3: k ≥ 2 and p = k(i + 1) + 1, where i ≥ 1. Component (i) is R(v, p – k, k), which is required by the first p – k PEs to communicate the prefixes of v inputs. Components (ii), (iii), and (iv) are the same as those in case 2. Thus, using the result of case 2, Eq. (4), we have

$$
\begin{aligned}
R(n, p, k) &= R(\alpha_{p,k} n,\; p - k,\; k) + (p - 1)(2k - 1) \\
&= R(\alpha_{p-k,k}\,\alpha_{p,k} n,\; p - 2k,\; k) + (p - k - 1)(2k - 1) + (p - 1)(2k - 1) \\
&= \cdots \\
&= R(\alpha_{p-(i-1)k,k} \cdots \alpha_{p-k,k}\,\alpha_{p,k} n,\; p - ik,\; k) + (2k - 1)\big[(p - (i - 1)k - 1) + \cdots + (p - k - 1) + (p - 1)\big] \\
&= R(\alpha_{p-(i-1)k,k} \cdots \alpha_{p-k,k}\,\alpha_{p,k} n,\; p - ik,\; k) + (2k - 1)\big[i(p - 1) - (k + 2k + \cdots + (i - 1)k)\big] \\
&= R(\alpha_{2k+1,k} \cdots \alpha_{p-k,k}\,\alpha_{p,k} n,\; k + 1,\; k) + (2k - 1)\Big[i(p - 1) - \frac{ki(i - 1)}{2}\Big] \\
&= R(\alpha_{2k+1,k} \cdots \alpha_{p-k,k}\,\alpha_{p,k} n,\; k + 1,\; k) + (2k - 1)\Big[\frac{p - k - 1}{k}(p - 1) - \frac{p - k - 1}{2}\cdot\frac{p - 2k - 1}{k}\Big].
\end{aligned}
$$

Since Eq. (4) can be rewritten

R(n, k + 1, k) = k(2k – 1),

which is independent of the value of n, we obtain

$$
\begin{aligned}
R(n, p, k) &= k(2k - 1) + (2k - 1)\Big(\frac{p^2 - 2p + 1 - kp + k}{k} - \frac{p^2 - 2p + 1 - 3kp + 3k + 2k^2}{2k}\Big) \\
&= (2k - 1)\,\frac{2k^2 + 2p^2 - 4p + 2 - 2kp + 2k - p^2 + 2p - 1 + 3kp - 3k - 2k^2}{2k} \\
&= (2k - 1)(p - 1)(p + k - 1)/(2k).
\end{aligned}
$$

Q.E.D.
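The closed form of Theorem 3 for k ≥ 2 can be cross-checked against the recurrence used in the proof, R(n, p, k) = R(v, p – k, k) + (p – 1)(2k – 1) with base case R(n, k + 1, k) = k(2k – 1). The following is a small sketch; the function names are chosen here for illustration.

```python
def comm_steps_recursive(p, k):
    """R(n, p, k) for k >= 2, following the recurrence in the proof of Theorem 3."""
    assert k >= 2 and p >= k + 1 and (p - 1) % k == 0
    per_level = (2 * k - 1) * (p - 1)        # Eq. (4): components (ii)-(iv), merged
    if p == k + 1:
        return per_level
    return comm_steps_recursive(p - k, k) + per_level

def comm_steps_closed(p, k):
    """Closed form (2k - 1)(p - 1)(p + k - 1)/(2k); exact because k divides p - 1."""
    return (2 * k - 1) * (p - 1) * (p + k - 1) // (2 * k)

for k in range(2, 7):
    for q in range(1, 7):
        p = k * q + 1
        assert comm_steps_recursive(p, k) == comm_steps_closed(p, k)

print(comm_steps_closed(4, 3), comm_steps_closed(7, 3))   # 15 45
```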


Theorem 3 reveals the effect of p and k on the communication time. Clearly, a smaller p results in less communication time. To see the effect of k, we first note that when k ≥ 2, a smaller k leads to less communication time. The case k = 1 is an exception: by Theorem 3, R(n, p, 2) = 3(p – 1)(p + 1)/4 ≤ p(p – 1) = R(n, p, 1) for every valid p ≥ 3, with equality only when p = 3. Thus, among k ≥ 2 a smaller k results in less communication time, whereas k = 1 does not take fewer communication steps than k = 2.

However, as already mentioned, a larger k or p leads to less computation time. Thus, it is not straightforward to decide the values of p and k that achieve the least running time. Let σ be the ratio of the time required by a communication step to the time required by a computation step. The total running time of the algorithm is proportional to C(n, p, k) + σR(n, p, k), which should be minimized by choosing appropriate values of p and k to solve the prefix problem as fast as possible.

From Theorem 2, C(n, p, k) = Θ(n/p); from Theorem 3, R(n, p, k) = Θ(p²). In total, Algorithm A takes Θ(n/p) + Θ(p²) time. When n = Ω(p³), n/p = Ω(p²), and thus Θ(n/p) + Θ(p²) = Θ(n/p). Since the sequential solution for the prefix problem takes Θ(n) time, Algorithm A is cost optimal when n = Ω(p³).
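The choice of p and k for a given σ can be explored by exhaustive search over the valid pairs, using the step counts of Theorems 2 and 3. The sketch below is only illustrative: the function names and the upper limit `max_p` are assumptions, and pairs that violate the precondition n ≥ (p² + kp + k + 1)/2 derived in Section 6 are skipped.

```python
from fractions import Fraction

def comp_steps(n, p, k):
    """C(n, p, k) from Theorem 2."""
    return Fraction(2 * n * (p + k), p * p + k * p + k + 1) - 1

def comm_steps(p, k):
    """R(n, p, k) from Theorem 3."""
    if k == 1:
        return Fraction(p * (p - 1))
    return Fraction((2 * k - 1) * (p - 1) * (p + k - 1), 2 * k)

def best_choice(n, max_p, sigma):
    """Return (cost, p, k) minimizing C + sigma * R over valid p <= max_p."""
    best = None
    for p in range(2, max_p + 1):
        for k in range(1, p):
            if (p - 1) % k:
                continue                      # need p = kq + 1
            if 2 * n < p * p + k * p + k + 1:
                continue                      # precondition of Section 6
            cost = comp_steps(n, p, k) + sigma * comm_steps(p, k)
            if best is None or cost < best[0]:
                best = (cost, p, k)
    return best

# Cheap communication favors many PEs and k = p - 1.
print(best_choice(10**6, 16, Fraction(1, 1000))[1:])   # (16, 15)
# Very expensive communication favors few PEs and few messages.
print(best_choice(10**6, 16, 10**6)[1:])               # (2, 1)
```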

4. Reducing communication time

This section proposes modifying Algorithm A when p ≥ 2k + 1 and k ≥ 2, to obtain a faster algorithm named B. Recall that in the first phase of Algorithm A, the value of v is chosen to make the number of computation steps required by the first p – k PEs equal to the number of computation steps required by the other k PEs. Note that when p ≥ 2k + 1 and k ≥ 2, the first p – k PEs communicate among themselves in phase 1, but the last k PEs perform no communication operations; thus, the last k PEs are idle for the amount of time that the first p – k PEs take to communicate.

To reduce this idle time, we note that in Algorithm A there are communications among the last k PEs in phases 2 through k + 1, which can be performed in phase 1, but the communications involving the first p – k PEs in phases 2 through k + 1 should not be moved to phase 1. The original communications in each of these k phases that can be moved to phase 1 are transfers from a PE to k – 1 PEs. Thus, a total of k(k − 1) communication steps can be moved to phase 1.

However, we must determine whether all or only part of the k(k – 1) communication steps can be moved. Comparing k(k – 1) with the number of communication steps in phase 1 of Algorithm A, which is R(v, p – k, k), clarifies this. If R(v, p – k, k) > k(k – 1), the k(k – 1) communication steps can all be performed in phase 1 without increasing the communication time. From Theorem 3, with p – k in place of p,

R(v, p – k, k) = (2k − 1)(p − k − 1)(p − 1)/(2k).

Since p ≥ 2k + 1, we have

R(v, p – k, k) ≥ (2k − 1)(k)(2k)/(2k) = k(2k − 1).

Thus,

R(v, p – k, k) – k(k – 1) ≥ 2k² – k – k² + k = k² > 0.

That is, R(v, p – k, k) > k(k − 1). Therefore, Algorithm B takes less communication time than Algorithm A when p ≥ 2k + 1 and k ≥ 2. For the other valid values of p and k, the two algorithms are the same and thus take the same amount of communication time. Note that Algorithm B takes the same computation time as Algorithm A.

Let S(n, p, k) denote the number of communication steps required by Algorithm B(n, p, k). When p ≥ 2k + 1 and k ≥ 2, S(n, p, k) is the sum of the following components: (i) S(v, p – k, k) communication steps required by the first p – k PEs invoking B(v, p – k, k) in phase 1; (ii) p – 1 communication steps required by P_{p–k} to send y_v to the other p – 1 PEs in phase 2; (iii) (k – 1)(p – 1) communication steps required by P_p to send y_{v+ic} to the other p – 1 PEs in phase i + 2, for i = 1, 2, …, k – 1; and (iv) k(p – k) communication steps taken to distribute part of the z_{i,j} values evenly among P_1, P_2, …, P_{p–k} in phases 2 through k + 1. Note that in phase k + 1, since the communications that contribute to component (iii) and those that contribute to component (iv) are all sent out by P_p, the latter communications can be merged into the former ones, eliminating p − k communication steps in this phase. Let p = k(i + 1) + 1, where i ≥ 1. Hence,

$$
\begin{aligned}
S(n, p, k) &= S(\alpha_{p,k} n,\; p - k,\; k) + (p - 1) + (k - 1)(p - 1) + k(p - k) - (p - k) \\
&= S(\alpha_{p,k} n,\; p - k,\; k) + k(p - 1) + (k - 1)(p - k) \\
&= S(\alpha_{p-k,k}\,\alpha_{p,k} n,\; p - 2k,\; k) + k(p - k - 1) + (k - 1)(p - 2k) + k(p - 1) + (k - 1)(p - k) \\
&= \cdots \\
&= S(\alpha_{p-(i-1)k,k} \cdots \alpha_{p-k,k}\,\alpha_{p,k} n,\; p - ik,\; k) + k\big[(p - (i - 1)k - 1) + \cdots + (p - k - 1) + (p - 1)\big] \\
&\qquad + (k - 1)\big[(p - ik) + \cdots + (p - 2k) + (p - k)\big] \\
&= S(\alpha_{p-(i-1)k,k} \cdots \alpha_{p-k,k}\,\alpha_{p,k} n,\; p - ik,\; k) + k\big[i(p - 1) - (k + 2k + \cdots + (i - 1)k)\big] \\
&\qquad + (k - 1)\big[ip - (k + 2k + \cdots + ik)\big] \\
&= S(\alpha_{p-(i-1)k,k} \cdots \alpha_{p-k,k}\,\alpha_{p,k} n,\; p - ik,\; k) + k\Big[i(p - 1) - \frac{ik(i - 1)}{2}\Big] + (k - 1)\Big[ip - \frac{ik(i + 1)}{2}\Big] \\
&= S(\alpha_{2k+1,k} \cdots \alpha_{p-k,k}\,\alpha_{p,k} n,\; k + 1,\; k) + k\Big[\frac{p - k - 1}{k}(p - 1) - \frac{(p - k - 1)(p - 2k - 1)}{2k}\Big] \\
&\qquad + (k - 1)\Big[\frac{p - k - 1}{k}\,p - \frac{(p - k - 1)(p - 1)}{2k}\Big].
\end{aligned}
$$

Since S(n, p, k) = R(n, p, k) when p = k + 1, from Theorem 3, when k ≥ 2,

S(n, k + 1, k) = R(n, k + 1, k) = k(2k – 1).

Thus,

$$
\begin{aligned}
S(n, p, k) &= k(2k - 1) + k\Big[\frac{p - k - 1}{k}(p - 1) - \frac{(p - k - 1)(p - 2k - 1)}{2k}\Big] + (k - 1)\Big[\frac{p - k - 1}{k}\,p - \frac{(p - k - 1)(p - 1)}{2k}\Big] \\
&= k(2k - 1) + \frac{2k - 1}{2k}\,p^2 - \frac{k}{2k}\,p + \frac{-2k^3 - 2k^2 + k + 1}{2k} \\
&= \frac{2k - 1}{2k}\,p^2 - \frac{1}{2}\,p + k^2 - 2k + \frac{k + 1}{2k}.
\end{aligned}
$$

Clearly, S(n, p, k) = Θ(p²). Therefore, the discussion of cost optimality of Algorithm A applies to Algorithm B. That is, Algorithm B is also cost optimal when n = Ω(p³).
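The closed form for S(n, p, k) can be checked against its recurrence, S(n, p, k) = S(v, p – k, k) + k(p – 1) + (k – 1)(p – k) with S(n, k + 1, k) = k(2k – 1). This is a short sketch with hypothetical function names; exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction

def s_recursive(p, k):
    """S(n, p, k) for k >= 2, following the recurrence of this section."""
    assert k >= 2 and p >= k + 1 and (p - 1) % k == 0
    if p == k + 1:
        return k * (2 * k - 1)                # B coincides with A here
    return s_recursive(p - k, k) + k * (p - 1) + (k - 1) * (p - k)

def s_closed(p, k):
    """Closed form (2k-1)p^2/(2k) - p/2 + k^2 - 2k + (k+1)/(2k)."""
    return (Fraction(2 * k - 1, 2 * k) * p * p - Fraction(p, 2)
            + k * k - 2 * k + Fraction(k + 1, 2 * k))

for k in range(2, 7):
    for q in range(1, 7):
        p = k * q + 1
        assert s_recursive(p, k) == s_closed(p, k)

# For p = 7, k = 3, Algorithm A needs 45 communication steps (Section 2).
print(s_recursive(7, 3))   # 41
```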

5. Using collective communications

This section considers reducing the communication time of Algorithms A and B when collective communications are available. Let us first consider modifying Algorithm A to obtain a new one named E. In phase i + 2, 0 ≤ i ≤ k – 1, of Algorithm A, the p – 1 transfers of y_{v+ic} can be replaced by a single broadcast to achieve the same effect. In addition, the scattering of z_i in phase i + 1, 1 ≤ i ≤ k, can be done with a single scatter operation.

With a suitable implementation, either a broadcast or a scatter operation takes Θ(log p) time to send to p PEs on a multicomputer. Thus, broadcasting y_v, y_{v+c}, …, y_{v+(k–1)c} takes c_1 k log p steps, where c_1 is a constant, and the k scatter operations take c_2 k log p steps, where c_2 is a constant. Let c = c_1 + c_2 and p = k(i + 1) + 1, where i ≥ 0. (In this section, c denotes this constant rather than the block size (n – v)/k of Section 2.) We use T(n, p, k) to denote the number of communication steps required by Algorithm E(n, p, k), and consider the following three cases to examine T(n, p, k).

Case 1: k ≥ 2 and p = k + 1. The communications are k broadcast operations and k scatter

operations. Thus,

T(n, p, k) = ck log p = c(p – 1) log p = Θ(p log p).

Case 2: k ≥ 2 and p ≥ 2k + 1.

$$
\begin{aligned}
T(n, p, k) &= T(\alpha_{p,k} n,\; p - k,\; k) + ck \log p \\
&= T(\alpha_{p-k,k}\,\alpha_{p,k} n,\; p - 2k,\; k) + ck \log (p - k) + ck \log p \\
&= \cdots \\
&= T(\alpha_{p-(i-1)k,k} \cdots \alpha_{p-k,k}\,\alpha_{p,k} n,\; p - ik,\; k) + ck \sum_{j=0}^{i-1} \log (p - jk) \\
&= T(\alpha_{2k+1,k} \cdots \alpha_{p-k,k}\,\alpha_{p,k} n,\; k + 1,\; k) + ck \sum_{j=0}^{(p-2k-1)/k} \log (p - jk).
\end{aligned}
$$

Case 1 gives T(n, k + 1, k) = ck log (k + 1). Moreover, the sum has i = (p – k – 1)/k terms, each at most log p, so

$$ck \sum_{j=0}^{(p-2k-1)/k} \log (p - jk) \le ck \sum_{j=1}^{i} \log p = c(p - k - 1) \log p = \Theta(p \log p).$$

Thus, T(n, p, k) = O(p log p).

Case 3: k = 1. By a derivation similar to that of T(n, p, k) in Case 2, it has been shown that T(n, p, 1) = Θ(p log p) [28].

Thus, we have T(n, p, k) = O(p log p). Moreover, since Theorem 2 gives C(n, p, k) = Θ(n/p), Algorithm E takes Θ(n/p) + O(p log p) time. If n = Ω(p² log p), then n/p = Ω(p log p), and thus Θ(n/p) + O(p log p) = Θ(n/p). Since the sequential solution for the prefix problem takes Θ(n) time, the algorithm is cost optimal when n = Ω(p² log p).

The modification of Algorithm B by using broadcast and scatter operations is very similar to the above modification, and results in the same communication time complexity. Thus, the resulting algorithm is also cost optimal when n = Ω(p² log p).
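To get a feel for the gap between Θ(p²) point-to-point steps and O(p log p) collective steps, the recurrence for T can be evaluated numerically. The sketch below is an illustration only: it assumes base-2 logarithms and c = c_1 + c_2 = 2 (the text leaves both constants unspecified), and the function names are hypothetical.

```python
import math

def collective_steps(p, k, c=2.0):
    """Sum of c * k * log2(p') over the recursion levels p' = p, p - k, ..., k + 1."""
    assert p >= k + 1 and (p - 1) % k == 0
    total = 0.0
    while p >= k + 1:
        total += c * k * math.log2(p)
        p -= k
    return total

def point_to_point_steps(p, k):
    """R(n, p, k) from Theorem 3, for k >= 2."""
    return (2 * k - 1) * (p - 1) * (p + k - 1) // (2 * k)

p, k = 61, 3
print(point_to_point_steps(p, k), round(collective_steps(p, k), 1))
# The collective count never exceeds c * (p - 1) * log2(p).
assert collective_steps(p, k) <= 2.0 * (p - 1) * math.log2(p)
```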

6. Precondition of algorithms

This section shows that a relation stronger than p < n is required to use the proposed algorithms effectively; in fact, a relation among n, p, and k is needed. Before going into the general case, we first use two examples to shed some light.

Suppose n = 1024, p = 256, and k = 85. From Theorem 2, we have C(1024, 256, 85) < 7. However, it is impossible to compute even the sum of 1024 inputs in 7 computation steps. Therefore, Theorem 2 does not hold in this situation.

As a more general case, suppose n = 1024, p = 256 = kq + 1, and q ≥ 2. From Theorem 1, v = 1024(256² – 256k + k + 1)/(256² + 256k + k + 1) inputs are assigned to the first 256 – k PEs, and n – v = 2kpn/(p² + kp + k + 1) = 512 × 1024k/(256² + 256k + k + 1) < 8k inputs are assigned to the last k PEs. That is, each of the last k PEs has at most 8 inputs. Then, in phase 2, P_{257–k} scatters at most 8 values to at most 8 of the 256 PEs for further computation, and thus at least 248 PEs are idle while the others are communicating and computing. This ineffective use of PEs also happens in phases 3 through k + 1.

Thus, in phase 1 at least kp inputs should be assigned to the last k PEs, which guarantees that in later phases each PE can be assigned at least one value to compute. Using n – v ≥ kp and Theorem 1, we see that

n – n(p² – kp + k + 1)/(p² + kp + k + 1) ≥ kp,

n ≥ (p² + kp + k + 1)/2.

This is the stronger precondition for running the presented algorithms effectively.
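The precondition is easy to test before choosing p and k. A minimal sketch follows (the function names are chosen for illustration); it uses integer arithmetic only and reproduces the n = 1024, p = 256, k = 85 example, for which the precondition fails.

```python
def satisfies_precondition(n, p, k):
    """True when n >= (p^2 + kp + k + 1)/2, i.e. each PE gets at least one value per phase."""
    return 2 * n >= p * p + k * p + k + 1

def smallest_valid_n(p, k):
    """Smallest integer n that meets the precondition for the given p and k."""
    return (p * p + k * p + k + 1 + 1) // 2   # ceiling of (p^2 + kp + k + 1)/2

print(satisfies_precondition(1024, 256, 85))   # False
print(smallest_valid_n(256, 85))               # 43691
```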

7. Comparison with other algorithms

Since Algorithm L is a special case of A(n, p, k), precisely A(n, p, 1), it takes C(n, p, 1) computation steps and R(n, p, 1) communication steps. We have shown that a larger k results in less computation time, i.e., C(n, p, 1) > C(n, p, k) for k ≥ 2. Specifically,

C(n, p, 1) – C(n, p, k) = 2n(k – 1)(p – 1)/[(p² + p + 2)(p² + kp + k + 1)] > 0.

We have also shown that, for k ≥ 2, a larger k results in more communication time. To compare the numbers of communication steps of A(n, p, k) and A(n, p, 1), note that the difference between R(n, p, 1) and R(n, p, k), where k ≥ 2, is

R(n, p, 1) – R(n, p, k) = (p² – p) – (2k – 1)(p – 1)(p + k – 1)/(2k)

= (p – 1)(p – (2k² – 3k + 1))/(2k).

Therefore, R(n, p, k) ≤ R(n, p, 1) when p ≥ 2k² – 3k + 1 and k ≥ 2. Taking both the computation and communication times into account, when p < 2k² – 3k + 1, we must know the ratio of the time required by a computation step to the time required by a communication step to decide which algorithm is faster, but when p ≥ 2k² – 3k + 1 and k ≥ 2, A(n, p, k) is definitely faster.

Algorithm B can also be faster than Algorithm L. Precisely, when k ≥ 2,

R(n, p, 1) – S(n, p, k) = (p² – p) – ((2k – 1)p²/(2k) – p/2 + k² – 2k + (k + 1)/(2k))

= (p² – kp – (2k³ – 4k² + k + 1))/(2k).

Therefore, when p² – kp > 2k³ – 4k² + k + 1 and k ≥ 2, Algorithm B is faster than Algorithm L.

Lin and Lin present a parallel prefix algorithm named PLL for the half-duplex multicomputer [33]; PLL requires 2n/p + 1.44 log_2 p – 1 computation steps and 1.44 log_2 p + 1 communication steps when using p PEs, where 10 ≤ p < n. The number of computation steps of Algorithm A or B is less than that of PLL, but the number of communication steps is greater than that of PLL.

Parallel algorithms requiring 2n/p + (m + 1) log_{m+1} p – 2 computation steps and log_{m+1} p communication steps on an m-port multicomputer of p < n PEs have also been presented [38]. With m-port communication, each PE has m input ports and m output ports to communicate with other PEs in one communication step, for some m ≥ 1. The number of communication steps of Algorithm E is larger than those of the previous algorithms. Thus, Algorithm E must have a C(n, p, k) that is small enough for it to be faster than the other algorithms. Because all the other algorithms need more than 2n/p computation steps, Algorithm E may be faster than the others when the computation improvement is large enough to compensate for the communication disadvantage.

Note that the ratio of the time required by a computation step to the time required by a communication step has an impact on whether a new algorithm is faster. If the ratio is large, which may be true when the binary operation ⊕ is matrix multiplication, the new algorithms are more likely to be faster than previous ones.
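The role of this ratio can be made concrete with the step counts quoted above for Algorithm A (Theorems 2 and 3) and for PLL. The sketch below is illustrative, not a measured comparison: it uses σ, the ratio of communication-step time to computation-step time defined in Section 3 (the inverse of the ratio in the preceding paragraph), it models running time as computation steps plus σ times communication steps for both algorithms, and the function names are assumptions.

```python
import math

def steps_a(n, p, k):
    """(computation, communication) steps of A(n, p, k) from Theorems 2 and 3."""
    comp = 2 * n * (p + k) / (p * p + k * p + k + 1) - 1
    comm = p * (p - 1) if k == 1 else (2 * k - 1) * (p - 1) * (p + k - 1) / (2 * k)
    return comp, comm

def steps_pll(n, p):
    """(computation, communication) steps of PLL as quoted in the text, 10 <= p < n."""
    log_term = 1.44 * math.log2(p)
    return 2 * n / p + log_term - 1, log_term + 1

def break_even_sigma(n, p, k):
    """Largest sigma for which A(n, p, k) is not slower than PLL.

    Meaningful when A saves computation steps but needs more communication steps.
    """
    comp_a, comm_a = steps_a(n, p, k)
    comp_pll, comm_pll = steps_pll(n, p)
    return (comp_pll - comp_a) / (comm_a - comm_pll)

# With n = 10**6 inputs on p = 16 PEs and k = 15, A(n, p, k) wins as long as a
# communication step costs less than about nine computation steps.
print(round(break_even_sigma(10**6, 16, 15), 2))
```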

8. Conclusion

This paper has presented two families of parallel algorithms, A(n, p, k) and B(n, p, k), that run on half-duplex multicomputers with p PEs to solve the prefix problem of n inputs, where n ≥ (p² + kp + k + 1)/2. The numbers of computation steps and communication steps have been derived. The new algorithms are cost optimal when n = Ω(p³). When p = k + 1, the proposed algorithms take the least computation time and the most communication time among all prefix algorithms for multicomputers. For many possible values of the ratio of the time required by a communication step to the time required by a computation step, the user may find that one of the presented algorithms takes the minimal running time.

Either of the two families of algorithms can be transformed into another family by adopting broadcast and scatter operations to reduce the communication time. The resulting two families are cost optimal when n = Ω(p² log p).

For all the presented algorithms, the last k PEs are idle for some amount of time in phase 1. Thus, in phase 1, if we assign fewer than $\alpha_{p,k} n$ inputs to the first p – k PEs and thus more than $n(1 - \alpha_{p,k})$ inputs to the last k PEs, then the first p – k PEs can do fewer computations and the last k PEs can do more. Thus, the idle time can be eliminated. Note that since the first p – k PEs now take less computation time in phase 1, this phase can take less time. How many inputs should be assigned to the first p – k PEs is an open problem.

Each of the four families of algorithms provides the flexibility of choosing either less computation time (and more communication time) or less communication time (and more computation time) to achieve the minimal running time. This choice and, more importantly, whether the running time is less than that of other prefix algorithms hinge on the ratio of the time required by a communication step to the time required by a computation step.

Appendix. Proof of Theorem 1

We consider two cases separately.

Case 1: p = k + 1. In phase 1 the number of computation steps required by $P_1$ equals the number of parallel computation steps required by the other k PEs. That is,

\[
\alpha_{p,k} n - 1 = \frac{n - \alpha_{p,k} n}{k} - 1.
\]

Thus,

\[
\alpha_{p,k} = \frac{1}{k + 1} = \frac{1}{p}. \tag{5}
\]

Case 2: p = kq + 1, where q ≥ 2. The number of computation steps in phase 1 required by $P_1$ through $P_{p-k}$ equals that required by $P_{p-k+1}$ through $P_p$. That is,

\[
C(\alpha_{p,k} n, p - k, k) = \frac{n - \alpha_{p,k} n}{k} - 1. \tag{6}
\]

Phases 2 through k + 1 require a total of $(n - \alpha_{p,k} n)/p$ computation steps. Hence,

\[
\begin{aligned}
C(n, p, k) &= C(\alpha_{p,k} n, p - k, k) + \frac{n - \alpha_{p,k} n}{p} \\
&= \frac{n - \alpha_{p,k} n}{k} - 1 + \frac{n - \alpha_{p,k} n}{p} = \frac{n (p + k)(1 - \alpha_{p,k})}{kp} - 1.
\end{aligned} \tag{7}
\]

Substituting n and p in Eq. (7) with $\alpha_{p,k} n$ and p – k, respectively, we obtain

\[
C(\alpha_{p,k} n, p - k, k) = \frac{\alpha_{p,k}\, n\, p\, (1 - \alpha_{p-k,k})}{k (p - k)} - 1. \tag{8}
\]

From Eqs. (6) and (8) we have

\[
\frac{n - \alpha_{p,k} n}{k} - 1 = \frac{\alpha_{p,k}\, n\, p\, (1 - \alpha_{p-k,k})}{k (p - k)} - 1,
\]

\[
\alpha_{p,k} = \frac{p - k}{2p - k - p\,\alpha_{p-k,k}}. \tag{9}
\]
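Eq. (9), together with Eq. (5) as its base case, already determines $\alpha_{p,k}$ for every p = kq + 1. The following minimal sketch, which is not part of the original proof, unrolls that recurrence with exact rational arithmetic so that it can be compared with the closed form stated in Theorem 1. The function names are hypothetical.

```python
from fractions import Fraction


def alpha_recursive(p, k):
    """alpha_{p,k} from Eq. (5) (base case p = k + 1) and Eq. (9) (recurrence)."""
    if p < k + 1 or (p - 1) % k != 0:
        raise ValueError("p must have the form kq + 1 with q >= 1")
    if p == k + 1:
        return Fraction(1, p)
    return Fraction(p - k) / (2 * p - k - p * alpha_recursive(p - k, k))


def alpha_closed(p, k):
    """Closed form of Theorem 1: (p^2 - kp + k + 1) / (p^2 + kp + k + 1)."""
    return Fraction(p * p - k * p + k + 1, p * p + k * p + k + 1)
```

For p = 7 and k = 3 both functions return 16/37, the fraction used in the worked example of Section 2.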

Then, let $r_i = ip - ki(i + 1)/2$ and $s_i = (i + 1)p - ki(i + 1)/2$. We prove by induction on i that

\[
\alpha_{p,k} = \frac{r_i - r_{i-1}\,\alpha_{p-ki,k}}{s_i - s_{i-1}\,\alpha_{p-ki,k}} \quad \text{for } i \ge 1. \tag{10}
\]

Base step: Since $r_0 = 0$, $s_0 = p$, $r_1 = p - k$, and $s_1 = 2p - k$, we have

\[
\frac{r_1 - r_0\,\alpha_{p-k,k}}{s_1 - s_0\,\alpha_{p-k,k}} = \frac{p - k}{2p - k - p\,\alpha_{p-k,k}}.
\]

From Eq. (9), we have

\[
\alpha_{p,k} = \frac{r_1 - r_0\,\alpha_{p-k,k}}{s_1 - s_0\,\alpha_{p-k,k}}.
\]

Induction step: Assume that

\[
\alpha_{p,k} = \frac{r_t - r_{t-1}\,\alpha_{p-kt,k}}{s_t - s_{t-1}\,\alpha_{p-kt,k}}; \tag{11}
\]

we are to show that

\[
\alpha_{p,k} = \frac{r_{t+1} - r_t\,\alpha_{p-k(t+1),k}}{s_{t+1} - s_t\,\alpha_{p-k(t+1),k}}. \tag{12}
\]

Substituting p in Eq. (9) with p – kt, we obtain

\[
\alpha_{p-kt,k} = \frac{p - kt - k}{2(p - kt) - k - (p - kt)\,\alpha_{p-k(t+1),k}}.
\]

Thus, Eq. (11) can be rewritten

\[
\begin{aligned}
\alpha_{p,k} &= \frac{r_t - r_{t-1}\,\dfrac{p - kt - k}{2p - 2kt - k - (p - kt)\,\alpha_{p-k(t+1),k}}}{s_t - s_{t-1}\,\dfrac{p - kt - k}{2p - 2kt - k - (p - kt)\,\alpha_{p-k(t+1),k}}} \\
&= \frac{r_t \bigl(2p - 2kt - k - (p - kt)\,\alpha_{p-k(t+1),k}\bigr) - r_{t-1}(p - kt - k)}{s_t \bigl(2p - 2kt - k - (p - kt)\,\alpha_{p-k(t+1),k}\bigr) - s_{t-1}(p - kt - k)} \\
&= \frac{r_t (2p - 2kt - k) - r_{t-1}(p - kt - k) - r_t (p - kt)\,\alpha_{p-k(t+1),k}}{s_t (2p - 2kt - k) - s_{t-1}(p - kt - k) - s_t (p - kt)\,\alpha_{p-k(t+1),k}}.
\end{aligned} \tag{13}
\]

By definition, $r_{t+1} = (t + 1)p - k(t + 1)(t + 2)/2$, $r_t = tp - kt(t + 1)/2$, and $r_{t-1} = (t - 1)p - k(t - 1)t/2$. These lead to

\[
r_{t-1} = r_t - p + kt, \qquad r_{t+1} = r_t + p - kt - k.
\]

Thus,

\[
\begin{aligned}
r_t (2p - 2kt - k) - r_{t-1}(p - kt - k) &= r_t (2p - 2kt - k) - (r_t - p + kt)(p - kt - k) \\
&= (p - kt)(r_t + p - kt - k) \\
&= (p - kt)\, r_{t+1}.
\end{aligned} \tag{14}
\]

By definition, $s_{t+1} = (t + 2)p - k(t + 1)(t + 2)/2$, $s_t = (t + 1)p - kt(t + 1)/2$, and $s_{t-1} = tp - k(t - 1)t/2$. These lead to

\[
s_{t-1} = s_t - p + kt, \qquad s_{t+1} = s_t + p - kt - k.
\]

Thus,

\[
\begin{aligned}
s_t (2p - 2kt - k) - s_{t-1}(p - kt - k) &= s_t (2p - 2kt - k) - (s_t - p + kt)(p - kt - k) \\
&= (p - kt)(s_t + p - kt - k) \\
&= (p - kt)\, s_{t+1}.
\end{aligned} \tag{15}
\]

Using Eqs. (14) and (15), we see that Eq. (13) can be rewritten

\[
\alpha_{p,k} = \frac{(p - kt)\, r_{t+1} - r_t (p - kt)\,\alpha_{p-k(t+1),k}}{(p - kt)\, s_{t+1} - s_t (p - kt)\,\alpha_{p-k(t+1),k}}.
\]

This can be reduced to Eq. (12), and thus proves Eq. (10).

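Setting i = (p – k – 1)/k for Eq. (10), we have

\[
\alpha_{p,k} = \frac{r_i - r_{i-1}\,\alpha_{k+1,k}}{s_i - s_{i-1}\,\alpha_{k+1,k}}, \quad \text{where } i = (p - k - 1)/k.
\]

From Eq. (5), $\alpha_{k+1,k} = 1/(k + 1)$. In addition, the definition of $r_i$ implies $r_i = r_{i-1} + p - ki$, and the definition of $s_i$ implies $s_i = s_{i-1} + p - ki$. Thus,

\[
\alpha_{p,k} = \frac{r_{i-1} + p - ki - \dfrac{r_{i-1}}{k + 1}}{s_{i-1} + p - ki - \dfrac{s_{i-1}}{k + 1}} = \frac{\dfrac{k}{k + 1}\, r_{i-1} + k + 1}{\dfrac{k}{k + 1}\, s_{i-1} + k + 1}, \quad \text{where } i = (p - k - 1)/k. \tag{16}
\]

Since i – 1 = (p – 2k – 1)/k, from the definitions of $r_i$ and $s_i$, we have

\[
\begin{aligned}
r_{i-1} &= \frac{p - 2k - 1}{k}\, p - \frac{k}{2} \cdot \frac{p - 2k - 1}{k} \cdot \frac{p - k - 1}{k}, \\
s_{i-1} &= \frac{p - k - 1}{k}\, p - \frac{k}{2} \cdot \frac{p - 2k - 1}{k} \cdot \frac{p - k - 1}{k}.
\end{aligned}
\]

Thus, Eq. (16) can be written

\[
\begin{aligned}
\alpha_{p,k} &= \frac{\dfrac{k}{k + 1} \left( \dfrac{p - 2k - 1}{k}\, p - \dfrac{k}{2} \cdot \dfrac{p - 2k - 1}{k} \cdot \dfrac{p - k - 1}{k} \right) + k + 1}{\dfrac{k}{k + 1} \left( \dfrac{p - k - 1}{k}\, p - \dfrac{k}{2} \cdot \dfrac{p - 2k - 1}{k} \cdot \dfrac{p - k - 1}{k} \right) + k + 1} \\
&= \frac{\dfrac{k}{k + 1} \cdot \dfrac{p^2 - pk - 2k^2 - 3k - 1}{2k} + k + 1}{\dfrac{k}{k + 1} \cdot \dfrac{p^2 + pk - 2k^2 - 3k - 1}{2k} + k + 1} \\
&= \frac{\dfrac{p^2 - pk - 2k^2 - 3k - 1}{2(k + 1)} + k + 1}{\dfrac{p^2 + pk - 2k^2 - 3k - 1}{2(k + 1)} + k + 1} \\
&= \frac{p^2 - pk - 2k^2 - 3k - 1 + 2(k + 1)(k + 1)}{p^2 + pk - 2k^2 - 3k - 1 + 2(k + 1)(k + 1)} \\
&= \frac{p^2 - pk + k + 1}{p^2 + pk + k + 1}.
\end{aligned}
\]

Q.E.D.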
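As an independent sanity check on the induction, the following minimal sketch (not part of the original proof) substitutes the closed form of Theorem 1 into Eq. (10) and tests the equality for every admissible i, that is, every i for which p – ki is still of the form kq' + 1 with q' ≥ 1. The function names are hypothetical.

```python
from fractions import Fraction


def alpha(p, k):
    """Closed form of Theorem 1."""
    return Fraction(p * p - k * p + k + 1, p * p + k * p + k + 1)


def r(i, p, k):
    """r_i = ip - ki(i + 1)/2."""
    return i * p - Fraction(k * i * (i + 1), 2)


def s(i, p, k):
    """s_i = (i + 1)p - ki(i + 1)/2."""
    return (i + 1) * p - Fraction(k * i * (i + 1), 2)


def eq10_holds(p, k):
    """Check Eq. (10) for i = 1, ..., q - 1, where p = kq + 1."""
    q = (p - 1) // k
    for i in range(1, q):
        a = alpha(p - k * i, k)
        rhs = (r(i, p, k) - r(i - 1, p, k) * a) / (s(i, p, k) - s(i - 1, p, k) * a)
        if rhs != alpha(p, k):
            return False
    return True
```

The last admissible value, i = q – 1 = (p – k – 1)/k, is the one used in the final step of the proof, where p – ki = k + 1.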


References

[1] S.G. Akl, Parallel Computation: Models and Methods, Prentice-Hall, Upper Saddle River, NJ, 1997.

[2] S. Aluru, N. Futamura, K. Mehrotra, Parallel biological sequence comparison using prefix computations, J. Parallel Distrib. Comput. 63 (3) (2003) 264-272.

[3] A. Bilgory, D.D. Gajski, A heuristic for suffix solutions, IEEE Trans. Comput. C-35 (1) (1986) 34-42.

[4] G.E. Blelloch, Scans as primitive operations, IEEE Trans. Comput. 38 (11) (1989) 1526-1538.

[5] R.P. Brent, H.T. Kung, A regular layout for parallel adders, IEEE Trans. Comput. C-31 (3) (1982) 260-264.

[6] D.A. Carlson, B. Sugla, Limited width parallel prefix circuits, J. Supercomput. 4 (2) (1990) 107-129.

[7] L. Cinque, G. Bongiovanni, Parallel prefix computation on a pyramid computer, Pattern Recognition Lett. 16 (1) (1995) 19-22.

[8] R. Cole, U. Vishkin, Faster optimal parallel prefix sums and list ranking, Inform. Contr. 81 (1989) 334-352.

[9] A. Datta, Multiple addition and prefix sum on a linear array with a reconfigurable pipelined bus system, J. Supercomput. 29 (3) (2004) 303-317.

[10] C. Efstathiou, H.T. Vergos, D. Nikolos, Fast parallel-prefix modulo 2^n + 1 adders, IEEE Trans. Comput. 53 (9) (2004) 1211-1216.

[11] O. Egecioglu, C.K. Koc, Parallel prefix computation with few processors, Computers Math. Applic. 24 (4) (1992) 77-84.

[12] A. Ferreira, S. Ubeda, Parallel complexity of the medial axis computation, in: Proc. Int. Conf. on Image Processing, Washington, D.C., vol. 2, 1995, pp. 105-108.

[13] F.E. Fich, New bounds for parallel prefix circuits, in: Proc. 15th Symp. on the Theory of Computing, 1983, pp. 100-109.

[14] A.L. Fisher, A.M. Ghuloum, Parallelizing complex scans and reductions, in: Proc. ACM SIGPLAN '94 Conf. on Programming Language Design and Implementation, Orlando, FL, 1994, pp. 135-146.

[15] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, Cambridge, MA, 1994.

[16] T. Han, D.A. Carlson, Fast area-efficient VLSI adders, in: Proc. 8th Computer Arithmetic Symp., Como, Italy, 1987, pp. 49-56.

[17] D.R. Helman, J. JaJa, Prefix computations on symmetric multiprocessors, J. Parallel Distrib. Comput. 61 (2001) 265-278.

[18] Inmos, The Transputer Databook, 3rd ed., Inmos, Almondsbury, Bristol, UK, 1992.

[19] P.M. Kogge, H.S. Stone, A parallel algorithm for the efficient solution of a general class of recurrence equations, IEEE Trans. Comput. C-22 (8) (1973) 783-791.

[20] D.W. Krumme, G. Cybenko, K.N. Venkataraman, Gossiping in minimal time, SIAM J. Comput. 21 (1) (1992) 111-139.

[21] C.P. Kruskal, T. Madej, L. Rudolph, Parallel prefix on fully connected direct connection machines, in: Proc. Int. Conf. on Parallel Processing, St. Charles, IL, 1986, pp. 278-284.

[22] C.P. Kruskal, L. Rudolph, M. Snir, The power of parallel prefix, IEEE Trans. Comput. C-34 (10) (1985) 965-968.

[23] R.E. Ladner, M.J. Fischer, Parallel prefix computation, J. ACM 27 (4) (1980) 831-838.

[24] S. Lakshmivarahan, S.K. Dhall, Parallel Computing Using the Prefix Problem, Oxford University Press, Oxford, UK, 1994.

[25] S. Lakshmivarahan, C.M. Yang, S.K. Dhall, On a new class of optimal parallel prefix circuits with (size + depth) = 2n – 2 and log n ≤ depth ≤ (2 log n – 3), in: Proc. Int. Conf. on Parallel Processing, St. Charles, IL, 1987, pp. 58-65.

[26] F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, San Mateo, CA, 1992.

[27] R. Lin, K. Nakano, S. Olariu, M.C. Pinotti, J.L. Schwing, A.Y. Zomaya, Scalable hardware-algorithms for binary prefix sums, IEEE Trans. Parallel Distributed Syst. 11 (8) (2000) 838-850.

[28] Y.-C. Lin, A family of computation-efficient parallel prefix algorithms, WSEAS Trans. Comput. 5 (12) (2006) 3060-3066.

[29] Y.-C. Lin, Optimal parallel prefix circuits with fan-out 2 and corresponding parallel algorithms, Neural, Parallel & Scientific Computations 7 (1) (1999) 33-42.

[30] Y.-C. Lin, J.-N. Chen, Z4: A new depth-size optimal parallel prefix circuit with small depth, Neural, Parallel & Scientific Computations 11 (3) (2003) 221-235.

[31] Y.-C. Lin, J.-W. Hsiao, A new approach to constructing optimal parallel prefix circuits with small depth, J. Parallel Distrib. Comput. 64 (1) (2004) 97-107.

[32] Y.-C. Lin, Y.-H. Hsu, C.-K. Liu, Constructing H4, a fast depth-size optimal parallel prefix circuit, J. Supercomput. 24 (3) (2003) 279-304.

[33] Y.-C. Lin, C.M. Lin, Efficient parallel prefix algorithms on multicomputers, J. Information Science and Engineering 16 (1) (2000) 41-64.

[34] Y.-C. Lin, C.-K. Liu, Finding optimal parallel prefix circuits with fan-out 2 in constant time, Inform. Process. Lett. 70 (4) (1999) 191-195.

[35] Y.-C. Lin, C.-C. Shih, A new class of depth-size optimal parallel prefix circuits, J. Supercomput. 14 (1) (1999) 39-52.

[36] Y.-C. Lin, C.-C. Shih, Optimal parallel prefix circuits with fan-out at most 4, in: Proc. 2nd IASTED Int. Conf. on Parallel and Distributed Computing and Networks, Brisbane, Australia, 1998, pp. 312-317.

[37] Y.-C. Lin, C.-Y. Su, Faster optimal parallel prefix circuits: New algorithmic construction, J. Parallel Distrib. Comput. 65 (12) (2005) 1585-1595.

[38] Y.-C. Lin, C.-S. Yeh, Efficient parallel prefix algorithms on multiport message-passing systems, Inform. Process. Lett. 71 (2) (1999) 91-95.

[39] Y.-C. Lin, C.-S. Yeh, Optimal parallel prefix on the postal model, J. Information Science and Engineering 19 (1) (2003) 75-83.

[40] J. Liu, S. Zhou, H. Zhu, C.-K. Cheng, An algorithmic approach for generic parallel adders, in: Proc. ICCAD, San Jose, CA, 2003, pp. 734-740.

[41] R. Manohar, J.A. Tierno, Asynchronous parallel prefix computation, IEEE Trans. Comput. 47 (11) (1998) 1244-1252.

[42] E.E. Santos, Optimal and efficient algorithms for summing and prefix summing on parallel machines, J. Parallel Distrib. Comput. 62 (2002) 517-543.

[43] M. Sheeran, I. Parberry, A new approach to the design of optimal parallel prefix circuits, Technical Report No. 2006:1, Department of Computer Science and Engineering, Chalmers University of Technology, Goteborg, Sweden, 2006.

[44] M. Snir, Depth-size trade-offs for parallel prefix computation, J. Algorithms 7 (1986) 185-201.

[45] M. Snir, P. Hochschild, D.D. Frye, K.J. Gildea, The communication software and parallel environment of the IBM SP2, IBM Syst. J. 34 (2) (1995) 205-221.

[46] Thinking Machines, Connection Machine Parallel Instruction Set (PARIS), Thinking Machines, Cambridge, MA, 1986.

[47] H. Wang, A. Nicolau, K.S. Siu, The strict time lower bound and optimal schedules for parallel prefix with resource constraints, IEEE Trans. Comput. 45 (11) (1996) 1257-1271.

[48] Z. Xu, K. Hwang, Modeling communication overhead: MPI and MPL performance on the IBM SP2, IEEE Parallel & Distributed Technology 4 (1) (1996) 9-23.

[49] F. Zhou, P. Kornerup, Computing moments by prefix sums, J. VLSI Signal Process. Systems 25 (1) (2000) 5-17.

[50] H. Zhu, C.-K. Cheng, R. Graham, Constructing zero-deficiency parallel prefix circuits of minimum depth, ACM Trans. on Design Automation of Electronic Systems 11 (2) (2006) 387-409.

[51] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and Their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH), Zurich, 1997.


1. Introduction

Because of its importance and usefulness, prefix computation has been proposed as a primitive operation [4]. In fact, prefix computation is a built-in operation for Message-Passing Interface (MPI) parallel programming [15], and is implemented in hardware in the Thinking Machines CM-5 [46].

Additionally, many parallel prefix algorithms for various parallel computing models have been proposed [1, 7, 9, 17, 22, 24, 27, 29, 33, 38, 39, 41, 42], and many parallel prefix circuits have also been designed [3, 5, 6, 13, 16, 19, 23-25, 29-32, 34-37, 40, 43, 44, 50-52].

This paper presents four families of computation-efficient parallel prefix algorithms for message-passing multicomputers with p processing elements (PEs), where p < n. Half-duplex communication is the weakest communication mode of message-passing multicomputers; with it, each PE of a multicomputer can either send or receive a message, but not both, in a communication step.

Egecioglu and Koc present a computation-efficient parallel prefix algorithm, henceforth named EK, for the half-duplex multicomputer model with p processing elements (PEs), where p < n [11].

Collective communication operations, such as broadcast, are provided for ease of programming and efficiency; they can achieve the same effect as a sequence of point-to-point operations but in less time [48]. We thus use collective communications to further improve the communication time, and obtain two more families of algorithms from the first two families, respectively.

2. A family of parallel prefix algorithms

This section describes a family of parallel algorithms for solving the prefix problem of n inputs using p PEs, where p < n, on the half-duplex multicomputer model. A later section will elaborate on the exact relation between p and n for the algorithms to be useful. The p PEs are represented by $P_1, P_2, \ldots, P_p$. For ease of presentation, i:j is used to represent the result of computing $x_i \oplus x_{i+1} \oplus \cdots \oplus x_j$, where i ≤ j. Like Algorithm EK and Lin, the proposed algorithms are most practical when the amount of time required to perform a binary operation ⊕ is greater than that required to transfer a message between two PEs. This situation may happen, for example, when the binary operation is time-consuming matrix multiplication.

Algorithm A(n, p, k)

{Solving the prefix problem of n inputs, $x_1, x_2, \ldots, x_n$, using p PEs to generate $y_1, y_2, \ldots, y_n$, where n > p = kq + 1, k ≥ 1, q ≥ 1. For ease of presentation, assume that all numerical values are integers.}

Phase 1: Partition the inputs into two parts $N_1 = (x_1, x_2, \ldots, x_v)$ and $N_2 = (x_{v+1}, x_{v+2}, \ldots, x_n)$, where 0 < v < n. How the value of v is determined will be explained shortly. If p = k + 1, then $P_1$ uses $N_1$ to compute outputs $y_1, y_2, \ldots, y_v$ sequentially; otherwise, $P_1, P_2, \ldots, P_{p-k}$ use $N_1$ to compute $y_1, y_2, \ldots, y_v$

Phase 2: Initially, $P_{p-k}$ sends $y_v$ to all the other PEs. Next, $P_{p-k+1}$ scatters, i.e., partitions and distributes, $z_1$ among all the PEs evenly, each PE having c/p of the c values. All the PEs then concurrently compute $y_{v+i} = y_v \oplus z_{1,i}$, i = 1, 2,…, c, in c/p computation steps. Note that $y_{v+c}$ is computed by $P_p$.

Phase 2: $P_1$ initially sends $y_{n/4}$ to $P_2$, $P_3$, $P_4$ in 3 communication steps. Then, $P_2$ sends 1/4 of the n/4 prefixes computed in phase 1 to each of the other three PEs in 3 communication steps. That is, $z_{1,1}$ through $z_{1,n/16}$, $z_{1,n/8+1}$ through $z_{1,3n/16}$, and $z_{1,3n/16+1}$ through $z_{1,n/4}$ are sent to $P_1$, $P_3$, $P_4$, respectively. Subsequently, the four PEs compute n/4 outputs $y_{n/4+1}, y_{n/4+2}, \ldots, y_{n/2}$ in n/16 parallel computation steps. At the end, $P_4$ has $y_{n/2}$.

Phase 4: $P_4$ sends $y_{3n/4}$ and 1/4 of the n/4 prefixes computed in phase 1 to each of the other three PEs in 3 communication steps. That is, $z_{3,1}$ through $z_{3,n/16}$, $z_{3,n/16+1}$ through $z_{3,n/8}$, and $z_{3,n/8+1}$ through $z_{3,3n/16}$ are sent to $P_1$, $P_2$, $P_3$, respectively. Subsequently, the four PEs concurrently compute n/4 outputs $y_{3n/4+1}, y_{3n/4+2}, \ldots, y_n$ in n/16 computation steps.

Therefore, the total number of computation steps is

C(n, 4, 3) = (n/4 – 1) + n/16 + n/16 + n/16 = 7n/16 – 1. (1)

The total number of communication steps is

R(n, 4, 3) = 3 × 2 + 3 × 2 + 3 = 15.
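The computation-step count of this example can be reproduced by a small sequential simulation of the case p = k + 1. The sketch below is only an illustration under stated assumptions: it models each PE's work as a list operation, ignores communication entirely, assumes that p² divides n, and uses hypothetical names. It also shows that only associativity of ⊕ is needed, since the usage check relies on string concatenation, which is not commutative.

```python
from itertools import accumulate


def simulate_base_case(xs, k, op):
    """Simulate A(n, k + 1, k) sequentially.

    Returns (all n prefixes, number of parallel computation steps).
    Assumes p*p divides n, where p = k + 1.
    """
    p = k + 1
    n = len(xs)
    if n % (p * p) != 0:
        raise ValueError("this sketch assumes that p*p divides n")
    c = n // p  # v = c = n/p when p = k + 1

    # Phase 1: P_1 computes y_1..y_v; each of the other k PEs computes
    # the local prefixes z_j of its own block, all in c - 1 parallel steps.
    ys = list(accumulate(xs[:c], op))
    zs = [list(accumulate(xs[j * c:(j + 1) * c], op)) for j in range(1, p)]
    steps = c - 1

    # Phases 2..k+1: the p PEs share the c combinations y_prev (+) z_{j,i},
    # which takes c/p parallel steps per phase.
    for z in zs:
        y_prev = ys[-1]
        ys.extend(op(y_prev, z_i) for z_i in z)
        steps += c // p
    return ys, steps
```

With n = 16 and k = 3, the simulation returns the same prefixes as a sequential scan and reports 6 computation steps, in agreement with 7n/16 – 1.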

Next, consider the case when p = 7, k = 3.

Phase 1: Assign $N_1 = (x_1, x_2, \ldots, x_v)$ to the first four PEs, and assign $N_2 = (x_{v+1}, x_{v+2}, \ldots, x_n)$ to the last three PEs. From Eq. (1), we know that $P_1$, $P_2$, $P_3$, $P_4$ can compute the prefixes of $N_1$ in C(v, 4, 3) = 7v/16 – 1 computation steps. In the meantime, $P_5$, $P_6$, $P_7$ share $N_2$ evenly and compute their respective prefixes concurrently, taking (n – v)/3 – 1 computation steps. By the rule of deciding v,

7v/16 – 1 = (n – v)/3 – 1,

v = 16n/37.

Consequently, c = (n – v)/3 = 7n/37. Thus, the prefixes $z_{1,1}$ through $z_{1,7n/37}$ are computed in $P_5$, $z_{2,1}$ through $z_{2,7n/37}$ in $P_6$, and $z_{3,1}$ through $z_{3,7n/37}$ in $P_7$. Note that $P_4$ obtains $y_v = y_{16n/37}$, and C(16n/37, 4, 3) = 7n/37 – 1.

Phase 2: $P_4$ initially sends $y_{16n/37}$ to the other six PEs in 6 communication steps. Then, $P_5$ sends 1/7 of the 7n/37 prefixes computed in phase 1 to each of the other six PEs in 6 communication steps.

Subsequently, the seven PEs compute 7n/37 outputs $y_{16n/37+1}, y_{16n/37+2}, \ldots, y_{23n/37}$ in n/37 parallel computation steps. At the end, $P_7$ has $y_{23n/37}$.

Phase 3: $P_7$ initially sends $y_{23n/37}$ to the other six PEs in 6 communication steps. Then, $P_6$ sends 1/7 of the 7n/37 prefixes computed in phase 1 to each of the other six PEs in 6 communication steps.

Subsequently, the seven PEs concurrently compute 7n/37 outputs $y_{23n/37+1}, y_{23n/37+2}, \ldots, y_{30n/37}$ in n/37 computation steps. At the end, $P_7$ has $y_{30n/37}$.

Phase 4: $P_7$ initially sends $y_{30n/37}$ and 1/7 of the 7n/37 prefixes computed in phase 1 to each of the other six PEs in 6 communication steps. The seven PEs then concurrently compute 7n/37 outputs $y_{30n/37+1}, y_{30n/37+2}, \ldots, y_n$ in n/37 computation steps.

Therefore, the total number of computation steps is

C(n, 7, 3) = 7n/37 – 1 + (n/37) × 3 = 10n/37 – 1.

The total number of communication steps is

R(n, 7, 3) = R(16n/37, 4, 3) + 6 × 5 = 45.
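The two worked examples follow the same recursive pattern, which can be sketched as below. This is an illustrative reading of the examples rather than the general derivation: the communication count simply extends the pattern visible for k = 3 (two rounds of p – 1 steps in each of phases 2 through k, and one round in the last phase), which is an assumption for other parameters, so the sketch is restricted to k ≥ 2. The function name is hypothetical.

```python
from fractions import Fraction


def step_counts(p, k):
    """Return (a, R) such that C(n, p, k) = a*n - 1 and R is the number of
    communication steps, following the pattern of the worked examples."""
    if k < 2 or p < k + 1 or (p - 1) % k != 0:
        raise ValueError("sketch covers p = kq + 1 with q >= 1 and k >= 2")
    # Assumed pattern: phases 2..k take 2(p - 1) steps each, the last phase p - 1.
    comm_here = (2 * k - 1) * (p - 1)
    if p == k + 1:
        # Phase 1 takes n/p - 1 steps; the k later phases take (n/p)/p steps each.
        return Fraction(p + k, p * p), comm_here
    a_sub, comm_sub = step_counts(p - k, k)
    # Balance rule for v: a_sub*v - 1 = (n - v)/k - 1, so v = n / (1 + k*a_sub).
    alpha = 1 / (1 + k * a_sub)
    # Phase 1 costs a_sub*v - 1; phases 2..k+1 cost (n - v)/p in total.
    a = a_sub * alpha + (1 - alpha) / p
    return a, comm_sub + comm_here
```

Calling `step_counts(4, 3)` gives (7/16, 15) and `step_counts(7, 3)` gives (10/37, 45), matching the two examples.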

In the next section, the numbers of computation steps and communication steps in the general case, as well as other properties of Algorithm A, are derived.

3. Properties of Algorithm A

This section gives the value of v, C(n, p, k), and R(n, p, k). The values of p and k that lead to the smallest C(n, p, k) and R(n, p, k) are also derived.

Let $v = \alpha_{p,k} n$, where $0 < \alpha_{p,k} < 1$. The value of $\alpha_{p,k}$ is given in the following theorem.

Theorem 1: When p = k + 1, $\alpha_{p,k} = 1/p$; otherwise,

\[
\alpha_{p,k} = \frac{p^2 - kp + k + 1}{p^2 + kp + k + 1}.
\]

Proof: Since the proof is very lengthy, it is given in the appendix.

Theorem 2:

\[
C(n, p, k) = \frac{2n(p + k)}{p^2 + kp + k + 1} - 1.
\]
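As a quick consistency check, substituting the parameters of the two worked examples of Section 2 into Theorems 1 and 2 reproduces the values obtained there:

```latex
\[
\alpha_{7,3} = \frac{49 - 21 + 3 + 1}{49 + 21 + 3 + 1} = \frac{32}{74} = \frac{16}{37},
\qquad
C(n, 7, 3) = \frac{2n(7 + 3)}{49 + 21 + 3 + 1} - 1 = \frac{10n}{37} - 1,
\]
\[
\alpha_{4,3} = \frac{1}{4},
\qquad
C(n, 4, 3) = \frac{2n(4 + 3)}{16 + 12 + 3 + 1} - 1 = \frac{14n}{32} - 1 = \frac{7n}{16} - 1.
\]
```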
