國立台灣科技大學
資訊工程系
National Taiwan University of Science and Technology
Department of Computer Science and Information Engineering
Families of Parallel Prefix Algorithms for Multicomputers
林彥君 Yen-Chun Lin 洪麗玲 Li-Ling Hung
Technical Report: NTUST-CSIE-07-03
June 2007
Abstract
Four families of parallel prefix algorithms for message-passing multicomputers are presented.
The first family generalizes previous algorithms that use only half-duplex communications. The second family, which is difficult to understand without knowing the first family, improves on the communication time of the first family. The third and fourth adopt collective communication operations to further reduce the communication time of the previous two, respectively. The proposed algorithms have the shortest computation time of all prefix algorithms for the multicomputer models. The precondition of the proposed algorithms is also derived. Each of the four families of algorithms provides the flexibility of choosing either less computation time (and more communication time) or less communication time (and more computation time) to achieve the minimal running time. The key in choosing this and, more importantly, whether the running time is less than other prefix algorithms hinge on the ratio of the time required by a communication step to the time required by a computation step.
Keywords: Collective communication, cost optimality, half-duplex, message-passing
multicomputers, parallel prefix algorithms, computation
1. Introduction
The prefix computation is defined as follows: given n inputs x 1 , x 2 ,…, x n , and an associative binary operator ⊕, compute y i = x 1 ⊕ x 2 ⊕…⊕ x i , for 1 ≤ i ≤ n. For ease of presentation, unless otherwise stated, this study assumes that x i ’s and y i ’s represent inputs and outputs, respectively, and the number of inputs is n. Prefix computation has been extensively studied for its wide application in fields such as biological sequence comparison, cryptography, design of silicon compilers, job scheduling, image processing, loop parallelization, polynomial evaluation, processor allocation, and sorting [1-3, 8, 10, 12, 14, 21-24, 26, 47, 49, 51, 52]. The binary operation ⊕ can be as simple as a Boolean operation or an extremely time-consuming multiplication of matrices [11].
Because of its importance and usefulness, prefix computation has been proposed as a primitive operation [4]. In fact, prefix computation is a built-in operation for Message-Passing Interface (MPI) parallel programming [15], and is implemented in hardware in the Thinking Machines CM-5 [46].
Additionally, many parallel prefix algorithms for various parallel computing models have been proposed [1, 7, 9, 17, 22, 24, 27, 29, 33, 38, 39, 41, 42], and many parallel prefix circuits have also been designed [3, 5, 6, 13, 16, 19, 23-25, 29-32, 34-37, 40, 43, 44, 50-52].
This paper presents four families of computation-efficient parallel prefix algorithms for
message-passing multicomputers with p processing elements (PEs), where p < n. Half-duplex
communication is the weakest communication mode of message-passing multicomputers, with
which each PE of a multicomputer can only send or receive a message in a communication step.
This model of communication is basic and important [20]. Upper bounds of this model apply to other communication models, and lower bounds of this model are upper bounds for lower bounds of other models. Although a PE of a modern multicomputer can send and receive in the same step, it usually takes a longer time to send and receive than to send or receive only due to the inherent hardware capability and software overhead [18, 45]. On a p-PE system, the half-duplex communication ensures that no more than p/2 messages are transferred in a communication step and thus a communication step will not take too much time.
Egecioglu and Koc present a computation-efficient parallel prefix algorithm, henceforth named EK, for the half-duplex multicomputer model with p processing elements (PEs), where p < n [11].
Lin propose an algorithm, henceforth named L, to reduce the communication time on the same model [28]. Our first family of algorithms generalize Algorithm L such that algorithms of the family allow many combinations of the computation time and communication time. The Algorithm L is at one extreme of the family. The others of the family take less computation time and more communication time than Algorithm L. Users can thus take the exact time of performing ⊕ and that of communicating a message into account to choose an algorithm that requires the minimal running time. This study then shows how the communication time of the first family of algorithms can be further decreased to become the second family, which is difficult to understand if the first family is not known.
For ease of programming and efficiency, collective communication operations, such as
broadcast, can achieve the same effect as a sequence of point-to-point operations but in less time [48]. We thus use collective communications to further improve the communication time, and obtain two more families of algorithms from the first two families, respectively.
The rest of this paper is organized as follows. Section 2 presents the first family of parallel prefix algorithms for half-duplex message-passing multicomputers. Section 3 then derives the computation time and communication time, as well as other properties, of the first family. Section 4 shows how the communication time of the first family can be reduced to become the second family;
the new communication time is also derived. Section 5 proposes reducing the communication time by using collective communications to obtain two more families of algorithms. Section 6 shows that all the proposed algorithms are not always effective when p < n, and derives a much stronger precondition of the algorithms. Section 7 compares the new algorithms with previous ones for multicomputers. Conclusions are finally drawn in Section 8.
2. A family of parallel prefix algorithms
This section describes a family of parallel algorithms for solving the prefix problem of n with
using p PEs, where p < n, on the half-duplex multicomputer model. A later section will elaborate the
exact relation between p and n for the algorithms to be useful. The p PEs are represented by P 1 ,
P 2 ,…, P p . For ease of presentation, i:j is used to represent the result of computing x i ⊕ x i+1 ⊕…⊕ x j ,
where i ≤ j. Like Algorithm EK and Lin, the proposed algorithms are most practical when the
amount of time required to perform a binary operation ⊕ is greater than that required to transfer a message between two PEs. This situation may happen, for example, when the binary operation is time-consuming matrix multiplication.
Algorithm A(n, p, k)
{Solving the prefix problem of n inputs, x 1 , x 2 ,…, x n , using p PEs to generate y 1 , y 2 ,…, y n , where n >
p = kq + 1, k ≥ 1, q ≥ 1. For ease of presentation, assume that all numerical values are integers.}
Phase 1: Partition the inputs into two parts N 1 = (x 1 , x 2 ,…, x v ) and N 2 = (x v+1 , x v+2 ,…, x n ), where 0 <
v < n. How the value of v is determined will be explained shortly. If p = k + 1, then P 1 uses N 1 to compute outputs y 1 , y 2 ,…, y v sequentially; otherwise, P 1 , P 2 ,…, P p–k use N 1 to compute y 1 , y 2 ,…, y v
by invoking A(v, p – k, k) recursively. In the mean time, N 2 is first distributed evenly among the other k PEs, P p–k+1 , P p–k+2 ,…, P p ; each of the PEs holds c = (n – v)/k input values. These k PEs then concurrently compute z 1 = (z 1,1 , z 1,2 ,…, z 1,c ), z 2 = (z 2,1 , z 2,2 ,…, z 2,c ),…, z k = (z k,1 , z k,2 ,…, z k,c ), respectively, where z i,j = (v + (i – 1)c + 1):(v + (i – 1)c + j). The value of v is chosen to make the total number of computation steps in this phase required by the first p – k PEs equal to that required by the other k PEs, and it is derived later. Note that y v is obtained by P p–k .
Phase 2: Initially, P p–k sends y v to all the other PEs. Next, P p–k+1 scatters, i.e., partitions and
distributes, z 1 among all the PEs evenly, each PE having c/p of the c values. All the PEs then
concurrently compute y v+i = y v ⊕ z 1,i , i = 1, 2,…, c, in c/p computation steps. Note that y v+c is
computed by P p .
Phase m (m = 3, 4,…, k + 1): Initially, P p sends y v+(m–2)c to all the other PEs. Next, P p–k+m–1 scatters z m–1 among all the PEs evenly, each PE having c/p values. All the PEs then concurrently compute y v+(m–2)c+i = y v+(m–2)c ⊕ z m–1,i , i = 1, 2,…, c, in c/p computation steps. Note that y v+(m–1)c is computed by P p .
Let C(n, p, k) denote the number of computation steps required by Algorithm A(n, p, k), and R(n, p, k) denote the number of communication steps. Like the two previous papers [11, 28], this paper does not take into account the initial input data loading time. To help understand the algorithm, we give two examples in the following. First, consider the case when p = 4, k = 3.
Phase 1: Assign N 1 = (x 1 , x 2 ,…, x v ) to P 1 and N 2 = (x v+1 , x v+2 ,…, x n ) to P 2 , P 3 , P 4 . The prefixes of N 1 are computed by P 1 , and those of N 2 by the other PEs. By the rule of deciding v, the number of computation steps required by P 1 , v – 1, equals the number of parallel computation steps required by the other three PEs, (n – v)/3 – 1. Hence, v = n/4. After phase 1 has completed, P 1 obtains y 1 , y 2 ,…, y n/4 ; P 2 obtains z 1,1 , z 1,2 ,…, z 1,n/4 ; P 3 obtains z 2,1 , z 2,2 ,…, z 2,n/4 ; and P 4 obtains z 3,1 , z 3,2 ,…, z 3,n/4 . It takes n/4 – 1 computation steps.
Phase 2: P 1 initially sends y n/4 to P 2 , P 3 , P 4 in 3 communication steps. Then, P 2 sends 1/4 of the
n/4 prefixes computed in phase 1 to each of the other three PEs in 3 communication steps. That is,
z 1,1 through z 1,n/16 , z 1,n/8+1 through z 1,3n/16 , and z 1,3n/16+1 through z 1,n/4 are sent to P 1 , P 3 , P 4 ,
respectively. Subsequently, the four PEs compute n/4 outputs y n/4+1 , y n/4+2 ,…, y n/2 in n/16 parallel computation steps. At the end, P 4 has y n/2 .
Phase 3: P 4 initially sends y n/2 to the other PEs in 3 communication steps. Then, P 3 sends 1/4 of the n/4 prefixes computed in phase 1 to each of the other three PEs in 3 communication steps. That is, z 2,1 through z 2,n/16 , z 2,n/16+1 through z 2, n/8 , and z 2,3n/16+1 through z 2,n/4 are sent to P 1 , P 2 , P 4 , respectively. Subsequently, the four PEs compute n/4 outputs y n/2+1 , y n/2+2 ,…, y 3n/4 concurrently in n/16 computation steps. At the end, P 4 has y 3n/4 .
Phase 4: P 4 sends y 3n/4 and 1/4 of the n/4 prefixes computed in phase 1 to each of the other three PEs in 3 communication steps. That is, z 3,1 through z 3,n/16 , z 3,n/16+1 through z 3,n/8 , and z 3,n/8+1
through z 3,3n/16 are sent to P 1 , P 2 , P 3 , respectively. Subsequently, the four PEs concurrently compute n/4 outputs y 3n/4+1 , y 3n/4+2 ,…, y n in n/16 computation steps.
Therefore, the total number of computation steps is
C(n, 4, 3) = (n/4 – 1) + n/16 + n/16 + n/16 = 7n/16 – 1. (1) The total number of communication steps is
R(n, 4, 3) = 3 × 2 + 3 × 2 + 3 = 15.
Next, consider the case when p = 7, k = 3.
Phase 1: Assign N 1 = (x 1 , x 2 ,…, x v ) to the first four PEs, and assign N 2 = (x v+1 , x v+2 ,…, x n ) to the last three PEs. From Eq. (1), we know that P 1 , P 2 , P 3 , P 4 can compute the prefixes of N 1 in C(v, 4, 3)
= 7v/16 – 1 computation steps. In the mean time, P 5 , P 6 , P 7 share N 2 evenly and compute their
respective prefixes concurrently, taking (n – v)/3 – 1 computation steps. By the rule of deciding v, 7v/16 – 1 = (n – v)/3 – 1,
v = 16n/37.
Consequently, c = (n – v)/3 = 7n/37. Thus, the prefixes z 1,1 through z 1,7n/37 are computed in P 5 , z 2,1
through z 2,7n/37 in P 6 , and z 3,1 through z 3,7n/37 in P 7 . Note that P 4 obtains y v = y 16n/37 , and C(16n/37, 4, 3) = 7n/37 – 1.
Phase 2: P 4 initially sends y 16n/37 to the other six PEs in 6 communication steps. Then, P 5 sends 1/7 of the 7n/37 prefixes computed in phase 1 to each of the other six PEs in 6 communication steps.
Subsequently, the seven PEs compute 7n/37 outputs y 16n/37+1 , y 16n/37+2 ,…, y 23n/37 in n/37 parallel computation steps. At the end, P 7 has y 23n/37 .
Phase 3: P 7 initially sends y 23n/37 to the other six PEs in 6 communication steps. Then, P 6 sends 1/7 of the 7n/37 prefixes computed in phase 1 to each of the other six PEs in 6 communication steps.
Subsequently, the seven PEs concurrently compute 7n/37 outputs y 23n/37+1 , y 23n/37+2 ,…, y 30n/37 in n/37 computation steps. At the end, P 7 has y 30n/37 .
Phase 4: P 7 initially sends y 30n/37 and 1/7 of the 7n/37 prefixes computed in phase 1 to each of the other six PEs in 6 communication steps. The seven PEs then concurrently compute 7n/37 outputs y 30n/37+1 , y 30n/37+2 ,…, y n in n/37 computation steps.
Therefore, the total number of computation steps is
C(n, 7, 3) = 7n/37 – 1 + (n/37) × 3 = 10n/37 – 1.
The total number of communication steps is
R(n, 7, 3) = R(16n/37, 4, 3) + 6 × 5 = 45.
In the next section, the numbers of computation steps and communication steps in the general case, as well as other properties of Algorithm A, are derived.
3. Properties of Algorithm A
This section gives the value of v, C(n, p, k), and R(n, p, k). The values of p and k that lead to the smallest C(n, p, k) and R(n, p, k) are also derived.
Let v = α p,k n, where 0 < α p,k < 1. The value of α p,k is given in the following theorem.
Theorem 1: When p = k + 1, α p,k = 1/p; otherwise,
α p,k =
2 2
1 1
p kp k
p kp k
− + +
+ + + . Proof: Since the proof is very length, it is given in the appendix.
Theorem 2:
2
2 ( )
( , , ) 1.
1 n p k C n p k
p kp k
= + −
+ + +
Proof: We consider two cases separately.
Case 1: p = k + 1. In phase 1 of Algorithm A(n, p, k), P 1 takes α p,k n – 1 computation steps to
sequentially compute the prefixes of the α p,k n inputs assigned, and α p,k n – 1 equals the number of
computation steps required by the other k PEs. In phases 2 through k + 1, totally (n – α p,k n)/p
computation steps are required to compute n – α p,k n values, precisely y v+1 , y v+2 ,…, y n , by all the p
PEs concurrently. Hence,
,
( , , )
p k,1 n
p kn .
C n p k n
p
α − α
= − +
From Theorem 1,
,1
p k
p
α = ; thus,
2
/ 2 1
( , , ) n 1 n n p p 1.
C n p k n
p p p
− −
= − + = −
Since p = k + 1, we have
2
2
2
2 1
( , , ) 2 1
2
2 1
( 1)
2 ( )
1 1.
C n p k n p p
p k n p p k n p k
p kp k
= − −
= + −
+ +
= + −
+ + +
Case 2: p = kq + 1, where q ≥ 2. The number of computation steps in phase 1 required by P 1
through P p–k equals that required by P p–k+1 through P p . That is,
,
(
p k,, , ) n
p kn 1.
C n p k k
k
α − = − α − (2)
As mentioned in case 1, phases 2 through k + 1 requires a total of (n – α p,k n)/p computation steps. Hence,
,
( , , ) (
p k,, , ) n
p kn .
C n p k C n p k k
p
α − α
= − +
By Eq. (2), this becomes
, ,
( )(1
,)
( , , ) n
p kn 1 n
p kn n p k
p k1.
C n p k
k p kp
α α α
− − + −
= − + = − (3)
By Theorem 1,
2
, 2
1 .
p k
1
p kp k
p kp k
α = − + +
+ + +
Eq. (3) then becomes
2 2
2
( )(1 1 )
2 ( )
( , , ) 1 1 1.
1
p kp k
n p k
n p k
p kp k
C n p k
kp p kp k
− + +
+ −
+
+ + +
= − = −
+ + + Q.E.D.
We can then consider the impact of the values of p and k on the computation time. Let a = bq 1
+ 1, d = eq 2 + 1, and b, e, q 1 , q 2 ≥ 1. Then
2 2
2 2
2 2
2 ( ) 2 ( )
( , , ) ( , , )
1 1
( )( 1) ( 1)( )
2 .
( 1)( 1)
n a b n d e
C n a b C n d e
a ab b d de e
a b d de e a ab b d e
n a ab b d de e
+ +
− = −
+ + + + + +
+ + + + − + + + +
= + + + + + +
The numerator above can be rewritten
ad 2 + ade + ae + a + db 2 + bde + be + b − (a 2 d + abd + bd + d + a 2 e + abe + be + e)
= ad(d − a) + ad(e − b) + (ae − bd) + (a − d) + (bd 2 − a 2 e) + be(d − a) + (b − e).
If a = d, the numerator becomes
a 2 (e − b) + a(e − b) + a 2 (b − e) + (b − e) = (e − b)(a − 1).
Clearly, if e > b, then the numerator is positive, and thus C(n, a, b) > C(n, a, e), i.e., A(n, p, e) takes
less computation time than A(n, p, b).
On the other hand, if b = e, the numerator becomes
ad(d − a) + b(a − d) + (a − d) + b(d 2 − a 2 ) + b 2 (d − a)
= (d − a)(ad − b − 1 + ab + bd + b 2 ).
Clearly, if d > a, then the numerator is positive, and thus C(n, a, b) > C(n, d, b).
To summarize, if d > a and e > b, then C(n, a, b) > C(n, d, b) > C(n, d, e). This means that
using as many PEs as possible and the maximal k, which equals p – 1, can achieve the minimal
computation time.
Theorem 3: R(n, p, 1) = p(p − 1);
R(n, p, k) = (2k − 1)(p – 1)(p + k – 1)/2k when k ≥ 2.
Proof: From Algorithm A, R(n, p, k) is the sum of the following components: (i) the number of communication steps, R(v, p – k, k), when the first p – k PEs perform A(v, p – k, k) in phase 1; (ii)
the number of communication steps required by P p–k to send y v to the other p – 1 PEs in phase 2; (iii)
the number of communication steps required by P p to send y v+ic to the other p – 1 PEs in phase i + 2,
for i = 1, 2,…, k – 1; and (iv) the number of communication steps taken to distribute a total of kc, or
n – v, z i,j values evenly among all the p PEs in phases 2 through k + 1. We consider the following
three cases.
Case 1: k = 1. The algorithm has only two phases, and it degenerates into Algorithm L. It has
been shown [28]
R(n, p, 1) = p(p – 1).
Case 2: k ≥ 2 and p = k + 1. Component (i) is 0; component (ii) is p – 1; component (iii) is
(k – 1)(p – 1), and component (iv) is k(p – 1). Note that in phase k + 1, component (iii) has the same
communication source and destinations as component (iv). These two components can become one
in phase k + 1; that is, p − 1 communication steps can be reduced. Hence,
R(n, p, k) = 0 + (p – 1) + (k – 1)(p – 1) + k(p – 1) − (p − 1)
= (p – 1)(2k – 1), (4)
which is equal to (2k – 1)(p – 1)(p + k – 1)/2k since p = k + 1.
Case 3: k ≥ 2 and p = k(i + 1) + 1, where i ≥ 1. Component (i) is R(v, p – k, k), which is
required by the first p – k PEs to communicate the prefixes of v inputs. Components (ii), (iii), and
(iv) are the same as those in case 2. Thus, using the result of case 2, Eq. (4), we have
R(n, p, k) = R(α p,k n, p – k, k) + (p – 1)(2k – 1)
= R(α p–k,k α p,k n, p – 2k, k) + (p – k – 1)(2k – 1) + (p – 1)(2k – 1)
= …
= R(α p–(i–1)k,k …α p–k,k α p,k n, p – ik, k) + (2k – 1)[(p – (i – 1)k – 1) +...+ (p – k – 1) + (p – 1)]
= R(α p–(i–1)k,k …α p–k,k α p,k n, p – ik, k) + (2k – 1)[i(p – 1) – (k + 2k +… + (i – 1)k)]
= R(α 2k+1,k …α p–k,k α p,k n, k + 1, k) + (2k – 1)[i(p – 1) – ( 1) 2 ki i −
]
= R(α 2k+1,k …α p–k,k α p,k n, k + 1, k) + 1 2 1
(2 1)[ ( 1) ( 1) ].
2
p k p k
k p p k
k k
− − − −
− − − − −
Since Eq. (4) can be rewritten
R(n, k + 1, k) = k (2k – 1),
which is independent of the value of n, we obtain
R(n, p, k) = k (2k – 1) + (2k – 1)
2 2 2
2 1 2 1 3 3 2
( )
2
p p kp k p p kp k k
k k
− + − + − + − + +
−
= (2k – 1)
2 2 2 2
2 2 4 2 2 2 2 1 3 3 2
2
k p p kp k p p kp k k
k
+ − + − + − + − + − −
= (2k – 1)(p – 1)(p + k – 1)/2k. Q.E.D.
Theorem 3 reveals the effect of p and k on the communication time. Clearly, a smaller p
results in less communication time. To see the effect of k, we first note that when k ≥ 2, a smaller k
leads to less communication time. In addition, R(n, p, 2) = 3(p – 1)(p + 1) > R(n, p, 1). Thus, a
smaller k also results in less communication time.
However, as already mentioned, a larger k or p leads to less computation time. Thus, it is not
straightforward to decide the best values of p or k that can achieve the least running time. Let σ be
the ratio of the time required by a communication step to the time required by a computation step.
The total running time of the algorithm is proportional to C(n, p, k) + σR(n, p, k), which should be
minimized by choosing appropriate p and k values to solve the prefix problem as fast as possible.
From Theorem 2, C(n, p, k) = Θ(n/p); from Theorem 3, R(n, p, k) = Θ(p 2 ). Totally, Algorithm
A takes Θ(n/p) + Θ(p 2 ) time. When n = Ω(p 3 ), n/p = Ω(p 2 ), and thus Θ(n/p) + Θ(p 2 ) = Θ(n/p). Since
the sequential solution for the prefix problem takes Θ(n) time, Algorithm A is cost optimal when n =
Ω(p 3 ).
4. Reducing communication time
This section proposes modifying Algorithm A when p ≥ 2k + 1 and k ≥ 2, to obtain a faster
algorithm named B. Recall that in the first phase of Algorithm A, the value of v is chosen to make
the number of computation steps required by the first p – k PEs equal to the number of computation
steps required by the other k PEs. Note that when p ≥ 2k + 1 and k ≥ 2, the first p – k PEs
communicate among themselves in phase 1, but the last k PEs perform no communication
operations; thus, the last k PEs are idle for the amount of time that the first p – k PEs take to
communicate.
To reduce this idle time, we note that in Algorithm A there are communications among the last
k PEs in phases 2 through k + 1, which can be performed in phase 1, but the communications
involving the first p – k PEs in phases 2 through k + 1 should not be moved to phase 1. The original
communications in each of these k phases that can be moved to phase 1 are transfers from a PE to
k – 1 PEs. Thus, totally k(k − 1) communication steps can be moved to phase 1.
However, we need to make sure whether all or only part of the k(k – 1) communication steps
should be moved. Comparing k(k – 1) with the number of communication steps in phase 1 of
Algorithm A, which is R(v, p – k, k), can clarify this. If R(v, p – k, k) > k(k – 1), the k(k – 1)
communication steps can all be performed in phase 1 without increasing the communication time.
From Theorem 3,
R(v, p – k, k) = (2k − 1)(p − 1)(p + k − 1)/2k.
Since p ≥ 2k + 1, we have
R(v, p – k, k) ≥ (2k − 1)(2k)(3k)/2k = 3k(2k − 1).
Thus,
R(v, p – k, k) – k(k – 1) ≥ 6k 2 – 3k – k 2 + k = k(5k – 2) > 0.
That is, R(v, p – k, k) > k(k − 1). Therefore, Algorithm B takes less communication time than
Algorithm A when p ≥ 2k + 1 and k ≥ 2. For the other valid values of p and k, the two algorithms are
the same and thus take the same amount of communication time. Note that Algorithm B takes the
same computation time as Algorithm A.
Let S(n, p, k) denote the number of communication steps required by Algorithm B(n, p, k).
When p ≥ 2k + 1 and k ≥ 2, S(n, p, k) is the sum of the following components: (i) S(v, p – k, k)
communication steps required by the first p – k PEs invoking B(v, p – k, k) in phase 1; (ii) p – 1
communication steps required by P p–k to send y v to the other p – 1 PEs in phase 2; (iii) (k – 1)(p – 1)
communication steps required by P p to send y v+ic to the other p – 1 PEs in phase i + 2, for i = 1, 2,…,
k – 1; and (iv) k(p – k) communication steps taken to distribute part of the z i,j values evenly among
P 1 , P 2 ,… P p–k in phases 2 through k + 1. Note that in phase k + 1, since the communications that
contribute to component (iii) and those that contribute to component (iv) are all sent out by P p , the
latter communications can be merged to the former ones, eliminating p − k communication steps in
this phase. Let p = k(i + 1) + 1, where i ≥ 1. Hence,
S(n, p, k) = S(α p,k n, p – k, k) + (p – 1) + (k – 1)(p – 1) + k (p – k) – (p – k)
= S(α p,k n, p – k, k) + k(p – 1) + (k – 1)(p – k)
= S(α p–k,k α p,k n, p – 2k, k) + k(p – k – 1) + (k – 1)(p – 2k) + k(p – 1) + (k – 1)(p – k)
= …
= S(α p–(i–1)k,k …α p–k,k α p,k n, p – ik, k) + k[(p – (i – 1)k – 1) +...+ (p – k – 1) + (p – 1)]
+ (k – 1)[(p – ik) +...+(p – 2k) + (p – k)]
= S(α p–(i–1)k,k …α p–k,k α p,k n, p – ik, k) + k[i(p – 1) – (k + 2k +… + (i – 1)k)]
+ (k – 1)[ip – (k + 2k +...+ ik)]
= S(α p–(i–1)k,k …α p–k,k α p,k n, p – ik, k) + k ( 1)
[ ( 1) ]
2 i p ik i −
− − + (k – 1) ( 1)
[ ]
2 ip ik i +
−
= S(α 2k+1,k …α p–k,k α p,k n, k + 1, k) + 1 ( 1) 2 1
[ ( 1) ]
2
p k p k p k
k p
k k
− − − − − −
− −
1 ( 1) 1
( 1)[ ]
2
p k p k p
k p
k k
− − − − −
+ − − .
Since S(n, p, k) = R(n, p, k) when p = k + 1, from Theorem 3, when k ≥ 2,
S(n, k + 1, k) = R(n, k + 1, k) = k(2k – 1).
Thus,
S(n, p, k) = k(2k – 1) + 1 ( 1) 2 1
[ ( 1) ]
2
p k p k p k
k p
k k
− − − − − −
− −
1 ( 1) 1
( 1)[ ]
2
p k p k p
k p
k k
− − − − −
+ − −
= k(2k – 1) + 2 1 2 k
k
− p 2 +
2 2
2 2
k k k k
k
− − +
p +
3 2 2
2 1
2
k k k k
k
− + − − +
= 2 1
21
21
2 .
2 2 2
k k
p p k k
k k
− +
− + − +
Clearly, S(n, p, k) = Θ(p 2 ). Therefore, the discussion of cost optimality of Algorithm A applies
to Algorithm B. That is, Algorithm B is also cost optimal when n = Ω(p 3 ).
5. Using collective communications
This section considers reducing the communication time of algorithms A and B when
collective communications are available. Let us first consider modifying Algorithm A to obtain a
new one named E. In phase i + 2, 0 ≤ i ≤ k – 1, of Algorithm A, the p – 1 transfers of y v+ic can be
replaced by a single broadcast to achieve the same effect. In addition, scattering of z i in phase i + 1,
1 ≤ i ≤ k, can be done with a single scatter operation.
With a suitable implementation, either a broadcast or scatter operation takes Θ(log p) time to
send to p PEs on a multicomputer. Thus, broadcasting y v , y v+c ,…, y v+(k–1)c is equivalent to take
c 1 k log p steps, where c 1 is a constant, and k scatter operations take c 2 k log p steps, where c 2 is a
constant. Let c = c 1 + c 2 and p = k(i + 1) + 1, where i ≥ 0 . We use T(n, p, k) to denote the number of communication steps required by Algorithm E(n, p, k), and consider the following three cases to
examine T(n, p, k).
Case 1: k ≥ 2 and p = k + 1. The communications are k broadcast operations and k scatter
operations. Thus,
T(n, p, k) = ck log p = c(p – 1) log p = Θ(p log p).
Case 2: k ≥ 2 and p ≥ 2k + 1.
T(n, p, k) = T(α p,k n, p – k, k) + ck log p
= T(α p–k,k α p,k n, p – 2k, k) + ck log (p – k) + ck log p
= ...
= T(α p–(i–1)k,k …α p–k,k α p,k n, p – ik, k) + ck
1
0
log ( )
i
j
p jk
−
=
∑
−
= T(α 2k+1,k …α p–k,k α p,k n, k + 1, k) + ck
( 2 1) /
0
log ( )
p k k
j
p jk
− −
=
∑
− .
Case 1 gives T(n, k + 1, k) = ck log (k + 1), and ck
( 2 1) /
0
log ( )
p k k
j
p jk
− −
=
∑
− < ck
1
log
p
i
i
=
∑
= Θ(p log p).
Thus, T(n, p, k) = O(p log p).
Case 3: k = 1. Similar to the derivation of T(n, p, k) in Case 2, it has been obtain that T(n, p, 1)
= Θ(p log p) [28].
Thus, we have T(n, p, k) = O(p log p). Moreover, since Theorem 2 gives C(n, p, k) = Θ(n/p),
Algorithm E takes Θ(n/p) + O(p log p) time. If n = Ω(p 2 log p), then n/p = Ω(p log p), and thus
Θ(n/p) + O(p log p) = Θ(n/p). Since the sequential solution for the prefix problem takes Θ(n) time,
the algorithm is cost optimal when n = Ω(p 2 log p).
The modification to Algorithm B by using broadcast and scattering is very similar to the above
modification, and results in the same communication time complexity. Thus, the resulting algorithm
is also cost optimal when n = Ω(p 2 log p).
6. Precondition of algorithms
This section shows that a stronger relation than p < n is required to use the proposed
algorithms effectively. Indeed, a stronger relation among n, p, and k is needed. Before going into the
general case, we first use two examples to shed some light.
Suppose n = 1024, p = 256, and k = 85. From Theorem 2, we have C(1024, 256, 85) < 7.
However, it is even impossible to compute the sum of 1024 inputs in 7 computation steps. Therefore,
Theorem 2 does not hold under this situation.
As a more general case, suppose n = 1024, p = 256 = kq + 1, and q ≥ 2. From Theorem 1, v =
1024(256 2 – 256k + k + 1)/(256 2 + 256k + k + 1) inputs are assigned to the first 256 – k PEs, and n –
v = 2kpn/(p 2 + kp + k + 1) = 512×1024k /(256 2 + 256k + k + 1) < 8k inputs are assigned to the last k
PEs. That is, each of the last k PEs has at most 8 inputs. Then, in phase 2, P 257–k scatters at most 8
values to at most 8 of the 256 PEs for further computation, and thus at least 248 PEs are idle when
the others are communicating and computing. This ineffective use of PEs also happens in phases 3
through k + 1.
Thus, in phase 1 at least kp inputs should be assigned to the last k PEs, which guarantees that
in later phases each PE can be assigned at least one value to compute. Using n – v ≥ kp and
Theorem 1, we see that
n – n(p 2 – kp + k + 1)/(p 2 + kp + k +1) ≥ kp,
n ≥ (p 2 + kp + k + 1)/2.
This is the stronger precondition for running the presented algorithms effectively.
7. Comparison with other algorithms
Since Algorithm L is a special case of A(n, p, k), precisely A(n, p, 1), it takes C(n, p, 1)
computation steps and R(n, p, 1) communication steps. We have shown that a larger k results in less
computation time, i.e., C(n, p, 1) > C(n, p, k) for k ≥ 2. Specifically,
C(n, p, 1) – C(n, p, k) = 2n(k – 1)(p – 1)/[(p 2 + p + 2)(p 2 + kp + k + 1)] > 0.
We have also shown that a larger k results in more communication time. To compare the
number of communication steps of A(n, p, k) and A(n, p, 1), the difference between R(n, p, 1) and
R(n, p, k), where k ≥ 2, is
R(n , p, 1) – R(n, p, k) = (p 2 – p) – (2k – 1)(p – 1)(p + k – 1)/2k
= (p – 1)(p – (2k 2 – 3k + 1))/2k.
Therefore, R(n, p, k) ≤ R(n, p, 1) when p ≥ 2k 2 – 3k + 1 and k ≥ 2. Taking both the computation and
communication times into account, when p < 2k 2 – 3k + 1, we must know the ratio of the time
required by a computation step to the time required by a communication step to decide which
algorithm is faster, but when p ≥ 2k 2 – 3k + 1 and k ≥ 2, A(n, p, k) is definitely faster.
Algorithm B can also be faster than Algorithm L. Precisely, when k ≥ 2,
R(n , p, 1) – S(n, p, k) = (p 2 – p) – ((2k – 1)p 2 /2k – p/2 + k 2 – 2k + (k + 1)/2k)
= (p 2 – kp – (2k 3 – 4k 2 + k + 1))/2k.
Therefore, when p 2 – kp > 2k 3 – 4k 2 + k + 1 and k ≥ 2, Algorithm B is faster than Algorithm L.
Lin and Lin present a parallel prefix algorithm named PLL for the half-duplex multicomputer
[33]; PLL requires 2n/p + 1.44 log 2 p – 1 computation steps and 1.44 log 2 p + 1 communication
steps when using p PEs, where 10 ≤ p < n. The number of computation steps of Algorithm A or B is
less than that of PLL, but the number of communication steps is greater than that of PLL.
Parallel algorithms requiring 2n/p + (m + 1) log m+1 p – 2 computation steps and log m+1 p
communication steps on an m-port multicomputer of p < n PEs have also been presented [38]. With
m-port communication, each PE has m input ports and m output ports to communicate with other
PEs in one communication step, for some m ≥ 1. The number of communication steps of Algorithm
E is larger than those of the previous algorithms. Thus, Algorithm E must have a C(n, p, k) that is
small enough for it to be faster than the other algorithms. Because all the other algorithms need
more than 2n/p computation steps, Algorithm E may be faster than the others when the computation
improvement is good enough to compensate for the communication disadvantage.
Note that the ratio of the time required by a computation step to the time required by a
communication step has an impact on whether a new algorithm is faster. If the ratio is large, which
may be true when the binary operation ⊕ is matrix multiplication, the new algorithms are more probable to be faster than previous ones.
8. Conclusion
This paper has presented two families of parallel algorithms, A(n, p, k) and B(n, p, k), run on
half-duplex multicomputers with p PEs to solve the prefix problem of n inputs, where n ≥ (p 2 + kp +
k + 1)/2. The numbers of computation steps and communication steps have been derived. The new
algorithms are cost optimal when n = Ω(p 3 ). When p = k + 1, the proposed algorithms take the least
computation time and the most communication time among all prefix algorithms for
multicomputers. For many possible values of the ratio of the time required by a communication step
to the time required by a computation step, the user may find one of the presented algorithms takes
the minimal running time.
Either of the two families of algorithms can be transformed into another family by adopting
broadcast and scatter operations to reduce the communication time. The resulting two families are
cost optimal when n = Ω(p 2 log p).
For all the presented algorithms, the last k PEs are idle for some amount of time in phase 1.
Thus, in phase 1, if we assign fewer than α p,k n inputs to the first p – k PEs and thus more than
n(1 – α p,k ) inputs to the last k PEs, then the first p – k PEs can do fewer computations and the last k
PEs can do more. Thus, the idle time can be eliminated. Note that since the first p – k PEs now take
less computation time in phase 1, this phase can take less time. How many inputs should be
assigned to the first p – k PEs is an open problem.
Each of the four families of algorithms provides the flexibility of choosing either less
computation time (and more communication time) or less communication time (and more
computation time) to achieve the minimal running time. The key in choosing this and, more
importantly, whether the running time is less than other prefix algorithms hinge on the ratio of the
time required by a communication step to the time required by a computation step.
Appendix. Proof of Theorem 1 We consider two cases separately.
Case 1: p = k + 1. In phase 1 the number of computation steps required by P 1 equals the
number of parallel computation steps required by the other k PEs. That is,
,
,
1
p k1.
p k
n n
n k
α − = − α −
Thus,
,
1 1
1 .
p k
k p
α = =
+ (5) Case 2: p = kq + 1, where q ≥ 2. The number of computation steps in phase 1 required by P 1
through P p–k equals that required by P p–k+1 through P p . That is,
,
(
p k,, , ) n
p kn 1.
C n p k k
k
α − = − α − (6)
Phases 2 through k + 1 require a total of (n – α p,k n)/p computation steps. Hence,
,
( , , ) (
p k,, , ) n
p kn
C n p k C n p k k
p
α − α
= − +
, ,
( )(1
,)
1 1.
p k p k p k
n n n n n p k
k p kp
α α α
− − + −
= − + = − (7)
Substituting n and p in Eq. (7) with α p,k n and p − k, respectively, we obtain
,
, ,
(1 )
( , , ) 1.
( )
p k k
p k p k
C n p k k n p
k p k
α − = α − α
−−
− (8) From Eqs. (6) and (8) we have
,
(
, , ,)
1 1,
( )
p k p k p k k p k
n n p n n
k k p k
α α α
−α
− −
− = −
−
,
,
2 .
p k
p k k
p k
p k p
α α
−= −
− − (9)
Then, let r i = ip – ki(i + 1)/2 and s i = (i + 1)p – ki(i + 1)/2. We prove by induction on i that
1 , ,
1 ,
i i p ki k
,
p k
i i p ki k
r r
s s
α α
α
− −
− −
= −
− for i ≥ 1. (10)
Base step: Since r 0 = 0, s 0 = p, r 1 = p – k, and s 1 = 2p – k, we have
1 0 ,
1 0 , ,
2 .
p k k
p k k p k k
r r p k
s s p k p
α
α α
−
− −
− −
− = − −
From Eq. (9), we have
1 0 ,
,
1 0 ,
p k k
.
p k
p k k
r r
s s
α α
α
−
−
= −
− Induction step: Assume that
1 ,
,
1 ,
t t p kt k
p k
t t p kt k
r r
s s
α α
α
− −
− −
= −
− , (11) we are to show that
1 ( 1),
,
1 ( 1),
t t p k t k
p k
t t p k t k
r r
s s
α α
α
+ − +
+ − +
= −
− . (12) Substituting p in Eq. (9) with p – kt, we obtain
,
( 1),
2( ) ( ) .
p kt k
p k t k
p kt k
p kt k p kt
α
−α
− +− −
= − − − −
Thus, Eq. (11) can be rewritten
1
( 1), ,
1
( 1),
2 2 ( )
2 2 ( )
t t
p k t k p k
t t
p k t k
p kt k r r
p kt k p kt
p kt k
s s
p kt k p kt
α α
α
−
− +
−
− +
− −
− − − − −
= − −
− − − − −
( 1), 1
( 1), 1
(2 2 ( ) ) ( )
(2 2 ( ) ) ( )
t p k t k t
t p k t k t
r p kt k p kt r p kt k
s p kt k p kt s p kt k
α α
− + −
− + −
− − − − − − −
= − − − − − − −
1 ( 1),
1 ( 1),
(2 2 ) ( ) ( )
(2 2 ) ( ) ( )
t t t p k t k
t t t p k t k
r p kt k r p kt k r p kt
s p kt k s p kt k s p kt
α α
− − +
− − +
− − − − − − −
= − − − − − − − . (13)
By definition, r t+1 = (t + 1)p – k(t + 1)(t + 2)/2, r t = tp – kt(t + 1)/2, and r t–1 = (t – 1)p – k(t – 1)t/2.
These lead to
r t−1 = r t – p + kt,
r t+1 = r t + p – kt – k.
Thus,
r t (2p – 2kt – k) – r t−1 (p – kt – k) = r t (2p – 2kt – k) – (r t – p + kt) (p – kt – k)
= (p – kt)(r t + p – kt – k)
= (p – kt)r t+1 . (14)
By definition, s t+1 = (t + 2)p – k(t + 1)(t + 2)/2, s t = (t + 1)p – kt(t + 1)/2, and s t–1 = tp – k(t – 1)t/2.
These lead to
s t−1 = s t – p + kt,
s t+1 = s t + p – kt – k.
Thus,
s t (2p – 2kt – k) – s t−1 (p – kt – k) = s t (2p – 2kt – k) – (s t – p + kt) (p – kt – k)
= (p – kt)(s t + p – kt – k)
= (p – kt)s t+1 . (15)
Using Eqs. (14) and (15), we see that Eq. (13) can be rewritten
1 ( 1),
,
1 ( 1),
( ) ( )
( ) ( ) .
t t p k t k
p k
t t p k t k
p kt r r p kt p kt s s p kt α α
α
+ − +
+ − +
− − −
= − − −
This can be reduced to Eq. (12), and thus proves Eq. (10).
Setting i = (p – k – 1)/k for Eq. (10), we have
1 1,
, i i k k
,
p k
r r s s α α
α
− +
= −
− where i = (p – k – 1)/k.
From Eq. (5), α k+1,k = 1/(k + 1). In addition, the definition of r i implies r i = r i–1 + p – ki, and the
definition of s i implies s i = s i–1 + p – ki. Thus,
1 1
,
1 1
1 1
i i
p k
i i
r p ki r k s p ki s
k α
−
−
−
−
+ − −
= +
+ − −
+
1
1
1 1 , where = ( 1) / . 1 1
i
i
k r k
k i p k k
k s k
k
−
−
+ + +
= − −
+ + +
(16)
Since i – 1 = (p – 2k – 1)/k, from the definitions of r i and s i , we have
1
1
2 1 2 1 1
2 ,
1 2 1 1
2 .
i
i
p k k p k p k
r p
k k k
p k k p k p k
s p
k k k
−
−
− − − − − −
= −
− − − − − −
= −
Thus, Eq. (16) can be written
,
2 1 2 1 1
( ) 1
1 2
1 2 1 1
( ) 1
1 2
p k
k p k k p k p k
p k
k k k k
k p k k p k p k
p k
k k k k
α
− − − − − −
− + +
= +
− − − − − −
− + +
+
2 2
2 2
2 3 1
1 2 1
2 3 1
1 2 1
k p pk k k
k k k
k p pk k k
k k k
− − − −
+ + +
= + − − −
+ + +
2 2
2 2
2 3 1
2( 1) 1
2 3 1
2( 1) 1
p pk k k
k k
p pk k k
k k
− − − −
+ + +
= + − − −
+ + +
2 2
2 2
2 3 1 2( 1)( 1)
2 3 1 2( 1)( 1)
p pk k k k k
p pk k k k k
− − − − + + +
= + − − − + + +
2 2