
A Parallel Implementation of the Message-Passing Decoder of LDPC Codes Using a Reconfigurable Optical Model

Sharareh Babvey

Department of Computer Science, Georgia State University

Atlanta, GA 30302-3994 sbabvey1@student.gsu.edu

José Alberto Fernández-Zepeda Department of Computer Science

CICESE

Ensenada, B. C. 22860, Mexico fernan@cicese.mx

Anu G. Bourgeois

Department of Computer Science Georgia State University

Atlanta, GA 30302-3994 abourgeois@cs.gsu.edu

Steven W. McLaughlin

School of Elec. and Comp. Engineering Georgia Institute of Technology

Atlanta, GA 30332 swm@ece.gatech.edu

Abstract

In this paper we propose a constant-time algorithm for parallel implementation of the message-passing decoder of Low-Density Parity-Check (LDPC) codes on the Linear Array with a Reconfigurable Pipelined Bus System (LARPBS), achieving the minimum number of processors required for a fully parallel implementation. Dynamic reconfiguration provides flexibility to code changes and efficient message routing. To decode a different code, we may simply set up the required connections between the bit-nodes and check-nodes by modifying the initialization phase of the LARPBS algorithm. No extra wiring or hardware changes are required, in contrast to other existing approaches. Moreover, the same hardware can implement the decoder in both the probability and logarithm domains. The LARPBS also reduces the number of bus cycles required for processor communications to a small constant, regardless of the code length. We illustrate that the LARPBS is an efficient and fast model for implementing the decoder.

Index Terms—Reconfigurable architectures, optical buses, LDPC codes, message-passing decoder.

1. Introduction

Low-Density Parity-Check (LDPC) codes, introduced by Gallager [1, 2], are error-control codes defined by sparse parity-check matrices. These efficient block codes have attracted a lot of attention due to their remarkable bit-error-rate versus signal-to-noise-ratio (SNR) performance and their elegant decoding scheme. The message-passing decoder is an effective iterative algorithm for decoding LDPC codes [1-3]. This decoder is based on bit-nodes (representing the codeword bits) and check-nodes (representing the parity constraints) communicating with each other by computing and sending messages. The computations may be carried out in the probability or the logarithm domain. The time complexity of the decoder grows linearly with the code length, which is significant, as efficient codes have long codewords.

There are a number of parallel implementations of the LDPC decoder in the literature that implement the decoder of an LDPC code in constant time. However, these approaches dedicate specific hardware to each node, whereas in practice the bit-nodes and check-nodes operate serially and may be implemented by the same hardware without increasing the time complexity. Moreover, these methods employ fixed connections between a bit-node and the check-nodes connected to it. Thus, decoding a different code (defined by a different parity-check matrix) requires modifying the internal connections between the nodes. An efficient parallel decoder capable of decoding any given LDPC code continues to be a challenge, particularly when fast data exchange among the nodes is required [4-6].

An important feature of a reconfigurable bus is its ability to create different interconnection topologies and its flexibility to data and problem changes. Several processor arrays based on reconfigurable pipelined optical buses have been proposed as practical parallel computing platforms. The Linear Array with a Reconfigurable Pipelined Bus System (LARPBS) is one such model [7, 8]. These models transmit messages concurrently on a bus in a pipelined fashion and can dynamically reconfigure a bus under program control to suit the communication needs [7-12]. Such systems are very efficient due to the high bandwidth available by pipelining messages [7, 10, 12].

In this paper we propose a constant-time LARPBS algorithm for implementing the message-passing decoder in the logarithm domain. We illustrate that the LARPBS exploits hardware reuse and achieves the minimum number of processors required for a fully parallel implementation. The decoder can also be realized in the probability domain with some minor computational modifications to the LARPBS algorithm. To set up the required connections between the bit-nodes and check-nodes as specified by the parity-check matrix, we only need to modify the initialization phase of the LARPBS algorithm. Thus, designing a decoder for a different LDPC code involves no hardware modifications.

This paper is organized as follows. Section 2 provides a brief introduction to the LARPBS. Section 3 introduces the LDPC codes. Section 4 presents the proposed LARPBS algorithm and Section 5 concludes the paper.

2. The Linear Array with a Reconfigurable Pipelined Bus System (LARPBS)

The Linear Array with a Reconfigurable Pipelined Bus System (LARPBS) is an optical model with three waveguides [7]. One waveguide is for carrying data (the data bus), and the other two are for addressing (the reference and select buses). Figure 1 shows the LARPBS with five processors. Each processor is connected to the buses by directional couplers. The processor at the U-turn (P4) is the head of the bus. Each processor can transmit one message to one or multiple processors, and can receive one message per bus cycle. A bus cycle is the round trip propagation time for end-to-end communication on the bus.

Figure 1. The structure of an LARPBS [7].

The data and reference buses have an extra segment of fiber between consecutive processors on the receiving segment. These are called fixed delay loops (see the lower loops in Figure 1). The select bus has conditional delays on the transmitting segment between each pair of processors (see the upper loops in Figure 1). Processor Pi controls the conditional delay between processors Pi-1 and Pi, for 1 ≤ i ≤ N. The fixed and conditional delays are all of one unit pulse-length. The LARPBS also has switches on both the transmitting and receiving segments of each processor on all three waveguides. These switches enable the LARPBS to segment the buses and form a number of smaller LARPBS arrays. Each processor locally controls the set of segment switches to its right.

There are several addressing schemes for this model. We use the coincident pulse technique, which is the most flexible and common addressing technique. It exploits the relative time delay between the signals that the processors inject on select and reference buses. When these pulses coincide at a receiver, the processor reads the corresponding data. Optical signal transmission is unidirectional and has predictable propagation delay. Thus, we are able to employ synchronized concurrent access to the optical bus in a pipelined fashion [7, 9].

Throughout this paper, we assume that no delay is introduced on the select bus. We also use the priority concurrent write rule for multiple messages arriving at one processor during one bus cycle. According to this rule, processors with higher indices have higher priority; a processor reads only the first message it receives, which comes from the sender with the highest priority in its array, and ignores the rest.
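The following is a minimal software sketch of this priority write rule, not the paper's bus hardware; the function name and the (sender, receiver, message) representation are illustrative assumptions.

```python
def priority_write(transfers):
    """transfers: list of (sender_index, receiver_index, message) for one bus cycle."""
    received = {}
    for sender, receiver, msg in transfers:
        # Higher sender index wins; lower-priority writes to the same receiver are ignored.
        if receiver not in received or sender > received[receiver][0]:
            received[receiver] = (sender, msg)
    return {r: msg for r, (s, msg) in received.items()}

# Example: processors 2, 9 and 5 all write to processor 1 in the same bus cycle;
# processor 1 keeps only the message from processor 9.
print(priority_write([(2, 1, "a"), (9, 1, "b"), (5, 1, "c")]))  # {1: 'b'}
```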

3. LDPC Codes

Low-Density Parity-Check (LDPC) codes are linear binary block codes defined by a sparse parity-check matrix [1-3]. An M × N parity-check matrix defines an LDPC code of length N, where each codeword satisfies M parity-check constraints.

Definition 1: A parity-check matrix HM×N with exactly k ones in each row and exactly j < k ones in each column defines a regular (j, k) LDPC code, where j and k are small constants compared to N. Matrix H is binary and sparse; in other words, only a small fraction of the entries of H are one and the rest are zero [1-3].

For example, the parity-check matrix H in Relation 1 defines a regular (2, 4) LDPC code. Each codeword c = (c1 c2 c3 c4 c5 c6) is of length 6 and satisfies three parity-check constraints. For instance, the first constraint, defined by the first row of H, specifies that c1 ⊕ c3 ⊕ c4 ⊕ c6 = 0.

H = [ 1 0 1 1 0 1
      0 1 1 0 1 1
      1 1 0 1 1 0 ]          (1)

An efficient visualization of the parity-check matrix is the Tanner graph. Figure 2 shows the Tanner graph for the LDPC code in the above example. The upper nodes (circles) are called bit-nodes and depict the codeword bits. Similarly, the lower nodes (boxes) are called check-nodes and depict the parity-check constraints. The graph has an edge between a given bit-node j and check-node i if parity-check i includes bit j (that is, if H(i, j) = 1). The number of edges of the Tanner graph is equal to the number of ones in H. Equation 2 gives this number, E, for a regular (j, k) parity-check matrix HM×N.

E = N × j = M × k          (2)
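For the example H in Relation 1, Equation 2 gives E = 6 × 2 = 3 × 4 = 12. The short Python sketch below (illustrative only, not part of the LARPBS algorithm) counts the ones of H and checks this.

```python
# Count the ones of H from Relation 1 and check Equation 2 (E = N*j = M*k).
H = [
    [1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0],
]
M, N = len(H), len(H[0])      # 3 parity checks, codeword length 6
j, k = 2, 4                   # column weight and row weight of the regular (2, 4) code

E = sum(sum(row) for row in H)
print(E, N * j, M * k)        # 12 12 12, as Equation 2 predicts
```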


The message-passing algorithm is an iterative algorithm for decoding LDPC codes [1-3]. It is based on passing messages among the bit-nodes and check-nodes of the Tanner graph. The upward messages from check-node m, 1 ≤ m ≤ M, to bit-node j ∈ Nm are denoted by um,j, where Nm is the set of all the bit-nodes connected to check-node m in the Tanner graph. Each um,j is a probability measure that indicates which value of bit j (1 or 0) satisfies parity-check m. This probability measure is computed based on the messages received from all the bit-nodes connected to check-node m, excluding bit-node j.

Similarly, the downward messages from bit-node n, 1 ≤ n ≤ N, to check-node i ∈ Mn are denoted by vi,n, where Mn is the set of all the check-nodes connected to bit-node n (Figure 2). Each vi,n is a probability measure indicating whether bit n is one or zero. This measure is computed based on the messages received from all the check-nodes connected to bit-node n, excluding check-node i.

[Figure 2: Tanner graph with bit-nodes r1-r6 (circles, top) and check-nodes 1-3 (boxes, bottom); the edges carry the messages ui,j and vi,j. For this code, N1 = {1, 3, 4, 6} and M1 = {1, 3}.]

Figure 2. The Tanner graph for the regular (2, 4) LDPC code defined by H in Relation 1.

4. Implementing the Decoder Using the LARPBS Model

We employ an E-processor LARPBS to implement the message-passing decoder for the regular (j, k) LDPC code defined by any sparse parity-check matrix HM×N in constant time. E is the number of ones in matrix H; since H is sparse, E is only a small fraction of M × N (Definition 1). For a Tanner graph with E edges, each update requires E message computations. The check-node and bit-node updates run serially, as each update step requires the output of the preceding one. Thus, E is the minimum number of processors required for a fully parallel implementation.

We number the check-nodes, bit-nodes, and LARPBS processors in ascending order from left to right. Pi denotes the ith LARPBS processor, for 1 ≤ i ≤ E. The algorithm output for bit n is the a posteriori log-likelihood ratio λn, and the inputs are as follows.

1. R, the channel output.

2. σ², the variance of the channel noise.

3. WCol, the column weight of matrix H. WCol is equal to j for a regular (j, k) LDPC code. (Note that the LARPBS is able to compute the column weight in one bus cycle, using its conditional delay switches. However, for simplicity we consider the column weight as one of the inputs.)

4. Edge-list1: An array that includes the locations of the ones in matrix H, in row-major order. It lists the edges of the Tanner graph in the order in which they connect to the check-nodes.

For example, edge-list1 for the Tanner graph in Figure 2 is:

Edge-list1 = (1, 1), (1, 3), (1, 4), (1, 6), (2, 2), (2, 3), (2, 5), (2, 6), (3, 1), (3, 2), (3, 4), (3, 5).

The processors read the elements of edge-list1 as their input and build another array, namely edge-list2. This array includes the locations of the ones in matrix H, in column-major order, so edge-list2 lists the edges of the Tanner graph in the order in which they connect to the bit-nodes. For example, edge-list2 for the Tanner graph in Figure 2 is: Edge-list2 = (1, 1), (3, 1), (2, 2), (3, 2), (1, 3), (2, 3), (1, 4), (3, 4), (2, 5), (3, 5), (1, 6), (2, 6).
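Both edge lists can be read directly off H. The following plain Python sketch (serial code with illustrative variable names, not the LARPBS procedure) reproduces the two lists given above for the example matrix.

```python
H = [
    [1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0],
]
M, N = len(H), len(H[0])

# Row-major scan: edges in the order in which they connect to the check-nodes.
edge_list1 = [(r + 1, c + 1) for r in range(M) for c in range(N) if H[r][c] == 1]
# Column-major scan: edges in the order in which they connect to the bit-nodes.
edge_list2 = [(r + 1, c + 1) for c in range(N) for r in range(M) if H[r][c] == 1]

print(edge_list1)  # (1,1), (1,3), (1,4), (1,6), (2,2), (2,3), (2,5), (2,6), (3,1), (3,2), (3,4), (3,5)
print(edge_list2)  # (1,1), (3,1), (2,2), (3,2), (1,3), (2,3), (1,4), (3,4), (2,5), (3,5), (1,6), (2,6)
```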

To implement the decoder, each processor in the LARPBS updates one upward message and one downward message per iteration. Each processor Pi simulates edge i of edge-list1 during the check-node update. Using this scheme, each group of consecutive processors on the LARPBS has the data for all the edges connected to one of the check-nodes and can segment the bus such that each segment represents that check-node.

Similar logic applies to the bit-node update. During the initialize stage, the LARPBS reorders the edges in edge-list1 to build another array called edge-list2. Each processor Pi simulates edge i of edge-list2 during the bit-node update.

Each group of consecutive processors on the LARPBS has the data for all the edges connected to one of the bit-nodes and can segment the bus such that each segment represents that bit-node.

A processor requires the message from the preceding update step before it can update its assigned message.

However, edge i in edge-list1 is not necessarily the same as edge i in edge-list2. Thus, processor Px may simulate a given edge during the check-node update, while processor Py simulates it during the bit-node update. To complete the next update for this edge, Py should send its updated downward message to Px, and similarly Px should send its upward message to Py. Two local variables, namely bit-index and check-index, provide the required addressing. Processor Py sets its local variable check-index to x, and processor Px sets its local variable bit-index to y, so that they can address each other upon completion of the bit-node and check-node updates, respectively.

For a better picture of the local variable bit-index, recall that each processor holds an edge in its local variable edge1. Each processor sets its local variable bit-index to the index of this edge in the edge-list2 array. This index is a unique number in the interval [1, E], where E is the number of edges. Edges with larger column numbers have larger indices in the edge-list2 array, and hence larger bit-index values.

Consider H in Relation 1, and its edge-list1 and edge-list2. Processor P2, for example, holds edge (1, 3), which is element 5 in edge-list2. Therefore, the value of bit-index for P2 is 5. The list made by the local variable bit-index of all the processors for the above example is: Bit-index = [1 5 7 11 3 6 9 12 2 4 8 10].

Similarly, each processor sets its check-index to the index of its edge2 in the edge-list1 array. This list has the property that if bit-index(j) = i then check-index(i) = j, for all i, j in the interval [1, E]. The list made by the local variable check-index of all the processors for the above example is: Check-index = [1 9 5 10 2 6 3 11 7 12 4 8]. In the following sections we present the algorithm along with its time complexity and clarify it with an example.
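For the example, the two index lists can be reproduced directly from the edge lists. The Python sketch below is a serial illustration (not the one-bus-cycle LARPBS procedure) and also checks that bit-index and check-index are mutually inverse permutations.

```python
edge_list1 = [(1,1), (1,3), (1,4), (1,6), (2,2), (2,3), (2,5), (2,6), (3,1), (3,2), (3,4), (3,5)]
edge_list2 = [(1,1), (3,1), (2,2), (3,2), (1,3), (2,3), (1,4), (3,4), (2,5), (3,5), (1,6), (2,6)]

# bit_index[i] is the 1-based position of edge-list1's i-th edge inside edge-list2;
# check_index is the analogous lookup in the opposite direction.
bit_index   = [edge_list2.index(e) + 1 for e in edge_list1]
check_index = [edge_list1.index(e) + 1 for e in edge_list2]

print(bit_index)    # [1, 5, 7, 11, 3, 6, 9, 12, 2, 4, 8, 10]
print(check_index)  # [1, 9, 5, 10, 2, 6, 3, 11, 7, 12, 4, 8]

# The two lists are mutually inverse permutations (1-based indices).
assert all(check_index[bit_index[i] - 1] == i + 1 for i in range(len(edge_list1)))
```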

4.1 The LARPBS Algorithm: Initialize

Figure 3 shows how the variables are initialized using the following steps.

Read the inputs. This step initializes the local variables σ², WCol, edge1, and rank for the LARPBS processors and clears their local flag marked.

Initialize col-rank. Each processor computes the column rank for the edge it holds in its local variable edge1. The column rank is a number in the [1, WCol] interval, indicating the order of the edges within a given column. Figure 3 shows how the edges are ranked from the bottom to the top of each column; in this figure, * is a wildcard meaning any node.

Each processor holding edge (*, c), for 1 ≤ c ≤ N, in its local variable edge1 sends its index to processor Pc. According to the priority write rule, each of the processors P1, P2, ..., PN receives only the highest index sent to its address. It then sends its local variable rank to the processor at this index, and decreases rank by one. These steps require two bus cycles. The receiving processors save the message in their local variable col-rank and mark themselves to stop sending their indices at the next run of the loop. This way, other processors on the same column also have a chance to set their col-rank. Each time the loop executes, all edges of the same column rank are processed.

The column rank varies between 1 and WCol. Thus, the loop runs WCol times, where WCol is a small constant for a sparse matrix H; many efficient LDPC codes use WCol equal to 3. Thus, the LARPBS initializes col-rank for all the processors in O(1) time. The column ranks of all the edges, as they appear in edge-list1 for the Tanner graph in Figure 2, are: Col-rank = [1 1 1 1 1 2 1 2 2 2 2 2]. Note that WCol is 2 in this example.

Initialize bit-index. Each processor computes its local variable bit-index in constant time using Equation 3, where nc and rc are the column number and column rank of the edge that the processor holds in its local variable edge1, and WCol is the column weight of the parity-check matrix.

bit-index = (nc - 1) × WCol + rc          (3)

Initialize check-index. Each processor sends its index to the address in its local variable bit-index (i.e., to processor Pbit-index) and saves the received message in check-index. The LARPBS completes this step in one bus cycle.
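The following serial Python sketch illustrates Equation 3 for the example: it ranks the edges of each column by increasing row number, which reproduces the Col-rank list above, and then applies Equation 3 to obtain the bit-index values. It mimics the result of the LARPBS steps, not their bus-level implementation.

```python
edge_list1 = [(1,1), (1,3), (1,4), (1,6), (2,2), (2,3), (2,5), (2,6), (3,1), (3,2), (3,4), (3,5)]
W_col = 2  # column weight j of the regular (2, 4) code

# Rank the edges of each column in order of increasing row number.
col_rank = []
for row, col in edge_list1:
    rank = sum(1 for (r, c) in edge_list1 if c == col and r <= row)
    col_rank.append(rank)

# Equation 3: bit-index = (nc - 1) * WCol + rc.
bit_index = [(col - 1) * W_col + rank for (row, col), rank in zip(edge_list1, col_rank)]

print(col_rank)   # [1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2]
print(bit_index)  # [1, 5, 7, 11, 3, 6, 9, 12, 2, 4, 8, 10]
```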

Initialize edge2, R, uP, and vP. The LARPBS employs the bit-index values to find the local variable edge2, in one bus cycle. Then, the processors initialize their upward and downward messages.

[Figure 3 (pseudocode, not reproduced here) lists the initialize stage: the head of the bus reads and broadcasts the inputs; each processor reads its element of edge-list1; the column rank of each edge is found using the priority write rule; the local variables bit-index, check-index, edge2, R, uP, and vP are initialized; and the segmenting processors for the bit-node and check-node updates are marked.]

Figure 3. The initialize stage for the proposed algorithm.

Mark the segmenting processors. As mentioned earlier, the processors can segment the bus such that each segment represents a bit-node or a check-node. A processor should segment the bus during the bit-node update if it holds edge (*, i) in its local variable edge2 and the next processor to its right holds edge2 = (*, i+1). Figure 3 shows how the LARPBS marks such processors in just one bus cycle. The same logic, applied to the local variable edge1, marks the segmenting processors for the check-node update.

Note that the processors require one bus cycle per step to exchange data during the following steps: read the inputs, initialize check-index, and initialize edge2 (Figure 3). Two bus cycles are required for marking the segmenting processors, and 2 × WCol bus cycles for finding the column rank, where WCol is a small constant, as explained earlier in this section.

4.2 The LARPBS Algorithm: Iterate

In the second stage of the algorithm the downward and upward messages are updated until the algorithm converges. Figure 4 shows the iterate stage for the proposed algorithm; uP and vP are the upward and downward messages for processor P. Formulas 4-7 are revised from the original algorithm [3] to simplify routing the required messages among the processors. In this scheme, the processors compute and route a product Πi for each check-node and a sum Si for each bit-node, instead of routing individual messages.

Check-node update. Each processor P uses its local variable check-index to send vP for its assigned edge to the processor that updates uP for that edge. Then, the processors with a check-segment flag equal to one segment the bus.

After forming the proper segments, the processors in each segment i team up to compute the product Πi, using tree-structured multiplication [7]. The head of each segment holds the result and broadcasts it to every processor in the segment, each of which stores it in its local variable Π. Finally, each processor P computes its upward message uP by excluding its local data vP from its received product.

Bit-node update. Similarly, each processor P uses its local variable bit-index to send the upward message uP for its assigned edge to the processor that updates the downward message vP for that edge. Then, the processors segment the bus properly using their bit-segment flags, and the processors in each segment i team up to compute the sum Si, using tree-structured summation [7]. The head of each segment holds the result and broadcasts it to every processor in the segment, each of which stores it in its local variable S. Finally, each processor P computes its downward message vP by excluding its local data uP from its received sum.

The algorithm converges after lmax iterations, when there are no considerable changes in the output values or we are within the desired degree of result accuracy. In practice, this happens after a constant number of iterations (usually no more than 10 or so). After convergence, the head of each segment i holds λi, the output value for bit-node i. This output is used to decode bit i of the codeword.

Note that once the upward messages are updated, the same processors update the downward messages. This hardware reuse reduces the number of LARPBS processors. Also, for regular (j, k) LDPC codes the processors require 1 + log(k) and 1 + log(j) bus cycles to exchange data during the check-node and bit-node updates, respectively, where k and j < k are small constants compared to the code length (Definition 1).
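For reference, the sketch below performs one such iteration serially in plain Python, using the standard log-domain sum-product updates of [3] that Formulas 4-7 revise; the per-check product and per-bit sum mirror the Πi and Si routing described above. The channel LLR values (which the decoder would form from R and σ²) are invented here for illustration, and tanh saturation and the vP = 0 corner case are not handled, so this is a sketch rather than a production decoder.

```python
import math

edges = [(1,1), (1,3), (1,4), (1,6), (2,2), (2,3), (2,5), (2,6),
         (3,1), (3,2), (3,4), (3,5)]          # edge-list1 for H in Relation 1
M, N = 3, 6
llr = [0.8, -1.1, 0.3, 1.5, -0.2, 0.9]        # example channel LLRs (invented)

v = {e: llr[e[1] - 1] for e in edges}          # downward messages, initialized to the channel LLRs
u = {e: 0.0 for e in edges}                    # upward messages

# Check-node update: one product per check, then exclude each edge's own term.
for m in range(1, M + 1):
    prod = math.prod(math.tanh(v[e] / 2.0) for e in edges if e[0] == m)
    for e in (e for e in edges if e[0] == m):
        u[e] = 2.0 * math.atanh(prod / math.tanh(v[e] / 2.0))

# Bit-node update: one sum per bit (channel LLR plus all incoming u), then
# exclude each edge's own term; the full sum serves as the output lambda_n.
lam = [0.0] * N
for n in range(1, N + 1):
    total = llr[n - 1] + sum(u[e] for e in edges if e[1] == n)
    lam[n - 1] = total
    for e in (e for e in edges if e[1] == n):
        v[e] = total - u[e]

print([round(x, 3) for x in lam])
```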

[Figure 4 (pseudocode, not reproduced here) lists the iterate stage. In the check-node update, each processor sends vP to the processor given by its check-index, the bus is segmented by the check-segment flags, each segment computes its product Πi by tree-structured multiplication (Formula 4), the head broadcasts Πi, and each processor derives uP from Πi and its own vP via the 2·tanh⁻¹(·) relation (Formula 5). In the bit-node update, each processor sends uP to the processor given by its bit-index, the bus is segmented by the bit-segment flags, each segment computes its sum Si by tree-structured summation (Formula 6), the head broadcasts Si, and each processor derives vP from Si and its own uP (Formula 7); the head of segment i also holds λi.]

Figure 4. The iterate stage for the proposed algorithm.

4.3 Time Analysis

Theorem 1: An E-processor LARPBS implements the message-passing decoder for the regular (j, k) LDPC code defined by any sparse parity-check matrix HM×N in constant time, where j < k and E is the number of ones in H (Equation 2).

Proof: Consider the proposed algorithm. As explained in Section 4.1, the LARPBS initializes all the variables in constant time. The algorithm converges after a constant number of iterations lmax (Section 4.2). Each iteration of the algorithm (Figure 4) includes computing the product Π, the sum S, and updating the messages. For a regular (j, k) LDPC code, Π and S are the product of at most k numbers and the sum of at most j < k numbers, respectively. The LARPBS employs tree-structured operations to compute these values in at most O(log k) time [7], where k is a constant (Definition 1). Then, each processor updates its message and sends it to the next required processor in one bus cycle. As a result, the proposed algorithm runs in constant time. ∎

4.4 Decoder Example

Figure 5 shows the decoder for the LDPC code defined by matrix H in Relation 1. The LARPBS has 12 processors.

Recall that each processor simulates one edge during the check-node update and one during the bit-node update.

These edges are not necessarily the same. For example, consider edge (3, 1). It is the ninth element in edge-list1 and the second element in edge-list2 in this example. Thus, processors P9 and P2 compute the messages for edge (3, 1) during the check-node and bit-node updates, respectively.

These processors use their local variables bit-index and check-index to communicate their messages for edge (3, 1).

Therefore, P9 sets its bit-index to 2 and P2 sets its check-index to 9. Figure 5 shows bit-index and check-index for all the processors.

[Figure 5 (not reproduced here) shows the 12-processor LARPBS decoder for this example, with the edges of the Tanner graph assigned to processors P1-P12 and the resulting local variables Check-index = [1 9 5 10 2 6 3 11 7 12 4 8] and Bit-index = [1 5 7 11 3 6 9 12 2 4 8 10].]

Figure 5. The decoder of a regular (2, 4) LDPC code.

5. Conclusions

We proposed an LARPBS algorithm for implementing the message-passing decoder of the regular (j, k) LDPC code defined by any sparse parity-check matrix HM×N, in constant time, using the minimum number of processors required for a fully parallel scheme. The LARPBS algorithm exploits hardware reuse to decrease the number of processors. It is flexible to code changes and employs dynamic reconfiguration to create the proper interconnection topologies between the nodes, as required by the code's parity-check matrix. Only a small constant number of bus cycles is required for efficient data exchange among the processors. Moreover, the same hardware can implement the decoder in both the probability and logarithm domains.

References:

[1] R. G. Gallager, “Low-Density Parity-Check Codes,” IRE Trans. Inform. Theory, vol. IT-8, pp. 21-28, Jan. 1962.

[2] R. G. Gallager, Low-Density Parity-Check Codes, MIT Press, Cambridge, MA, 1963.

[3] J. R. Barry, “Low-Density Parity-Check Codes,” http://users.ece.gatech.edu/~barry/6606/handouts/ldpc.pdf.

[4] G. Al-Rawi, J. Cioffi, R. Motwani, and M. Horowitz, “Optimizing iterative decoding of low-density parity check codes on programmable pipelined parallel architectures,” IEEE Globecom Conf., vol. 5, pp. 3012-3018, Nov. 2001.

[5] A. J. Blanksby and C. J. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder,” IEEE Journal of Solid-State Circuits, vol. 37, pp. 404-412, March 2002.

[6] B. Levine, R. R. Taylor, and H. Schmit, “Implementation of near Shannon limit error-correcting codes using reconfigurable hardware,” IEEE Symp. on Field-Programmable Custom Computing Machines, pp. 217-226, April 2000.

[7] R. Vaidyanathan and J. L. Trahan, Dynamic Reconfiguration: Architectures and Algorithms, Kluwer Academic Publishers, 2003.

[8] J. L. Trahan, A. G. Bourgeois, Y. Pan, and R. Vaidyanathan, “An Optimal and Scalable Algorithm for Permutation Routing on Reconfigurable Linear Arrays with Optically Pipelined Buses,” Journal of Parallel and Distributed Computing, vol. 60, pp. 1125-1136, 2000.

[9] Z. Guo, R. Melhem, R. Hall, D. Chiarulli, and S. Levitan, “Array processors with pipelined optical buses,” Journal of Parallel and Distributed Computing, vol. 12, pp. 269-282, 1991.

[10] K. Nakano, “A Bibliography of Published Papers on Dynamically Reconfigurable Architectures,” Parallel Processing Letters, vol. 5, pp. 111-124, 1995.

[11] C. Qiao and R. Melhem, “Time-division optical communications in multiprocessor arrays,” IEEE Trans. Comput., vol. 42, pp. 577-590, 1993.

[12] S. Pavel and S. G. Akl, “Integer sorting and routing in arrays with reconfigurable optical buses,” Proc. Int'l Conf. on Parallel Processing, pp. III-90-III-94, 1996.
