
Implementing LDPC Decoding on Network-On-Chip T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin

Penn State University

{theochar, link, vijay, mji}@cse.psu.edu

Abstract

Low-Density Parity Check codes are a form of Error Correcting Codes used in various wireless communication applications and in disk drives. While LDPC codes are desirable due to their ability to achieve near Shannon-limit communication channel capacity, the computational complexity of the decoder is a major concern.

LDPC decoding consists of a series of iterative computations derived from a message-passing bipartite graph. To efficiently support the communication-intensive nature of this application, we present an LDPC decoder architecture based on a network-on-chip communication fabric that provides a 1.2Gbps decoded throughput for a 3/4 code rate, 1024-bit block LDPC code. The proposed architecture can be reconfigured to support other LDPC codes of different block sizes and code rates. We also propose two novel power-aware optimizations that reduce power consumption by up to 30%.

1. Introduction

Low Density Parity Check (LDPC) codes are a form of error correcting codes using iterative decoding, with the advantage that they can achieve near Shannon-limit communication channel capacity [1, 2]. They offer excellent decoding performance as well as good block error performance. There is almost no error floor and they are decodable in linear time. Consequently, LDPC codes are used for various wireless communications and for reliable high-throughput disk data transfers [2]. Designing power-efficient LDPC encoders and decoders with a low bit-error rate (BER) in low signal-to-noise ratio (SNR) channels is critical for these environments.

This paper presents a power-efficient reconfigurable architecture for an LDPC decoder which achieves a high throughput even in noisy channels.

[Figure 1: a parity check matrix H, the corresponding bipartite graph of check nodes and bit nodes, and its mapping onto the network on chip.]

Figure 1: H-Matrix – Bipartite Graph – Network on Chip

An LDPC code is a linear message encoding technique, defined by a set of two very sparse parity check matrices, G and H. The message to be sent is encoded using the G matrix. When it reaches its destination, it is decoded using the H matrix. In contrast to the encoder, LDPC decoding is computationally intensive and consists of iterative computations derived from a message-passing bipartite graph (shown in Figure 1). Message passing iterations are performed by the two computation units, the bit node and the check node [2]. The LDPC algorithm is communication intensive, so the selection of the interconnect structure of the decoder architecture is critical.

LDPC codes offer excellent communication over noisy channels, hence the design of LDPC decoders has been heavily researched. Researchers in [3-7] conclude that the primary problems in hardware implementations are the high memory requirements per node to store computed results, and the complicated interconnect structures. Existing implementations addressing these problems are either limited in the types of LDPC codes supported, or constrained by the hardware architecture (FPGA implementation).

In this paper, we propose an LDPC decoder architecture based on a network-on-chip interconnect [9] to efficiently support the communication-intensive nature of the application. This architecture exploits the inherent parallelism of the decoding algorithm and has been implemented using 160nm CMOS technology. The resulting decoder, operating at 500MHz, is shown to provide an effective decoded throughput of 1.2Gbps for a 1024-bits-per-block, 3/4 code rate LDPC code. The architecture can be reconfigured to support other code rates and block sizes based on the desired error protection demanded by the application.

Similarly, the architecture supports both regular and irregular LDPC codes through the use of reprogrammable configuration memories. The architecture also includes power reduction optimizations that exploit the intrinsic properties of the decoding algorithm. We demonstrate that these techniques reduce the power consumption of the decoder by 30.11% in the above configuration.

2. LDPC Decoder Architecture

2.1 Algorithm Overview

We now briefly review the operations performed by an LDPC decoder. Details of the algorithm can be found in [1-3, 8]. The decoder consists of two major computation units, the bit node and the check node, which iteratively communicate with each other with the common objective of decoding a message bit. When a message is about to be decoded, for every message bit, the corresponding bit node receives an initial bit likelihood ratio, defined as the ratio of the probability of the bit being 1 to the probability of the bit being 0. The computation starts from the bit node. Each bit node receives the initial bit likelihood ratio, passing it to every check node that the current bit is associated with. Each check node receives a set of likelihood ratios (lri_old) from the various bit nodes it is associated with. Once the entire set arrives, the check node computes and sends new likelihood ratios (lri_new) to each bit node. Computing the operation in the log domain results in significant advantages [8], hence the likelihood ratio becomes a logarithmic likelihood ratio (llr). The check node is responsible for computing the llr for each bit participating in a check. The new llr depends on the llrs of the other bits participating in that check. The check node operation is shown in Eq. 1. Details of the operation are described in [3].

$$llr\_new_i = B\Big(\sum_{n \neq i} B(llr\_old_n)\Big) = B\big(B(llr\_old_0) + B(llr\_old_1) + \cdots + B(llr\_old_n) - B(llr\_old_i)\big) \qquad \text{(Eq. 1 – Check Node Operation)}$$

The function B(llr) is the logarithmic bilinear transform,

$$B(llr\_old_i) = \ln\!\left(\frac{e^{llr\_old_i} + 1}{e^{llr\_old_i} - 1}\right).$$

This function is implemented using a ROM-based look-up table (LUT) for the check node computation, with an overall sign-magnitude data representation for the entire computation, based on our prior work described in [8].

[This work was supported in part by grants from NSF CAREER 0093085, NSF 0130143, NSF 0202007 and MARCO/DARPA GSRC-PAS.]

Proceedings of the 18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design (VLSID’05) 1063-9667/05 $20.00 © 2005 IEEE

The check node computes the new llrs for all bits participating in each check, and sends each result back to the originating bit node. The bit node, in turn, accumulates the new llrs it receives from all checks that the bit participates in, excluding the current check, until the llr converges towards either negative or positive infinity. Convergence to positive/negative infinity translates to the decoded bit being either 0 or 1, respectively. The operation (log domain) is outlined in Eq. 2:

$$llr\_new_i = stored\_llr + \sum_{n \neq i} llr_n = stored\_llr + llr_0 + llr_1 + \cdots + llr_n - llr_i \qquad \text{(Eq. 2 – Bit Node Operation)}$$

The term stored_llr describes the previously stored logarithmic likelihood ratio for that bit, or the initial logarithmic likelihood ratio if this is the first iteration. The rest are the results from each check node that the bit participates in. The new llr is stored in the bit node and also sent back to the participating checks to complete the computation. The bit node is responsible for controlling the entire operation on each bit, as it controls the number of iterations necessary for convergence.
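The two node operations above can be sketched in floating point as follows. This is our own illustrative Python, not the hardware design: the fixed-point representation, LUT, and pipelining details are omitted, and both functions use the accumulate-then-subtract form implied by Eq. 1 and Eq. 2.

```python
import math

def B(x: float) -> float:
    # Logarithmic bilinear transform B(llr) = ln((e^llr + 1) / (e^llr - 1)),
    # defined for positive llr magnitudes; the hardware's sign-magnitude
    # representation handles the sign separately.
    return math.log((math.exp(x) + 1.0) / (math.exp(x) - 1.0))

def check_node(llrs_old):
    # Eq. 1: llr_new_i = B(sum over n != i of B(llr_old_n)).
    # Accumulate all B(llr) terms once, then subtract the excluded term.
    total = sum(B(x) for x in llrs_old)
    return [B(total - B(x)) for x in llrs_old]

def bit_node(stored_llr, llrs_from_checks):
    # Eq. 2: llr_new_i = stored_llr + sum over n != i of llr_n,
    # again computed as an accumulated sum minus the excluded term.
    total = stored_llr + sum(llrs_from_checks)
    return [total - x for x in llrs_from_checks]
```

Because B is its own inverse on positive arguments (B(B(x)) = x), a check with exactly two participating bits simply exchanges their llr magnitudes, which is a convenient sanity check for the implementation.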

A critical design parameter which affects both the performance and the power consumption of the decoder is the bit width of the data word used in the computation. A large data word results in a lower BER even in noisy channels with an SNR of 0.5 dB, but at the cost of increased power consumption. Power consumption is therefore dependent on the amount of noise expected in the channel, since to achieve a low BER we need both a more precise number representation and an increased number of iterations of the algorithm to achieve convergence. Note that the power optimizations we present here can be applied to all designs, regardless of bit width. Our test case uses a 16-bit sign-magnitude fixed-point representation.
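A sign-magnitude fixed-point encoding like the 16-bit one used in our test case can be sketched as below. The paper fixes only the word width, so the fractional-bit split used here (11 bits) is an assumption chosen purely for illustration.

```python
def to_sign_magnitude(value: float, width: int = 16, frac_bits: int = 11) -> int:
    # Sign-magnitude fixed point: the MSB is the sign bit, the remaining
    # width-1 bits hold the magnitude. Magnitudes saturate at the maximum.
    max_mag = (1 << (width - 1)) - 1
    mag = min(int(round(abs(value) * (1 << frac_bits))), max_mag)
    sign = 1 if value < 0 else 0
    return (sign << (width - 1)) | mag

def from_sign_magnitude(word: int, width: int = 16, frac_bits: int = 11) -> float:
    # Decode the sign bit and scale the magnitude back to a real value.
    sign = -1.0 if word >> (width - 1) else 1.0
    return sign * (word & ((1 << (width - 1)) - 1)) / (1 << frac_bits)
```

Note that in this representation the saturated positive and negative extremes (0x7FFF and 0xFFFF) differ only in the sign bit, a property exploited by the message encoding of Section 4.1.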

2.2 Network on Chip (NoC) Architecture

The decoder is implemented as an on-chip network, where the bit and check nodes act as processing elements (PEs) that communicate via on-chip network routers, as shown in Figure 1. In the example shown, the network consists of 25 nodes, 16 bit nodes and 9 check nodes, communicating in a 2D mesh topology. By providing configuration memory to each PE, we can map more than one node onto a single PE. When there are more virtual nodes than can be mapped at once onto a single chip, we use the serial approach outlined in [3].

The architecture is initialized to reflect the mapping of the nodes of the bipartite graph onto the physical nodes. For the given LDPC code, the H matrix is used to generate the configuration data which includes the number of virtual nodes mapped on to a PE and the connectivity information capturing the communication pattern between the different nodes. Each PE has a dedicated memory to store this configuration information. The configuration information is routed from the I/O to the individual PEs using packet based communication.
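As a rough illustration of how configuration data can be derived from H, the sketch below assigns virtual nodes to PEs and records each node's partners. The function name and the round-robin placement are our assumptions; the actual design uses an optimized mapping rather than round-robin.

```python
def h_matrix_config(H, n_bit_pes, n_check_pes):
    # Rows of H are checks, columns are bits. Each PE's configuration lists
    # its virtual nodes together with their connectivity (partner indices).
    n_checks, n_bits = len(H), len(H[0])
    bit_cfg = {pe: [] for pe in range(n_bit_pes)}
    check_cfg = {pe: [] for pe in range(n_check_pes)}
    for b in range(n_bits):          # each virtual bit node records its checks
        partners = [c for c in range(n_checks) if H[c][b]]
        bit_cfg[b % n_bit_pes].append((b, partners))
    for c in range(n_checks):        # each virtual check node records its bits
        partners = [b for b in range(n_bits) if H[c][b]]
        check_cfg[c % n_check_pes].append((c, partners))
    return bit_cfg, check_cfg
```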

After initialization, computation enters the decoding stage, where data comes into the network in the form of logarithmic likelihood ratios. The bit nodes then load the incoming data into their memories and initialize the computation, which develops into the iterative message passing between check and bit nodes. When the specified number of iterations has completed (and hence the values of individual bits have converged), the bit node outputs the bit value to the I/O. The on-chip communication is detailed in the next subsection.

2.3 On-Chip Communication

Inter-PE communication is handled by an on-chip network consisting of a number of small on-chip routers. In our implementation of the network, a simple 2D mesh topology is used, with deterministic XY routing as the routing algorithm. As losing a single data value is catastrophic to the LDPC computation, our network sends on/off backpressure information, preventing packet drops. This inability to drop packets removes the need for complicated ACK/NACK based protocols, and reduces storage and energy requirements throughout the system. Each packet contains not only a physical destination address in the network (which requires 5 bits in the 25-node example network), but also 6 bits of virtual identification information that identifies which of the 64 virtual nodes in a physical node the packet is destined for. In addition, the packet contains virtual identification information that identifies from which particular partnered virtual node the packet was sent, allowing return address information to be extracted. As we assume that a given node will never communicate with more than 16 partners, this requires an additional 4 bits. Finally, a single bit is reserved to mark configuration packets, for a total header size of 16 bits per packet.
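The 16-bit header described above can be sketched as a bit-packing routine. The paper specifies only the field widths (5 + 6 + 4 + 1 bits), so the ordering of the fields within the header is our assumption.

```python
def make_packet(dest: int, vdest: int, vsrc: int, config: bool,
                payload: int) -> int:
    # Pack a 48-bit packet: a 16-bit header (5-bit physical destination,
    # 6-bit virtual destination, 4-bit virtual source, 1 configuration flag)
    # followed by a 32-bit payload.
    assert 0 <= dest < 32 and 0 <= vdest < 64 and 0 <= vsrc < 16
    assert 0 <= payload < (1 << 32)
    header = (dest << 11) | (vdest << 5) | (vsrc << 1) | int(config)
    return (header << 32) | payload

def parse_packet(pkt: int):
    # Recover (dest, vdest, vsrc, config, payload) from a 48-bit packet.
    payload = pkt & 0xFFFFFFFF
    header = pkt >> 32
    return (header >> 11, (header >> 5) & 0x3F, (header >> 1) & 0xF,
            bool(header & 1), payload)
```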

Each packet consists of a single 48-bit phit, so there is no distinction between wormhole and packet switching. The routers themselves are not pipelined, and a full packet of data moves one hop per clock cycle. The distance the packets must move across the network is minimized by intelligently mapping the bipartite graph onto the physical network, resulting in an average network distance half of that obtained with random placement.
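Deterministic XY routing resolves the X coordinate first and then the Y coordinate; the sketch below (coordinate convention and function name are ours) returns the hop sequence, whose length is the Manhattan distance that the mapping step tries to minimize.

```python
def xy_route(src: tuple, dst: tuple):
    # Deterministic XY routing on a 2D mesh: move along X until the column
    # matches, then along Y. One hop corresponds to one clock cycle in the
    # unpipelined routers. Returns the sequence of intermediate coordinates.
    (x, y), (dx, dy) = src, dst
    path = []
    while x != dx:                      # X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then Y
        y += 1 if dy > y else -1
        path.append((x, y))
    return path
```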

In a direct implementation of LDPC decoding on a NoC, the arriving analog signal is converted into llr values by the ADC, and the llr values are grouped into message blocks. The llr values from each message block are then packetized and sent to the bit nodes. Each output value during the decoding process is also contained in its own unique packet, giving a one-to-one ratio of header bits to data bits. In our architecture, two blocks are decoded in parallel, allowing 66% of the network traffic to be useful data.

Both message blocks are decoded using the same H-matrix (and thus the same communication pattern), allowing for increased device throughput at the cost of a slightly increased decode latency for every other packet. This behavior is shown in Figure 2 below.

[Figure 2: the incoming message stream passes through the ADC, is packetized as a header (H) plus two data values (D1, D2), and is sent to the bit nodes; Message 1 and Message 2 are decoded in interleaved fashion.]

Figure 2: Message Decoding Behavior

2.4 Bit Node PE Architecture

[Figure 3 shows the bit node datapath in four pipeline stages: Packet Decoder (PD), Data Concentrator (DC) with Configuration Memory (CM), Execution Unit (EU) with dual 16-bit llr inputs, Header Flow Table (HFT), and Packet Generator (PG), with valid/ready handshaking between stages.]

Figure 3: Bit Node Architecture

The bit node architecture is shown in detail in Figure 3. The interface to the network consists of a single 48-bit input port and a single 48-bit output port, each capable of transmitting an entire packet per clock cycle. As each packet arrives, the data concentrator directs values to the appropriate accumulator, based on the virtual identification information in the packet header. Once all input values for a given virtual bit node are received, the computation proceeds to the execution unit, where the individual llr values are subtracted from the accumulated sum, as in Equation 2. Finally, the source information contained in the header is used to index the header flow table, generating the header for the outgoing packet containing the new llr. Note the parallel 16-bit datapath throughout the bit node, supporting the lockstep computation of two messages simultaneously.

2.5 Check Node PE Architecture

The check node architecture is shown in Figure 4. The check node has to compute the new llrs based on Equation 1. The data flow through the check node is similar to that of the bit node, with the most obvious difference being the addition of four computational units (CUs), which perform the bilinear transform in the logarithmic domain. The first two CUs, located just as the packet arrives, perform the operation on each individual incoming data value, allowing the accumulators in stage four to perform the summation ΣB(llr_i). After all values have been received for a given virtual check node, the execution unit subtracts away the individual B(llr) values and performs the bilinear transform in the logarithmic domain once more, completing the operations found in Equation 1. Since each output llr_x is generated in the same order as the packet associated with x arrived, the stored headers are scrolled out and used as inputs to the header flow table, allowing each packet to be addressed properly to return to the appropriate bit node. Again, note that the node supports the simultaneous decoding of two independent message blocks, allowing for lockstep operation.
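The ROM-based LUT used for the bilinear transform can be approximated in software as below. The entry count and input range are our assumptions, as the paper does not give the table's dimensions.

```python
import math

def exact_B(x: float) -> float:
    # B(llr) = ln((e^llr + 1) / (e^llr - 1)), for positive llr magnitudes.
    return math.log((math.exp(x) + 1.0) / (math.exp(x) - 1.0))

def build_b_lut(entries: int = 256, max_llr: float = 8.0):
    # Precompute B at the midpoint of each quantization bin, emulating a ROM.
    step = max_llr / entries
    return [exact_B((i + 0.5) * step) for i in range(entries)], step

def b_lookup(lut, step, x: float) -> float:
    # Index by quantized magnitude; saturate at the last entry.
    idx = min(int(abs(x) / step), len(lut) - 1)
    return lut[idx]
```

Since B is monotonically decreasing, a coarser table trades a small llr error (and hence a small BER penalty) for a smaller ROM.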

[Figure 4 shows the check node datapath in eight pipeline stages: packet decoder and data redirector, header and data FIFOs, four computational units (CU 1–4) performing the bilinear transform, accumulators, header buffer and header flow table, execution unit, arbiter, and packet generator, under common control logic for clock and data flow management.]

Figure 4: Check Node Architecture

3. Experimental Platform and Results

The first step in evaluating our architecture was to obtain the H matrices for the desired LDPC codes; for this we used the LDPC software implementation of [10] to generate multiple H matrices based on inputs such as desired block size, regular/irregular codes, and code rate. Given that our architecture is designed to accommodate multiple types of LDPC codes, we generated matrices for both regular and irregular LDPC codes. Next, we designed a 5x5 2D mesh network, consisting of 16 physical bit nodes and 9 physical check nodes as the PEs, in Verilog and synthesized it using a commercial 160nm cell library.

Each PE supports up to 64 virtual nodes, allowing simultaneous on-chip decoding of 1024 bits. The chosen H matrices, however, were of various column sizes, so the block decoding size varied from 512 to 1024 bits. The synthesized design operates on a 500MHz clock at 1.8V Vdd and occupies an area of ~110 mm².

In order to verify the functionality of our design, we used a cycle-accurate NoC simulator, NOCSim [11], to model our design and perform cycle-accurate simulation that captures both the initialization and computation phases of the decoder using the input H-matrix. We compared the decoded output from our simulated design against that obtained using the software implementation of [10] to obtain the bit error rates. This simulation also provides us with the number of cycles required to complete the decoding. When combined with the clock frequency of our synthesized designs, the proposed architecture is determined to provide a decoded bit throughput of 750 Mbps when a maximum number of iterations is not set (i.e., the bit nodes have to converge to positive or negative infinity). This is, however, a pessimistic approach; setting a maximum iteration count and determining the bit value from the ratio at the point where the maximum iterations are completed gives a higher throughput. The number of iterations is related to the accuracy of the computation, as convergence is not guaranteed with a lower number of iterations. For low-noise channels, it is desirable to limit the number of iterations, as it results in a higher throughput, with the BER not as affected as in high-noise channels.

Next, we used the binary data patterns captured from the cycle-accurate NOCSim simulation as input patterns to our synthesized design and simulated it using Power Compiler to obtain the power consumption values. For the irregular LDPC code, the average power consumption in decoding 1024 bits is 34.8 W, a significant portion (43%) of which is consumed in the on-chip interconnect. The check nodes (23%) also consume more power than the bit nodes (22%), as expected given their larger pipeline length and computational complexity. Approximately 12% of the chip power is consumed as leakage power.

4. Power Optimization

4.1 Power Driven Message Encoding

Represented Value | Binary data to be transmitted | Bits S1-S2 | Bits 0-15 (Transmitted Data)
Zero | 0000000000000000 | 00 | &lt;previous data value&gt;
+infinity | 0111111111111111 | 01 | &lt;previous data value&gt;
All others | XXXXXXXXXXXXXXXX | 10 | &lt;new data value&gt;
-infinity | 1111111111111111 | 11 | &lt;previous data value&gt;

Figure 5a: Message Encoding Scheme.

[Figure 5b: within each router, the S1/S2 bits accompany the header and 16-bit data payload through the crossbar (X-BAR); the router control logic examines S1/S2 in parallel with the routing operation.]

Figure 5b: Encoding scheme as applied in NoC routers.

The LDPC algorithm involves a series of messages passed between the nodes, eventually resulting in each computed log likelihood ratio converging to either positive or negative infinity.

Implementing this in hardware results in a large number of values passed between nodes being truncated to either zero, negative infinity, or positive infinity. Between 25% and 40% of the total data values passed between the nodes are either zero or ±infinity.

This implies high switching activity on the interconnect, as the most commonly transferred values are the complement of each other. We can, however, take advantage of this LDPC behavior by encoding these values using 2 additional bits and avoiding switching activity when these values are transmitted. We propose the introduction of two additional bit lines within the data bits, which we label S1 and S2. Leaving the data bit lines untouched, at the output of each node (the packet generator), we check whether the result is zero or ±infinity. If so, we do not touch the message to be transmitted; instead we only set S1 and S2 to the corresponding value as shown in Figure 5a, leaving the output buffers, and thus the interconnect, at the previous value. The node output port changes only two of the bits, so the switching activity is limited to those two bits. The scheme is applied across the network; routers which propagate the messages are responsible for also applying the scheme. Each router, in parallel with determining the destination PE (i.e., the routing operation), looks at the two bits. In the case of an encoded message, the router simply forwards the two bits to the destination, leaving the rest of the data bit lines unaffected. Figure 5b shows the router hardware implication.
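Functionally, the encoding of Figure 5a behaves as in this sketch, where the 16 data lines hold their previous value whenever a special value is signalled on S1/S2. The constant names are ours; the extreme sign-magnitude words follow Figure 5a.

```python
ZERO, POS_INF, NEG_INF = 0x0000, 0x7FFF, 0xFFFF  # 16-bit sign-magnitude extremes

def encode(value: int, prev_lines: int):
    # Power-aware encoding (Figure 5a): special values set only the two
    # S1S2 bits and leave the 16 data lines at their previous value,
    # limiting switching activity to two wires.
    if value == ZERO:
        return 0b00, prev_lines
    if value == POS_INF:
        return 0b01, prev_lines
    if value == NEG_INF:
        return 0b11, prev_lines
    return 0b10, value                  # ordinary values drive new data

def decode(s: int, lines: int) -> int:
    # The receiver reconstructs the value from S1S2, ignoring stale data lines.
    return {0b00: ZERO, 0b01: POS_INF, 0b11: NEG_INF}.get(s, lines)
```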

We simulated the proposed scheme using the same procedure detailed in Section 3, and found that this optimization reduces the total power consumption from 34.8 W to 30.36 W, a power saving of 12.75%.

4.2 Early Termination

The LDPC algorithm is based on the iterative convergence of a bit probability towards zero or one [1]. A particular property of the LDPC algorithm is its convergence pattern. Convergence patterns have been studied extensively, and several studies deal with theoretical limits and suggest particular ways to terminate [12]. The convergence patterns can be more deterministic due to the finite precision used in our hardware implementation, and can provide useful ground for optimization.

[Figure 6 plots the bit likelihood ratio (0 to 1) against the number of iterations (1 to 17) for four cases: convergence to 1 (Cases 1 and 2) and convergence to 0 (Cases 1 and 2). Starting from the initial ratios, the convergence patterns separate after a number of iterations and settle at the final ratios.]

Figure 6: LDPC Convergence patterns

Figure 6 shows the convergence patterns for four different cases, where the initial probability leads to a convergence towards zero or towards one. After a set of initial iterations where the outcome is not determined (approximately 10), the convergence patterns start to separate towards either zero or one. This property can be exploited in our implementation to allow a PE to produce its result sooner than it otherwise would. The iteration count at which convergence cannot yet be determined depends on the amount of noise present in the communications channel.

A low-noise channel causes the convergence patterns to separate faster, whereas higher noise levels result in a larger iteration count. Determining the optimal iteration for a pattern to converge can be done by keeping track of the number of iterations for previously computed bits. The hardware to do this operation is simple. For each virtual bit node mapped onto a physical bit node, we need to detect a pattern whose outcome can be determined by looking at the previous ratios in that pattern. We do that by counting the number of incoming llr values which fall under (or over) a certain threshold. Our simulations have shown that when five consecutive incoming llr values are under 0.25 or over 0.75, we can set the convergence outcome immediately without the need to iterate further. When convergence has been determined, the bit node pipeline can be bypassed, as the bit node simply needs to send the determined bit value to all appropriate check nodes. Reducing the number of iterations (and hence computations) results in both power savings and increased throughput, with the only drawback being a slight (less than 1%) increase in the BER. Also, as the converged bit values fall within the values described in Section 4.1 (0, ±infinity), the message encoding scheme discussed in Section 4.1 applies, resulting in further savings. When we combine this technique with the message encoding scheme, there is a 30% reduction in system power consumption (24.32 W) as compared to the non-optimized implementation of Section 3 (34.8 W). The hardware overhead to achieve both optimizations is less than 5%, while the BER remains largely unaffected. Note that while the BER and system power are based on a specific channel with an SNR of 0.5, a particular H matrix, and a particular incoming message, the optimizations we discuss apply to all LDPC codes and incoming signals. Using the early termination approach, we can also reduce leakage power consumption by turning off the power supply to the computation pipelines of check nodes and bit nodes which are not computing.
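The early-termination rule (five consecutive ratios under 0.25 or over 0.75) can be sketched as below. Following Figure 6, we assume a ratio near 1 decides the bit as 1 and a ratio near 0 as 0; the function name and return convention are ours.

```python
def early_termination(ratios, low=0.25, high=0.75, run=5):
    # Scan the per-iteration bit likelihood ratios; once `run` consecutive
    # values fall below `low` (or above `high`), declare convergence to 0
    # (or 1) and stop iterating early.
    # Returns (decided_bit_or_None, iterations_used).
    below = above = 0
    for i, r in enumerate(ratios, start=1):
        below = below + 1 if r < low else 0   # reset the run on any outlier
        above = above + 1 if r > high else 0
        if below == run:
            return 0, i
        if above == run:
            return 1, i
    return None, len(ratios)
```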

5. Conclusions

The paper presents the design of a high throughput LDPC decoder architecture based on a scalable network-on-chip interconnect. The decoder maintains a low BER even through noisy channels, and has the ability to support multiple LDPC codes of varying types. The use of NoC design allows for a high performance scalable interconnect that offers dynamic communication in highly parallel architectures, and provides a convenient platform for LDPC decoding. As the decoding of other graph based codes, such as turbo codes, is also interconnect driven, we plan to extend our work into the design of other reconfigurable low-power decoders. By developing methods of increasing efficiency of the interconnect structures, such as those presented in this work, highly advanced error protection can be extended into a large number of application domains.

6. References

[1] R. G. Gallager, "Low-Density Parity-Check Codes", IRE Transactions on Information Theory, Jan. 1962, pp. 21-28.

[2] D. MacKay and R. Neal, "Near Shannon limit performance of low density parity check codes", Electronics Letters, vol. 33, no. 6, March 1997, pp. 457-458.

[3] E. Yeo, B. Nikolic and V. Anantharam, "Architectures and Implementations of Low-Density Parity-Check Decoding Algorithms", invited paper at IEEE IMSCS, Aug. 4-7, 2002.

[4] M. M. Mansour and N. R. Shanbhag, "Low-Power VLSI Decoder Architectures for LDPC Codes", Proc. of the 2002 ISLPED, pp. 284-289.

[5] L. W. Lee and A. Wu, "VLSI implementation for low density parity check decoder", Proc. of the IEEE ICECS 2001, vol. 3, Sept. 2-5, 2001, pp. 1223-1226.

[6] C. Howland and A. Blanksby, "A 690mW 1Gb/s 1024-Bit Rate-1/2 Low Density Parity Check Code Decoder", Proceedings of the 2001 IEEE CICC, San Diego, CA, May 2001, pp. 293-296.

[7] B. Levine et al., "Implementation of Near Shannon Limit Error-Correcting Codes using Reconfigurable Hardware", Proceedings of the IEEE FPCCM 2000, pp. 217-226.

[8] T. Theocharides et al., "Evaluating Alternative Implementations for the LDPC Check Node Function", Proceedings of the IEEE ISVLSI, February 2004.

[9] L. Benini and G. De Micheli, "Networks on chips: a new SoC paradigm", IEEE Computer, vol. 35, pp. 70-78, January 2002.

[10] R. M. Neal, "Software for Low Density Parity Check Codes", ftp://ftp.cs.utoronto.ca/pub/radford/LDPC-2001-11-18/index.html

[11] NOCSim, http://www.ece.cmu.edu/~djw2/NOCsim/

[12] S. ten Brink, "Convergence Behavior of Iteratively Decoded Parallel Concatenated Codes", IEEE Transactions on Communications, vol. 49, no. 10, October 2001, pp. 1727-1737.

