Performance Comparison with Convolutional Codes

Chapter 3 High-Speed Communication Systems with LDPC Codes

3.2 Error-Correcting Performance of LDPC Codes in UWB System

3.2.3 Performance Comparison with Convolutional Codes

In Fig. 3.8, the performance of the LDPC codes is compared with that of the 64-state convolutional coded system proposed in [23], where two different rates obtained by puncturing the R = 1/3 convolutional code are selected as references. The results show that both LDPC codes outperform the punctured convolutional codes with only 8 iterations. The short block length and the small number of decoding iterations will facilitate high-speed implementation.

[Figure: (a) BER and (b) PER versus SNR [dB] for the R = 3/4 convolutional code, the R = 5/8 convolutional code, the (600,450) LDPC code with 8 iterations and the (1200,720) LDPC code with 8 iterations; a PER = 8% level is marked in (b).]

Fig. 3.8 Performance comparison for different codes

Chapter 4 Architectures of Proposed LDPC Code Decoders

The architectures of the proposed LDPC code decoders for two different LDPC codes, Code I and Code II, will be introduced in this chapter. Basic functional units, data flow rescheduling and memory arrangement methods will be discussed in detail. The measurement results of the proposed LDPC code decoder chips and a comparison with the state-of-the-art designs will also be listed. The specifications of Code I and Code II are summarized in Table 4.1.

Table 4.1 Summary of the two LDPC codes

                    Code I       Code II
Block length        600          1200
Information bits    450          720
Code rate           3/4          3/5
Code structure      Irregular    Irregular
Column weight       3            3
Row weight          11~14        7~9

4.1 Introduction to the Conventional Design

Based on the decoding algorithm, the block diagram of the conventional LDPC code decoder is shown in Fig. 4.1. The bit node unit (BNU) is dedicated to the vertical step, while the check node unit (CNU) is used for the horizontal step. The BNU (or CNU) reads and processes the messages stored in the memory bank, and writes them back into the memory bank after updating. A large number of combinational feedback paths exist between the CNU (or BNU) and the memory unit, leading to complex signal routing as well as degradation of the decoding speed in the VLSI implementation.

[Figure: BNUs and CNUs connected through the memory bank, with the channel value as input.]

Fig. 4.1 Block diagram of conventional LDPC code decoder

The conventional architecture of the CNU which is based on the LLR-SPA in (2.33) is shown in Fig. 4.2(a). The look-up tables (LUT) are used to implement the hyperbolic tangent (tanh) and inverse hyperbolic tangent (tanh^-1) functions.

The CNU can be implemented based on the min-sum algorithm as shown in Fig. 4.2(b) to reduce the hardware cost. As described in (2.38), the operations in the CNU can be divided into two parts: the sign evaluation and the minimum absolute value searching. The minimum absolute values are searched by k comparators, each of which has k-1 inputs (CMP-(k-1)), where k is the row weight of the parity check matrix.

[Figure: (a) LUT-based CNU built from LUT-1 and LUT-2 blocks and a summation; (b) min-sum CNU built from CMP-(k-1) comparators and sign bit evaluation.]

Fig. 4.2 Architecture of conventional CNU based on: (a) LLR-SPA and (b) min-sum algorithm

The conventional BNU architecture with k inputs is shown in Fig. 4.3, where SUM-(k-1) is used to sum up k-1 values. Note that the BNU design is the same for the LLR-SPA and the min-sum algorithm.

[Figure: SUM-(k-1) blocks combining the incoming messages with the channel value.]

Fig. 4.3 Architecture of conventional BNU

4.2 Proposed LDPC Code I Decoder Design

LDPC code decoders are inherently parallel because the check node updates, as well as the bit node updates, are independent of one another; the throughput can therefore be improved by a linear increase of the hardware cost. However, the full-parallel implementation [9] is not area-efficient for a system chip design. Therefore the partial-parallel architecture is employed in the proposed decoders to reduce circuit complexity according to the system requirements. In time-division multiplexing mode, the partial-parallel LDPC code decoders map a certain number of check nodes or bit nodes onto a single processing unit. Extra decoding latency is produced as compared with the full-parallel implementations. Thus a trade-off is made between the decoding speed and the hardware complexity. Besides, to reduce the hardware cost, the min-sum algorithm is chosen to implement the proposed design while keeping the system performance.

Fig. 4.4 presents the architecture of the proposed LDPC Code I decoder, which contains the distributor, memory unit, switch groups, CNUs and BNUs. Since the irregular parity check matrix H has a fixed column weight (= 3), the total weight of the parity check matrix is 600 × 3 = 1800. To implement the decoder in a partial-parallel mode, the check nodes in the corresponding bipartite graph are partitioned into three parts, and the bit nodes are divided into four parts as shown in Fig. 4.5, where every three check nodes share a single CNU, and every four bit nodes share a single BNU. Therefore 150/3 = 50 CNUs and 600/4 = 150 BNUs are required in the proposed design. The switch groups in Fig. 4.4 are used to select which part of the check nodes or bit nodes is under operation.
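The partition arithmetic above can be checked with a few lines of Python. This is only a sanity check of the numbers quoted in the text and in Table 4.1; the variable names are illustrative, and the number of check nodes is taken as block length minus information bits, which assumes H has full rank.

```python
# Sanity check of the Code I partition arithmetic (values from Table 4.1 and the text).
block_length = 600      # bit nodes, i.e. columns of H
info_bits = 450
column_weight = 3       # every column of H has weight 3

# Assumption: H has full rank, so rows of H = block length - information bits.
check_nodes = block_length - info_bits
total_weight = block_length * column_weight   # nonzero entries of H
cnu_count = check_nodes // 3                  # three check nodes share one CNU
bnu_count = block_length // 4                 # four bit nodes share one BNU
avg_row_weight = total_weight / check_nodes

print(check_nodes, total_weight, cnu_count, bnu_count, avg_row_weight)
```

The average row weight obtained this way lies inside the 11~14 range listed in Table 4.1.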

Fig. 4.4 The architecture of LDPC Code I decoder

[Figure: parity check matrix H with the CNU sets along the rows and the BNU sets b1 to b4 along the columns.]

Fig. 4.5 The partition for parity check matrix H of Code I

Due to the random-like connections in the bipartite graph, the signal routing problem causes serious difficulties in the decoder implementation. As shown in Fig. 4.1, the combinational feedback paths lead to degradation of the decoding speed and to routing area overhead in the VLSI implementation. In the proposed design, pipeline registers are inserted in the CNUs and BNUs to cut off those feedback paths as illustrated in Fig. 4.6. Thus, a shorter critical path delay and reduced routing congestion can be achieved with a small increase in the hardware cost.

[Figure: CNU and BNU connected through the memory bank, with flip-flops splitting the data path into CNU-PATH 1, CNU-PATH 2, BNU-PATH 1 and BNU-PATH 2.]

Fig. 4.6 Data path of proposed partial-parallel decoder

4.2.1 Channel Value Interconnection

For the conventional design in Fig. 4.1, both the CNUs and the BNUs have to be connected to the channel values, which leads to a large number of signal connections. Thus the data rescheduling shown in Fig. 4.7 is proposed to solve this problem.

Fig. 4.7 Proposed LDPC decoding flow

As shown in Fig. 4.7, one extra vertical step is employed to replace the initialization through the CNUs. Recall that in equation (2.34) only summations among the channel value L(x_i) and the messages L^{C→B}(e_ij) are performed in the BNUs. If the messages L^{C→B}(e_ij) are set to zero during initialization, the channel values are thus loaded into the memory through the BNUs, and fed to the CNUs for the first horizontal step. In this scheme, only the BNUs have to be connected to the channel values as illustrated in Fig. 4.4, leading to lower signal routing costs at the expense of some additional decoding latency.

Fig. 4.8 gives the timing diagram of the proposed LDPC Code I decoder, where bi and ci correspond to the active BNU and CNU sets in Fig. 4.5. The design takes nine cycles to complete a decoding iteration, including 4 cycles for the horizontal steps with the CNUs and 5 cycles for the vertical steps with the BNUs. An additional five cycles are used to complete the channel value loading as described above. Thus a total of 9 × 8 + 5 = 77 cycles are required to finish the decoding process of a codeword with 8 iterations.
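The cycle count can be written as a small helper, which makes it easy to see how the latency scales with the number of iterations. The function name and its default arguments are illustrative; the defaults are simply the per-step cycle counts quoted above.

```python
def decoding_cycles(iterations, cnu_cycles=4, bnu_cycles=5, load_cycles=5):
    """Cycles to decode one codeword: one channel-value load plus `iterations` full iterations."""
    cycles_per_iteration = cnu_cycles + bnu_cycles   # horizontal step + vertical step
    return iterations * cycles_per_iteration + load_cycles

print(decoding_cycles(8))   # 8 iterations, as used for Code I
```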

[Figure: timing diagram showing the channel value loading followed by iteration #1.]

Fig. 4.8 Timing diagram of the proposed LDPC Code I decoder

4.2.2 Check Node Unit

As shown in Fig. 4.2(b), k comparators, each of which searches for the minimal value among k-1 inputs, are needed to implement the CNU based on the min-sum algorithm. As mentioned in [18], equation (2.38) can be modified into (4.1), in which the magnitude of each outgoing message is either "min" or "2nd min" of the input magnitudes, where "min" denotes the minimal value and "2nd min" denotes the value which is smaller than all the other candidates except the minimal one. According to (4.1), the absolute value searching has to be performed only once to find the minimum and the second minimum. Fig. 4.9 shows the block diagram of the compare-select unit (CS14) which searches for the minimal and the second minimal values from 14 inputs.
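The idea of replacing k separate minimum searches by one search for the minimum and the second minimum can be sketched in Python. This is a behavioural sketch of the standard min-sum check node update, not the thesis's equation (4.1) itself; the function names are hypothetical, messages are plain signed integers, and a zero message is treated as positive.

```python
def cnu_two_min(inputs):
    """Check node update using one pass that finds the minimum and second minimum magnitude."""
    min1 = min2 = float("inf")
    min_idx = -1
    for idx, value in enumerate(inputs):
        mag = abs(value)
        if mag < min1:
            min2 = min1
            min1, min_idx = mag, idx
        elif mag < min2:
            min2 = mag
    # Parity of all input signs; each output removes its own input's sign again.
    neg_total = sum(v < 0 for v in inputs) % 2
    outputs = []
    for idx, value in enumerate(inputs):
        mag = min2 if idx == min_idx else min1   # the edge holding the minimum gets the 2nd minimum
        neg = neg_total ^ (value < 0)
        outputs.append(-mag if neg else mag)
    return outputs


def cnu_reference(inputs):
    """Reference form: each output is computed from the other k-1 inputs directly."""
    outputs = []
    for idx in range(len(inputs)):
        others = inputs[:idx] + inputs[idx + 1:]
        mag = min(abs(v) for v in others)
        neg = sum(v < 0 for v in others) % 2
        outputs.append(-mag if neg else mag)
    return outputs


messages = [3, -1, 4, -1, 5, -9, 2, 6]
print(cnu_two_min(messages) == cnu_reference(messages))
```

Both functions return the same messages, but the first needs only one magnitude search per check node, which is the saving the text describes.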

Fig. 4.9 Block diagram of CS14

Because the row weight of Code I ranges from 11 to 14, CNUs dealing with different numbers of inputs have to be designed. In this section, only the 14-input CNUs are introduced; the others are designed in an analogous way. The detailed architecture of CMP-14 in Fig. 4.9 is illustrated in Fig. 4.10, which consists of pipeline registers and two kinds of comparators: CMP-2 and CMP-4. CMP-4 finds the minimal and the second minimal values from the four inputs a, b, c and d. In addition, CMP-2 is a two-input comparator which is much simpler than CMP-4.

[Figure: CMP-4 built from six subtractors (SUB) on the 5-bit input pairs (a, b), (a, c), (a, d), (b, c), (b, d) and (c, d); their 1-bit outputs MSB0 to MSB5 drive a decoder that selects the 5-bit "min" and "2nd min" outputs.]

Fig. 4.10(a) Block diagram of proposed CMP-4

Fig. 4.10(b) Block diagram of proposed CMP-14
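A CMP-4 built from six pairwise comparisons can be modelled behaviourally as follows. The decoder logic of Fig. 4.10(a) is not recoverable from the text, so the selection rule below (count how many pairwise comparisons each input wins) is an assumption, as are the function name and the tie-breaking rule.

```python
from itertools import combinations

def cmp4(a, b, c, d):
    """Return (min, 2nd min) of four magnitudes using only the six pairwise comparison bits."""
    vals = (a, b, c, d)
    # One bit per subtractor: True when vals[i] - vals[j] is negative.
    less = {(i, j): vals[i] < vals[j] for i, j in combinations(range(4), 2)}
    # Count wins; on a tie the bit is False, so the higher index wins (assumed tie-break).
    wins = [0, 0, 0, 0]
    for (i, j), i_is_smaller in less.items():
        wins[i if i_is_smaller else j] += 1
    # The win counts are always a permutation of 3, 2, 1, 0.
    return vals[wins.index(3)], vals[wins.index(2)]

print(cmp4(7, 3, 9, 5))
```

All six comparisons can be evaluated in parallel, so the selection depends only on six sign bits rather than on a chain of sequential comparisons.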

The proposed architecture of the 14-input CNU is shown in Fig. 4.11, where SM14 is the sign-multiplication unit. To facilitate the operations on the sign and the absolute value, all the 6-bit values are represented in sign-magnitude notation with 2 integer bits and 4 fractional bits. The combinational path in the CNUs is split into CNU-PATH1 and CNU-PATH2 by the pipeline registers, leading to a shorter critical path delay and reduced routing congestion.

Fig. 4.11 The proposed 14-input CNU architecture

Table 4.2 lists the comparison of three different CNU architectures. The LUT-1 and LUT-2 in Fig. 4.2(a) are implemented in 6-bit precision, including 2 integer bits and 4 fractional bits. The proposed CNU has the smallest size, only about 22% of the others, whereas its maximum achievable operating speed is only slightly lower than that of the conventional min-sum (MS) design. Due to the fixed-point implementation, some performance loss is produced. As a result, the decoder can be implemented efficiently by using the proposed CNU architecture.

Table 4.2 Comparison of different CNU architectures

                    LUT            Conv. MS       Proposed
                    Fig. 4.2(a)    Fig. 4.2(b)    Fig. 4.11
Max. speed          162 MHz        261 MHz        250 MHz
Gate count          7.16 K         6.86 K         1.6 K
Total gate count    358 K          343 K          80 K

4.2.3 Bit Node Unit

Fig. 4.12 shows the block diagram of the BNU. According to equations (2.34) and (2.35), the BNUs receive the channel value and the message values linked to the same bit node. All inputs in sign-magnitude (SM) notation are converted to 2's complement (TC) representation, and summed to perform the updating calculation. The pipeline registers are inserted to break the critical paths into BNU-PATH1 and BNU-PATH2 as in the CNUs. Finally, all the values are converted back to the SM notation and clipped to avoid overflow. The most significant bit (MSB) of the summation of the three input messages and the channel value is used to decide the estimated codeword bit.

All the 6-bit values are quantized with 2 integer bits and 4 fractional bits, while the intermediate summations are represented with 4 integer bits and 4 fractional bits.

Fig. 4.12 The proposed BNU architecture

Note that if C1, C2 and C3 are set to be zero during initialization, the channel value will be directly bypassed to the outputs of BNU. This produces a path to load the channel values into the memory as mentioned above.
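The BNU data flow (SM to TC conversion, summation, clipping, conversion back to SM) and the bypass behaviour with zeroed messages can be sketched in Python. This is a behavioural model under stated assumptions: each 6-bit value is modelled as a (sign, magnitude) pair with 1 sign bit and 5 magnitude bits, magnitudes are integers in LSB units, and each outgoing message is obtained as the total sum minus the corresponding incoming message, which equals the channel value plus the other two messages. The function names are hypothetical.

```python
MAG_BITS = 5                      # assumed split: 1 sign bit + 5 magnitude bits
MAX_MAG = (1 << MAG_BITS) - 1     # largest magnitude after clipping

def sm_to_int(sign, mag):
    """Sign-magnitude pair to a signed integer (stands in for the TC representation)."""
    return -mag if sign else mag

def int_to_sm(value):
    """Signed integer back to a sign-magnitude pair, clipping the magnitude to avoid overflow."""
    return (1 if value < 0 else 0, min(abs(value), MAX_MAG))

def bnu(channel, c1, c2, c3):
    """Return the three outgoing messages and the MSB (sign) of the full sum."""
    ch = sm_to_int(*channel)
    msgs = [sm_to_int(*m) for m in (c1, c2, c3)]
    total = ch + sum(msgs)
    outgoing = [int_to_sm(total - m) for m in msgs]   # exclude each edge's own message
    msb = 1 if total < 0 else 0                       # used to decide the estimated bit
    return outgoing, msb

zero = (0, 0)
print(bnu((1, 10), zero, zero, zero))   # zeroed messages: the channel value passes through
```

With C1, C2 and C3 set to zero, every output equals the channel value, which is the loading path described in the note above.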

4.2.4 Chip Implementation

The proposed LDPC Code I decoder was implemented within an LDPC-COFDM UWB baseband transceiver chip [25] in a 0.18 µm 1P6M standard CMOS process. The chip micrograph of the entire UWB transceiver, including the OFDM modem and the LDPC codec, is given in Fig. 4.13. The encoder die size is 2.25 mm², while the decoder die size is 16.5 mm². The total gate count of the LDPC codec is 542 K, of which 70 K is for the encoder and 472 K is for the decoder.

The chip has been tested to verify the functional correctness. The measured maximal data rate of the decoder is 480 Mb/s while working at 82.1 MHz, and consuming 232 mW.
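The measured figures can be cross-checked against the cycle count of Sec. 4.2.1. The sketch below assumes that the data rate counts information bits only and that each codeword occupies the decoder for the full 77 cycles with no overlap between codewords; both are assumptions, and the function name is illustrative.

```python
def info_throughput_mbps(info_bits, cycles_per_codeword, clock_mhz):
    """Information throughput in Mb/s when one codeword is decoded every `cycles_per_codeword` cycles."""
    return info_bits * clock_mhz / cycles_per_codeword

rate = info_throughput_mbps(info_bits=450, cycles_per_codeword=77, clock_mhz=82.1)
print(round(rate, 1))
```

Under these assumptions the result is within about 0.2 Mb/s of the reported 480 Mb/s.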

The detailed chip features are also summarized in Table 4.3.

[Figure: die micrograph with the OFDM modem, LDPC encoder and LDPC decoder regions marked.]

Fig. 4.13 Die micrograph of the LDPC-COFDM UWB transceiver chip

Table 4.3 Summary of the LDPC Code I Chip

Technology           Standard 0.18-µm CMOS 1P6M
Package              CQFP-208
Supply voltage       1.8 V core, 3.3 V I/O
Chip size            Encoder: 1.5 mm × 1.5 mm
                     Decoder: 5.0 mm × 3.5 mm
Gate count           Encoder: 70 K
                     Decoder: 472 K
Power dissipation    232 mW @ 82.1 MHz
Maximum data rate    480 Mb/s

4.3 Proposed LDPC Code II Decoder Design

In Sec. 4.2, the proposed LDPC Code I decoder design was introduced and silicon-proven to achieve a maximum data rate of 480 Mb/s. The performance of the LDPC Code I decoder is acceptable for the MB-OFDM UWB system [23], but may not be sufficient for the other high-speed communication systems mentioned in Chap. 3. As a result, the LDPC Code II decoder is proposed to obtain better error-correcting ability and higher decoding throughput.

Considering circuit complexity, the 480 × 1200 parity check matrix H of LDPC Code II is divided into four 240 × 600 sub-matrices to fit the partial-parallel architecture, as shown in Fig. 4.14. Since the matrix H of Code II has a fixed column weight (= 3), its total weight is 1200 × 3 = 3600. Based on this partition, the functional units in the decoder process 1800 messages every cycle.

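[Figure: H partitioned into four sub-matrices, H = [h00 h01; h10 h11]; CNU Set 1 and CNU Set 2 label the two block rows, and BNU Set 1 and BNU Set 2 label the two block columns.]

Fig. 4.14 The partition of parity check matrix H of Code II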

The proposed LDPC Code II decoder architecture illustrated in Fig. 4.15 contains the input buffer, 240 CNUs, 600 BNUs and two dedicated message memory units (MMU). The sets of data processed by the CNUs are {h00, h01} and {h10, h11}, whereas the data fed into the BNUs are {h00, h10} and {h01, h11}. Note that two MMUs are employed to process two different codewords concurrently without stalls. Therefore, the LDPC decoder is not only area-efficient, but its decoding speed is also comparable to that of the fully parallel architecture. The details of the MMU designs are introduced in the following sections.

The input buffer is a storage component that receives and keeps the channel values for iterative decoding. Note that it connects only to the BNUs, which reduces routing congestion as discussed in Sec. 4.2.1.

[Figure: decoder with the input buffer sub-blocks buf-0 to buf-3 and the sub-blocks labelled A and B.]

Fig. 4.15 The proposed LDPC code II decoder architecture

4.3.1 Input Buffer

The input buffer provides the channel values to the BNUs for iterative decoding. Because two different codewords are processed concurrently, a total of 1200 × 2 = 2400 symbols have to be stored in the input buffer. According to the partition in Fig. 4.14, the buffer is divided into four sub-blocks, where each sub-block contains 600 channel values. The conventional design is illustrated in Fig. 4.16. The four sub-blocks, buf-0 ~ buf-3, are all connected to the channel value inputs, and multiplexers are employed to switch the appropriate values into the BNUs. Thus the signal routings are all "global", meaning that all the connections are related to the inputs and outputs (I/O) of the buffer. The global connections and the multiplexers lead to serious routing congestion.

[Figure: buf-0 to buf-3 all connected to the channel value inputs and multiplexed to the BNU.]

Fig. 4.16 The conventional architecture of input buffer

Fig. 4.17 shows the buffer structure based on the register exchange (RE) approach and its operational timing diagram, where buf-0 is designed as a shift register that serially receives the channel values from the inputs, and the other three sub-blocks exchange data with buf-0 sequentially. The notations E1, E2 and E3 represent the data exchange from buf-0 to buf-1, buf-2 and buf-3, respectively. During initialization, buf-0 serially receives the channel values and, whenever it is full, passes them into the other sub-blocks by executing the operations E1, E2 and E3.

[Figure: buf-0 receives the channel value inputs, exchanges data with buf-1, buf-2 and buf-3 through E1, E2 and E3, and feeds the BNU.]

Fig. 4.17(a) The architecture of RE based input buffer

[Figure: timing diagram of the channel values C00, C01, C10 and C11 of codeword 0 and codeword 1 passing through buf-0.]

Fig. 4.17(b) The timing diagram of RE based input buffer

For this RE based buffer architecture, the global interconnections exist only in buf-0, and all the others are “local” among sub-blocks. However, the drawback is that a large number of multiplexers are required around buf-0 to perform E1 ~ E3. Thus buf-0 becomes a routing-critical block due to the multiplexers and the global interconnections.

To overcome this problem, an architecture based on register shifting (RS) is proposed as shown in Fig. 4.18(a), where the four sub-blocks are arranged in a ring. The buf-0 is a shift register that serially receives the channel values, and buf-3 transports the associated channel values to the BNUs. The timing diagram of the RS-based input buffer is presented in Fig. 4.18(b).

Channel values of two different codewords are serially fed into buf-0, and shifted within the buffer ring whenever buf-0 is full. Therefore, the data flow is further simplified and the multiplexers are eliminated, leading to simple signal transfer and routing interconnections.
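The ring behaviour can be illustrated with a toy Python model. It is only a simplified sketch: the exact schedule of Fig. 4.18(b) is not reproduced, the function names are hypothetical, and the model assumes that the whole content of each sub-block moves to the next sub-block in one shift.

```python
from collections import deque

def rotate_ring(ring):
    """Shift every sub-block one position along the ring: buf-0 -> buf-1 -> buf-2 -> buf-3 -> buf-0."""
    ring.rotate(1)

def load_serially(values, block_size, num_blocks=4):
    """Feed values serially into buf-0 and shift the ring each time buf-0 is full."""
    ring = deque([[] for _ in range(num_blocks)])   # ring[0] models buf-0
    for value in values:
        if len(ring[0]) == block_size:   # buf-0 is full: shift within the ring
            rotate_ring(ring)
            ring[0] = []                 # buf-0 accepts new serial data
        ring[0].append(value)
    return ring

ring = load_serially(range(8), block_size=2)
print(list(ring))   # the first block loaded has reached buf-3
rotate_ring(ring)
print(ring[3])      # after one more shift, the next block is at buf-3
```

In this model only buf-0 sees the inputs and only buf-3 drives the outputs, so each sub-block needs a connection to its neighbour and no multiplexer selects between sources.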

Fig. 4.18 The architecture and timing diagram of RS-based input buffer

Fig. 4.19 gives the comparison of the three input buffer architectures. The RS-based input buffer saves about 20% of the gate count and 30% of the interconnection wires as compared with the conventional design.

[Figure: bar chart of gate count and number of interconnections for the three designs.]

                              Conventional    RE       RS
Gate count                    83825           81830    67855
Number of interconnections    30000           24000    21000

Fig. 4.19 The comparison of three input buffer designs
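The quoted savings can be recomputed from the values read off Fig. 4.19. The pairing of each number with a design follows the order in which the labels appear in the figure, which is an assumption; the dictionary and function names are illustrative.

```python
# (gate count, number of interconnections) per design, as read from Fig. 4.19.
designs = {
    "Conventional": (83825, 30000),
    "RE": (81830, 24000),
    "RS": (67855, 21000),
}

def saving_percent(new, baseline):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline

base_gates, base_wires = designs["Conventional"]
for name in ("RE", "RS"):
    gates, wires = designs[name]
    print(name, round(saving_percent(gates, base_gates), 1), round(saving_percent(wires, base_wires), 1))
```

For the RS design this gives a gate count saving of roughly 19% and an interconnection saving of 30%, in line with the figures quoted in the text.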

4.3.2 Check Node Unit and Bit Node Unit

Fig. 4.20 shows the CNU architecture for the proposed LDPC Code II decoder. The CNU can be divided into two parts: one is the 1-bit sign-multiplication (SM) and the other is the 5-bit compare-and-select unit (CS) that searches for the minimal value and the second minimal value among the inputs. The new message for each bit node is a combination of the sign bit according to (4.1) and the new magnitude, which is either the "min" or the "2nd min" output of the CS unit. The detailed architecture of CMP-9 in Fig. 4.20 follows the design shown in Figs. 4.9 and 4.10.

The BNU architecture is illustrated in Fig. 4.21. According to (2.34) and (2.35), the BNU receives the channel value and the messages linked to the same bit node. All inputs in sign-magnitude (SM) notation are first converted to 2's complement (TC) representation, and then summed to perform the updating calculation. The summed values are also clipped to avoid overflow. Finally, the MSB of the summation of all the inputs is used to decide the estimated codeword bit.

Fig. 4.20 CNU architecture of proposed LDPC Code II decoder

Fig. 4.21 BNU architecture of proposed LDPC Code II decoder

4.3.3 Message Memory Unit

Message memory units (MMU) are used to store the message values that are generated by the CNUs or BNUs. The size of each MMU is 3600 × 6 bits due to the weight of the parity check matrix. To increase the decoding throughput, two MMUs are employed to concurrently process two different codewords in the decoder. The memory management strategies described below are based on multiplexers (MUX) or register exchange (RE), resulting in different levels of routing complexity. The MUX-based MMU architecture and its timing diagram are illustrated in Fig. 4.22.

[Figure: (a) MUX-based MMU with sub-blocks A, B, C and D; (b) timing diagram over iteration #i and iteration #(i+1) showing which sub-matrix data of codeword-0 and codeword-1 are output by MMU-0 and MMU-1.]

Fig. 4.22 The architecture and timing diagram of MUX-based MMU

According to the partition of the matrix H in Fig. 4.14, the MMU is divided into four sub-blocks: A, B, C and D. Many multiplexers are required for the inputs and outputs due to the partially parallel implementation and the concurrent processing of two different codewords. Moreover, all the signal interconnections are related to the I/O, leading to global routings. As a result, serious routing congestion occurs in the conventional MMU design.

To relieve the routing congestion problem, the architecture based on register exchange among four sub-blocks (RE-4B) is proposed as shown in Fig. 4.23. In this design, only sub-blocks B, C and D capture data from the data paths, and only sub-blocks A and C connect to the outputs. Thus most of the global routings are transformed into local interconnections between sub-blocks, leading to a simple data flow. Moreover, the number of multiplexers is also reduced by the RE-4B based architecture.

[Figure: sub-blocks A, B, C and D connected to the datapath.]

Fig. 4.23(a) The architecture of RE-4B based MMU

[Figure: timing diagram over iteration #i and iteration #(i+1) for codeword-0 and codeword-1 in MMU-0.]

Fig. 4.23(b) The timing diagram of RE-4B based MMU

To further improve the MMU design, the register exchange scheme based on five sub-blocks (RE-5B) is proposed as shown in Fig. 4.24(a). One extra sub-block E is used as temporary memory to reduce the interconnections between the other sub-blocks. In MMU-1,
